Cordra’s APIs enable users to issue queries to search across the managed digital objects based on certain criteria.
As of Cordra v2.1.0, search requests can be sent with a “queryJson” parameter. The value of the parameter should be a JSON object. Matching objects are those whose content matches all of the JSON provided in the “queryJson” parameter.
Example:
{
"queryJson": {
"name": "foo"
}
}
This will match any Cordra object the content of which has a top-level property “name”, the value of which contains the token “foo”. Deeper structure within queryJson can be used to match properties below the top level.
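As an illustration of deeper structure, the sketch below builds (without sending) a search request whose queryJson matches on a nested property. The host name and the "/search" path are assumptions made for this example, not values taken from this section; adjust them to your deployment.

```python
import json
import urllib.request

# Assumption: placeholder host; the "/search" path is also assumed here.
CORDRA_BASE_URL = "https://cordra.example.org"


def build_query_json_request(query_json: dict) -> urllib.request.Request:
    """Build, but do not send, a POST request carrying a queryJson search."""
    body = json.dumps({"queryJson": query_json}).encode("utf-8")
    return urllib.request.Request(
        url=CORDRA_BASE_URL + "/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Nested structure in queryJson targets properties below the top level.
request = build_query_json_request({"author": {"lastName": "Hardy"}})
print(request.data.decode("utf-8"))
```

Sending the request (for example with urllib.request.urlopen) and handling authentication are left out of this sketch.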
The remainder of this section instead discusses the more general “query” parameter, using examples to describe the query syntax.
Cordra uses the Lucene query syntax for search. Details of that syntax can be found at Lucene Query Syntax.
The examples below demonstrate the query syntax for fields in the following Cordra object that represents metadata about the book “Tess of the D’Urbervilles”.
{
"id": "test/72d1c8508991c7aa0a22362de8574f9c4a0fd28e7ac5bfb4002522b1b7aabafa",
"type": "Book",
"content": {
"title": "Tess of the D'Urbervilles",
"description": "Tess Durbeyfield is driven by family poverty to claim kinship with the wealthy D'Urbervilles and seek a portion of their family fortune.",
"author": {
"firstName": "Thomas",
"lastName": "Hardy"
},
"genre": [
"Victorian",
"Tragedy"
],
"publishers": [
{
"by": "James R. Osgood, McIlvaine & Co.",
"date": "1891"
},
{
"by": "Penguin Classics",
"date": "2003"
}
],
"language": "English",
"displayDateTime": "2022-08-08T09:00:00.000-04:00"
},
"acl": {
"writers": [
"test/xyz"
]
},
"userMetadata": {
"Foo": "Bar"
},
"metadata": {
"hashes": {
"alg": "SHA-256",
"content": "b919ebb1831b56df5ba4e3b6b649450561efb879ceacb954e0273393e6d9ad95",
"userMetadata": "424add9fc04ecc6d39b2c12ee958299e93fa55bd29f0b10cb65b2baefaeea402",
"full": "ddf3a4a1ef7bef667edb467f61ae1bb6de10795b4904a302f3401cef286c5a4d"
},
"createdOn": 1659978578983,
"createdBy": "admin",
"modifiedOn": 1659978578983,
"modifiedBy": "admin",
"txnId": 1569513092948005
}
}
The simplest type of query is one that contains one or more terms that are located anywhere in the object.
For example, the above object would be included in the results for the query:
Tess
Multiple terms can be combined in a space separated list:
Tess Durbeyfield Tragedy
These terms are combined with a logical OR such that the above query would match any object that contained any of these three terms.
Terms can be grouped together into phrases using double quotes. Here the query:
"family poverty"
would match our example object, but the query:
"poverty family"
would not, because the terms in the object are not in the same order as they are in the phrase query.
Note that terms will be tokenized on some non-whitespace characters. For example, the query foo-bar would match all of foo, bar, and bar-foo as well. If you only want to match foo-bar, you must explicitly wrap the query in quotes to make it a phrase query, e.g. "foo-bar".
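The behaviour of plain terms, phrases, and tokenization can be imitated with a small local model. This is only a sketch: the tokenizer below simply splits on anything that is not a letter or digit and lowercases the result, which is an assumption and not the analyzer Cordra actually uses.

```python
import re


def tokenize(text: str) -> list[str]:
    """Toy analyzer: split on non-alphanumeric characters and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def matches_any_term(query: str, text: str) -> bool:
    """Unquoted query: terms are ORed, so one shared token is enough."""
    return bool(set(tokenize(query)) & set(tokenize(text)))


def matches_phrase(phrase: str, text: str) -> bool:
    """Quoted query: the tokens must appear consecutively and in order."""
    wanted, found = tokenize(phrase), tokenize(text)
    n = len(wanted)
    return n > 0 and any(
        found[i:i + n] == wanted for i in range(len(found) - n + 1)
    )


description = "Tess Durbeyfield is driven by family poverty to claim kinship"
print(matches_any_term("Tess Durbeyfield Tragedy", description))
print(matches_phrase("family poverty", description))
print(matches_phrase("poverty family", description))
print(matches_any_term("foo-bar", "bar-foo"))
print(matches_phrase("foo-bar", "bar-foo"))
```

In this model the unquoted query foo-bar becomes the two ORed tokens foo and bar, which is why it also matches bar-foo, while the phrase form requires the tokens in order.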
Fields allow the search to be restricted to particular properties of digital objects. The JSON schema driven portion of the object called “content” is fielded using JsonPointers. These are slash separated paths into the JSON tree followed by a colon and then the term. The following example JsonPointer terms would match the example object:
/description:family
/title:Tess
date_metadata/createdOn:2022-08-08
The terms for the properties of sub-JSON-objects are defined with slashes:
/author/lastName:Hardy
Fielded queries can also be combined with boolean operators:
/author/lastName:Hardy AND /author/firstName:Thomas
Such a query would only match an object that had both a lastName Hardy and a firstName Thomas.
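Fielded queries contain slashes, colons, and spaces, so they need URL encoding when passed as a request parameter. The sketch below joins clauses with AND and encodes the result as a “query” parameter. The host name and the "/search" path are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: placeholder host; the "/search" path is also assumed here.
CORDRA_BASE_URL = "https://cordra.example.org"


def and_query(*clauses: str) -> str:
    """Combine fielded clauses so that all of them must match."""
    return " AND ".join(clauses)


def search_url(query: str) -> str:
    """Percent-encode the query string as the "query" parameter."""
    return CORDRA_BASE_URL + "/search?" + urlencode({"query": query})


query = and_query("/author/lastName:Hardy", "/author/firstName:Thomas")
url = search_url(query)
print(url)

# Decoding the URL again recovers the original query text.
print(parse_qs(urlsplit(url).query)["query"][0])
```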
A * character can be used to find results where only part of the term matches:
/author/lastName:Har*
/author/lastName:H*y
A query with a leading wildcard, such as /author/lastName:*y, is invalid in a default single-instance Cordra. It is supported when Cordra is using a Solr or Elasticsearch backend, or when the Lucene backend has been configured to support it; see Configuring Indexing Backend.
Fuzzy matching tolerates small spelling mistakes. The incorrect spellings of Hardy below will still match the example object:
/author/lastName:Hardi~
/author/lastName:Hardie~
Fuzzy queries only match terms that are different from the query by at most two characters.
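The two-character limit can be checked with an edit-distance calculation. The function below is a plain Levenshtein distance, used here as a simplified model; the search backend's own fuzzy matching may count some edits (such as swapped adjacent letters) differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b), # substitute or keep
            ))
        previous = current
    return previous[-1]


for misspelling in ("hardi", "hardie", "hardwood"):
    distance = edit_distance(misspelling, "hardy")
    print(misspelling, distance, "match" if distance <= 2 else "no match")
```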
To explicitly search for the term “Tragedy” within the array property named “genre”, the underscore character is used:
/genre/_:Tragedy
The same notation is used to search properties of objects that are themselves inside an array, such as the publishers array. For example, to search for all books with a publisher by “Penguin”:
/publishers/_/by:Penguin
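To see which field names a given piece of content gives rise to, the following sketch walks a JSON value and yields one field name per leaf, writing every array position as an underscore. It models only the naming convention described above, not Cordra's indexer, and it assumes property names contain no "/" or "~" characters.

```python
from typing import Any, Iterator


def index_fields(value: Any, pointer: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (field name, leaf value) pairs, using "_" for array positions."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from index_fields(child, f"{pointer}/{key}")
    elif isinstance(value, list):
        for child in value:
            yield from index_fields(child, f"{pointer}/_")
    else:
        yield pointer, value


content = {
    "title": "Tess of the D'Urbervilles",
    "author": {"firstName": "Thomas", "lastName": "Hardy"},
    "genre": ["Victorian", "Tragedy"],
    "publishers": [
        {"by": "James R. Osgood, McIlvaine & Co.", "date": "1891"},
        {"by": "Penguin Classics", "date": "2003"},
    ],
}

fields = list(index_fields(content))
for name, leaf in fields:
    print(f"{name} -> {leaf}")
```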
A range query searches for objects that have a value falling between two values. The example below shows a range query on the date field. It matches any value between 2000 and 2004, inclusive:
date_/publishers/_/date:[2000-01-01 TO 2004-12-31]
The same query but excluding the upper and lower bounds uses curly brackets:
date_/publishers/_/date:{2000-01-01 TO 2004-12-31}
A wildcard can also be used as a bound, to search for anything less than a value:
date_/publishers/_/date:[* TO 2004-12-31]
Or anything greater than:
date_/publishers/_/date:[2000-01-01 TO *]
Note that for most fields, range values are treated as text, such that “less than” and “greater than” refer to lexicographical ordering. When a num_ or date_ prefix is included in a query key, range values are treated as numeric or as date-times respectively.
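The four range forms above differ only in the bracket type and in whether a bound is replaced by a wildcard. A small helper, written here as a sketch with hypothetical names, makes that explicit:

```python
from typing import Optional


def range_query(
    field: str,
    low: Optional[str] = None,
    high: Optional[str] = None,
    inclusive: bool = True,
) -> str:
    """Build a range clause; a missing bound is written as the * wildcard."""
    open_bracket, close_bracket = ("[", "]") if inclusive else ("{", "}")
    low_text = "*" if low is None else low
    high_text = "*" if high is None else high
    return f"{field}:{open_bracket}{low_text} TO {high_text}{close_bracket}"


field = "date_/publishers/_/date"
print(range_query(field, "2000-01-01", "2004-12-31"))
print(range_query(field, "2000-01-01", "2004-12-31", inclusive=False))
print(range_query(field, high="2004-12-31"))
print(range_query(field, low="2000-01-01"))
```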
By default, all properties of the content of an object are indexed as text. This has implications for the ordering of those values, especially when considering numbers in range queries. Range queries over text fields use the lexicographical (dictionary) order of the text. For example, the numbers “1”, “10”, “2”, treated as text, are in lexicographical order. Consider the following object:
{
"foo": 2
}
And the range query:
/foo:[1 TO 10]
The above query would not include our object in the results because in lexicographical order 2 comes after 10. It is often useful to have numeric values sorted in numerical order 1, 2, 10 instead. To support this, Cordra creates an additional field for every JSON number it finds in the content when indexing. This additional field is prepended with the prefix num_. As such, you can use the following range query, which will include our object:
num_/foo:[1 TO 10]
This additional field is indexed as a double-precision floating-point number. Note that you do not need to indicate anything in the schema to get numeric fields indexed this way. All numbers in the JSON get this extra field automatically.
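The difference between the two orderings is easy to reproduce locally. The sketch below compares the same values as text and as numbers; it illustrates the ordering only and does not involve Cordra itself.

```python
values = ["1", "2", "10"]

# Text comparison goes character by character, so "10" sorts before "2".
text_order = sorted(values)
# Numeric comparison uses the value, as a num_ field would.
numeric_order = sorted(values, key=float)


def in_text_range(value: str, low: str, high: str) -> bool:
    """Range check using lexicographical (string) comparison."""
    return low <= value <= high


def in_numeric_range(value: str, low: str, high: str) -> bool:
    """Range check after converting the bounds and value to floats."""
    return float(low) <= float(value) <= float(high)


print(text_order)
print(numeric_order)
print(in_text_range("2", "1", "10"))
print(in_numeric_range("2", "1", "10"))
```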
When the value of a JSON property is a string with the ISO 8601 date-time format:
2022-01-05T08:30:00.000-05:00
then an additional field is created, prepended with the prefix date_. The date_ fields can be used to ensure the punctuation does not lead to tokenization, and for range querying; for example:
date_/foo:[2022-01-01 TO 2022-01-05]
will match the preceding value.
date_ fields will also be created for values which only have a year-month-day, or only year-month-day and hour:minute; fractional seconds can be omitted or included; the time zone can be omitted, written without colons, or given as “Z”; and a space can be used instead of the ISO-8601-specified “T” to separate the date and the time. (Range queries need to use the “T”.)
Additionally, the fields date_metadata/createdOn, date_metadata/modifiedOn, and date_metadata/publishedOn are created.
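When generating content programmatically, a value in the full date-time format shown above can be produced with the Python standard library. This is a minimal sketch; the function name is hypothetical, and it rejects datetimes without a time zone simply so that the output always carries an explicit offset.

```python
from datetime import datetime, timedelta, timezone


def iso_date_time(moment: datetime) -> str:
    """Format as ISO 8601 with milliseconds and a colon-separated offset."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.isoformat(timespec="milliseconds")


# A fixed UTC-05:00 offset, matching the example value in the text.
minus_five = timezone(timedelta(hours=-5))
value = iso_date_time(datetime(2022, 1, 5, 8, 30, tzinfo=minus_five))
print(value)
```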
To find all objects that have a given property, regardless of its value, perform a wildcard range query from any value to any value. The query below will return all objects that have a property /language regardless of the value:
/language:[* TO *]
Cordra-managed metadata of a digital object is also managed as JSON. That metadata is a sibling of content within the Cordra object. Properties within metadata can be searched by property name by prefixing it with “metadata”:
metadata/createdOn:1562866891119
metadata/modifiedOn:1562945123652
metadata/createdBy:admin
metadata/modifiedBy:admin
metadata/txnId:1562945123643011
If hashes have been turned on for this type of object, those can also be searched:
metadata/hashes/full:58848eeda8472a14f4c5fb709aa96094409018b0e623baf7c94c991ea3811f15
Some parts of the metadata can be searched with special field names:
Search by type:
type:Book
Search by id:
id:"123/test"
Search by the user that created or modified the object:
createdBy:admin
modifiedBy:admin
Creation and modification timestamps:
The two fields objcreated and objmodified contain the timestamp of the object converted into the human-readable format yyyyMMddHHmmssSSS. Note that these fields do not contain delimiters, since delimiters can result in tokenization of the string, which can then be challenging to search on:
objcreated:yyyyMMddHHmmssSSS
objmodified:yyyyMMddHHmmssSSS
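A value in the yyyyMMddHHmmssSSS pattern can be derived from an epoch-milliseconds timestamp such as metadata/createdOn. The sketch below assumes the timestamp is rendered in UTC; that is an assumption of this example, not something this section states, so verify it against your own data before relying on it.

```python
from datetime import datetime, timezone


def timestamp_field_value(epoch_millis: int) -> str:
    """Render epoch milliseconds as yyyyMMddHHmmssSSS (UTC assumed)."""
    seconds, millis = divmod(epoch_millis, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S") + f"{millis:03d}"


# createdOn from the example object earlier in this section.
created_on = 1659978578983
print("objcreated:" + timestamp_field_value(created_on))
```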
If the object contains userMetadata, as it does in this case, it can be searched with the “userMetadata” prefix:
userMetadata/Foo:Bar
If explicit ACLs have been added to the object, those can also be searched as JSON with the “acl” prefix. For example, searching for all objects that have given explicit write permission to test/xyz:
acl/writers/_:"test/xyz"
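The examples in this section wrap identifiers such as test/xyz in double quotes. When building such clauses from arbitrary values, any backslash or double quote inside the value has to be escaped first. The helper names below are hypothetical, and the escaping covers only those two characters, which is what a quoted phrase in Lucene query syntax requires.

```python
def quote_phrase(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def field_phrase(field: str, value: str) -> str:
    """Build a fielded clause whose value is a quoted phrase."""
    return f"{field}:{quote_phrase(value)}"


print(field_phrase("acl/writers/_", "test/xyz"))
print(field_phrase("id", "123/test"))
```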