Search Index

Content search is a Content Fabric feature that allows the creation of search indexes based on content object metadata (including file content, e.g. video tagging), and it provides a search query interface.

For information on updating v1 queries to the new version, see Migrating Old Queries to V2 below.

Searching an Index

/rep/search

Search indexes are represented as Content Fabric objects, and /rep/search should be called on these objects. See the section Create an Index Object below for more information on how to create and manage these indexes.

Search can be invoked through the REST API, but for personal use it is often more convenient to use a command-line client like elv, for example: elv content bitcode rep $CONTENT_ID search
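
For reference, a raw REST call takes roughly the shape below. This is only a sketch: the node URL, library path, and token handling are placeholders that depend on your deployment; the /rep/search endpoint and the terms parameter are the parts taken from this document.

curl -s -H "Authorization: Bearer $TOKEN" \
  "https://$FABRIC_NODE/qlibs/$INDEX_LIB/q/$INDEX_OBJ/rep/search?terms=f_title%3Ahello"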

Create an Index Object

Create Fabric Object and Metadata Set Up

An index object needs two things:

  • A reference to a content type that has the builtin capabilities (creating a simple content type with metadata {"bitcode_format":"builtin"} will suffice).
  • Metadata with a particular field and format, used to configure the crawler and search engine. That metadata should contain the field .indexer.config, and the format of that field should follow these rules:
    • fabric.root.library and fabric.root.content should correspond to the library ID and content ID of the root metadata to crawl.
    • indexer.type should be equal to "metadata-text"
    • indexer.arguments.fields contains all the fields that are searchable.
    • A searchable field inside indexer.arguments.fields should have the following format:
    "<searchable_field_name>": {
      "options": null,
      "paths": [
          "<path_0>",
          "<path_1>",
          "...",
          "<path_N>"
      ]
    }
    
    • <searchable_field_name> can be any string; this name will be used when querying the index for that particular field (cf. below).
    • Each <path_i> is a metadata path to a leaf field that will be indexed under the name <searchable_field_name>.
    • For example:
    "synopsis": {
      "options": null,
      "paths": [
          "public.asset_metadata.titles.*.*.info.synopsis",
          "public.asset_metadata.series.*.*.info.synopsis",
          "public.asset_metadata.series.*.*.seasons.*.*.synopsis"
      ]
    }
    
    will index all the fields it can find at any of those paths under the name synopsis. When doing a search, it will then be possible to query a synopsis using a query string like f_synopsis:<keyword> (cf. below).
    • Paths can contain a wildcard *, meaning that any key name at that position will be crawled.
    • Paths are namespaces in the sense that array indices are ignored. For example, metadata at A.B and at A[0].B will both be captured by the path A.B.

Here is an example of proper metadata for an Index Object:

{
  "public": {
    "name": "Index - Site Roar"
  },
  "indexer": {
    "config": {
      "fabric": {
        "root": {
          "library": "ilib2XX6yS9S8bgAeLVxDGKeoNcNVckN",
          "content": "iq__cWJC7xQ9v3rXPYMiyhRF27Bf1rj"
        }
      },
      "indexer": {
        "type": "metadata-text",
        "arguments": {
          "fields": {
            "title": {
              "options": null,
              "paths": [
                "public.asset_metadata.titles.*.*.display_title",
                "public.asset_metadata.titles.*.*.title",
                "public.asset_metadata.titles.*.*.seasons.*.*.display_title",
                "public.asset_metadata.titles.*.*.seasons.*.*.title",
                "public.asset_metadata.titles.*.*.seasons.*.*.titles.*.*.display_title",
                "public.asset_metadata.titles.*.*.seasons.*.*.titles.*.*.title",

                "public.asset_metadata.series.*.*.display_title",
                "public.asset_metadata.series.*.*.title",
                "public.asset_metadata.series.*.*.seasons.*.*.display_title",
                "public.asset_metadata.series.*.*.seasons.*.*.title",
                "public.asset_metadata.series.*.*.seasons.*.*.titles.*.*.display_title",
                "public.asset_metadata.series.*.*.seasons.*.*.titles.*.*.title"
              ]
            },
            "title_type": {
              "options": null,
              "paths": [
                "public.asset_metadata.titles.*.*.title_type",
                "public.asset_metadata.titles.*.*.seasons.*.*.title_type",
                "public.asset_metadata.titles.*.*.seasons.*.*.titles.*.*.title_type",

                "public.asset_metadata.series.*.*.title_type",
                "public.asset_metadata.series.*.*.seasons.*.*.title_type",
                "public.asset_metadata.series.*.*.seasons.*.*.titles.*.*.title_type"
              ]
            },
            "asset_type": {
              "options": null,
              "paths": [
                "public.asset_metadata.titles.*.*.asset_type",
                "public.asset_metadata.titles.*.*.seasons.*.*.asset_type",
                "public.asset_metadata.titles.*.*.seasons.*.*.titles.*.*.asset_type",

                "public.asset_metadata.series.*.*.asset_type",
                "public.asset_metadata.series.*.*.seasons.*.*.asset_type",
                "public.asset_metadata.series.*.*.seasons.*.*.titles.*.*.asset_type"
              ]
            },
            "synopsis": {
              "options": null,
              "paths": [
                "public.asset_metadata.titles.*.*.info.synopsis",
                "public.asset_metadata.titles.*.*.seasons.*.*.synopsis",
                "public.asset_metadata.titles.*.*.seasons.*.*.titles.*.*.synopsis",

                "public.asset_metadata.series.*.*.info.synopsis",
                "public.asset_metadata.series.*.*.seasons.*.*.synopsis",
                "public.asset_metadata.series.*.*.seasons.*.*.titles.*.*.synopsis"
              ]
            }
          }
        }
      }
    }
  }
}

Once both the content type and metadata are ready (cf. above), it is time to create the index object on the fabric. The steps are:

  1. Create a new content object, giving it the type hash of the content type and the metadata prepared above.
  2. Finalize that object.

At that point, the index is empty and cannot be searched. To be usable, it needs to perform a crawl (next step).

Crawling an Index Object

/call/search_update

Any time your root metadata changes, you need to recrawl in order to update the index. To do that, follow these steps (with examples using elv):

  1. Edit the Index Object (it will give you a write token)

    elv content edit $INDEX_OBJ

  2. Make a bitcode call to search_update using the write token. IMPORTANT: the authorization token SHOULD NOT have a transaction ID! The call will not work if it does.

    elv content bitcode call $INDEX_OBJ search_update "" lro.txt --post=true --finalize=false

    An LRO handle will be posted in lro.txt
  3. Wait a bit, then check for completion with a call to crawl_status

    elv content bitcode call $INDEX_OBJ crawl_status "{\"lro_handle\": \"$LRO\"}" status.txt --post=true

  4. Once crawl_status indicates success, finalize the object

    elv content finalize $WRITE_TOKEN

  5. Your Index Object should have been updated with an index at the file path files/indexer/content as well as crawl stats. If the file at files/indexer/content exists, your index is searchable.

    The crawl statistics at meta/indexer/stats can also give helpful debugging information if needed. Crawl task exceptions will be reported at meta/indexer/exceptions.
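
    These can be read back directly from the finalized object's metadata; for example, a REST call of roughly the following shape (a sketch: the node URL, library path, and token are placeholders; only the meta/indexer/stats path comes from this document):

      curl -s -H "Authorization: Bearer $TOKEN" \
        "https://$FABRIC_NODE/qlibs/$INDEX_LIB/q/$INDEX_OBJ/meta/indexer/stats"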

Search Concepts

Tantivy is a search library similar to Apache Lucene, which powers Elasticsearch.

It is used to build an inverted index which allows for the fast lookup of terms in a set of text-based documents.

The inverted index maps a term like “hello” to all document ids which contain “hello”.

Tantivy documents have a predefined schema, which is a set of fields that terms may belong to. For example, if we are indexing movie metadata, fields may include the movie’s title, synopsis, or cast. For our use case, we will probably also want to index things like content ID, hash, etc.

A “term” in the tantivy sense actually includes both the field and the term. So “title:hello” and “synopsis:hello” are totally separate terms that each have their own entry in the inverted index.

Some fields are “stored”: these fields contain terms that are kept in a non-inverted way, meaning that, in addition to the inverted index mapping from terms to documents, we also have a separate map from document IDs to terms. The stored fields are presented to the end user for each matching document.
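
As a purely illustrative sketch (this is not Tantivy's actual data layout, just a picture of the two structures), an inverted index plus a stored-field table for a tiny corpus might look like:

{
  "inverted_index": {
    "title:hello":    [1, 4, 9],
    "synopsis:hello": [2, 4],
    "title:world":    [4, 7]
  },
  "stored_fields": {
    "1": { "title": "Hello" },
    "4": { "title": "Hello World" },
    "9": { "title": "Hello Again" }
  }
}

The inverted index answers “which documents contain this term?”, while the stored table answers “what do we show the user for this document?”.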

Tantivy Search Flow

The typical flow of a search on a tantivy index goes something like this. Say we have the query “hello world” and an index with the schema {“title”, “synopsis”, “cast”}. We pass the query to a query parser, which converts it to lower-level tantivy syntax…

In general we want to match hello or world, not necessarily both, (though of course we would prefer better matches to be presented first), so the query parser will convert this query to “title:hello OR title:world OR synopsis:hello OR synopsis:world OR cast:hello OR cast:world”.

Each field:term entry is a unique term which maps to its own set of documents in the inverted index. Tantivy will look up each term in its inverted index to obtain a document set, and it will return the union of all resulting document IDs. This set can be huge, so scoring of these documents is an important step for the best end user experience.

For each document ID in our result set, we do another lookup in our “stored” table to pull out a set of fields to give back to the user as representing that document. Which fields we “store” is set up at indexing time. As an example, if during indexing we only specify to “store” the “title” field, our search output will just be a list of titles that match our query.

Scoring (BM25)

Of course this set will likely be huge, so Tantivy scores documents based on the following criteria: term frequency * inverse document frequency, adjusted for document size. Term frequency is the number of times a term appears in the document; inverse document frequency is the multiplicative inverse of how often the term appears across the entire corpus. Intuitively, the inverse document frequency gives a higher weight to rarer words. Finally, the document size adjustment compensates for documents that may match more terms simply by having more terms to begin with.

If we leave out the document size adjustment, we are effectively left with the famous tf-idf scoring.
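
For intuition, the standard BM25 formula combines these pieces as follows (this is the textbook formulation, not taken from the Tantivy source, and Tantivy's exact constants may differ):

\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}

Here tf(t, d) is the term frequency of t in document d, IDF(t) is the inverse document frequency, |d| is the document length, avgdl is the average document length across the corpus, and k1 and b are tuning parameters. The b parameter controls how strongly document size is weighted (roughly the weight discussed in the next section); setting b = 0 removes the length adjustment and leaves the tf-idf style scoring described above.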

Scoring changes (v2)

By default Tantivy and Lucene use a particular weight for the document size. This default weight is far too high for clip search (in particular speech to text), where I noticed that small STT entries dominated the results even against more appropriate matches.

Because this option is not yet configurable in tantivy (though it is in Lucene), we made the hard decision to fork tantivy and make it so. The parameter is set at injection time in search/module.go to a smaller value.

Once the core search implementation is solid, I would like to spend more time either finding the best value, passing the choice to the user, or otherwise investigating whether we are missing something important when it comes to scoring. I would like to think that our use cases are special enough to warrant a fork of tantivy, but if there is a nicer solution we should switch.

Stemming (v2)

The default behavior is to stem words as they enter the index and as they are parsed in the query. This means, for example, that if we search the term “stirring” it should match terms “stir”, “stirred”, etc…

This of course gives better recall and allows for some added fault tolerance, but the stemming algorithm is not infallible and will have trouble with certain English “edge” cases. For example, the stem of “shaken” is calculated as “shaken”, while “shaking” and “shaked” are correctly stemmed to “shake”. Similarly, “eaten”, “beaten”, and “forsaken” are incorrectly stemmed. That being said, this limitation does not mean that stemming will give “worse” results than a non-stemmed search; it just means certain terms might not receive the added fault tolerance.

We can disable this behavior on a per-field basis by setting /indexer/config/indexer/arguments/fields/<field>/options/ignore_stemming to true in the index config.
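
For example, to turn off stemming for the synopsis field from the sample config above, its entry would look like the following (the paths are copied from the earlier example; ignore_stemming is the option described in this section):

"synopsis": {
  "options": {
    "ignore_stemming": true
  },
  "paths": [
    "public.asset_metadata.titles.*.*.info.synopsis",
    "public.asset_metadata.series.*.*.info.synopsis"
  ]
}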

Tantivy Syntax

Users of the search API can specify tantivy syntax in the terms parameter.

  1. Default query: Hello World matches all docs containing either “hello” or “world”, sorted by relevance; this is equivalent to Hello OR World. All queries are case insensitive, and punctuation is removed.
  2. Phrase query: "Hello World" matches “hello” followed immediately by “world”.
  3. Field query: f_title:hello (matches one term), f_title:"hello world" (matches a phrase). A query limited to a certain field. (In the previous examples, where no field is specified, it is the query parser that determines which fields to search by default.)
  4. Boolean query: f_speech_to_text:"show me the money" AND f_display_title:"Jerry Maguire". Queries can be combined with AND/OR.
  5. Range query: f_release_date:[1965-09-17 TO 1975-09-17} returns docs with field entries within a lexical range, most useful for single-term fields such as dates. Use [] for inclusion and {} for exclusion.
  6. Boost query: f_display_title:transformers^5 f_synopsis:transformers^2 boosts the score of matches to “transformers” in the display title field by a factor of 5 and in the synopsis field by a factor of 2. This makes the relative importance of the display title field to the synopsis field 5/2.
  7. NOT query: f_display_title:transformers -f_genre:drama or transformers -drama eliminates docs from the result set which contain the given term.
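
Putting several of these together (the field names come from the sample index config earlier in this document; the query itself is just an illustration):

(f_asset_type:primary) AND (f_title:transformers^5 f_synopsis:transformers^2 -f_title_type:series)

This restricts results to primary assets, prefers title matches over synopsis matches, and drops anything whose title_type is series.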

Migrating Old Queries to V2

Updates to Tantivy syntax (terms=)

The Tantivy library which powers Eluvio search has become slightly more stringent with its syntax since our first version. If an old query is not working with v2 and you are receiving an error suggesting “most likely bad query”, then read this section.

Reminder: the query that gets issued to the tantivy library is specified by the “terms” field in the api request.

This is an example query taken from the Roar website.

(f_asset_type:primary) AND (f_title_type:feature OR f_title_type:series) AND (f_franchise#9:bond OR f_display_title#8:bond OR f_cast#6:bond OR f_synopsis#5bond OR f_genre#3:bond) AND NOT (f_hide_from_screeners:yes)

This is what it should be:

(f_asset_type:primary) AND (f_title_type:feature f_title_type:series -f_hide_from_screeners:yes) AND (f_display_title:bond^8 f_cast:bond^6 f_synopsis:bond^5 f_genre:bond^3)

Step by step explanation

  1. Missing ‘:’ after f_synopsis#5. This is likely a typo, and the old tantivy version ignores the issue, but it results in the synopsis field not being searched at all!

(f_asset_type:primary) AND (f_title_type:feature OR f_title_type:series) AND (f_franchise#9:bond OR f_display_title#8:bond OR f_cast#6:bond OR f_synopsis#5:bond OR f_genre#3:bond) AND NOT (f_hide_from_screeners:yes)

  2. No support for the # syntax; use the boost syntax (^) instead:

(f_asset_type:primary) AND (f_title_type:feature OR f_title_type:series) AND (f_franchise:bond^9 OR f_display_title:bond^8 OR f_cast:bond^6 OR f_synopsis:bond^5 OR f_genre:bond^3) AND NOT (f_hide_from_screeners:yes)

Note: If you find that this change makes f_display_title not affect the search results enough, try increasing that 8 to a higher value, or try removing the synopsis field, which should give behavior similar to before due to the bug mentioned in step 1.

  3. Non-existent fields will return an error, so get rid of them (f_franchise doesn’t exist):

(f_asset_type:primary) AND (f_title_type:feature OR f_title_type:series) AND (f_display_title:bond^8 OR f_cast:bond^6 OR f_synopsis:bond^5 OR f_genre:bond^3) AND NOT (f_hide_from_screeners:yes)

  4. The NOT operator can no longer be combined using AND or OR syntax and now uses the minus ‘-’ symbol. It also cannot exist as a standalone query (inside parentheses).

There are two options to fix this…

((f_asset_type:primary) AND (f_title_type:feature OR f_title_type:series) AND (f_display_title:bond^8 OR f_cast:bond^6 OR f_synopsis:bond^5 OR f_genre:bond^3)) -f_hide_from_screeners:yes

…Break the query into two subqueries using parentheses, where the second subquery is the NOT query. Or…

(f_asset_type:primary) AND (f_title_type:feature f_title_type:series -f_hide_from_screeners:yes) AND (f_display_title:bond^8 f_cast:bond^6 f_synopsis:bond^5 f_genre:bond^3)

Remove all the ORs (they are implicitly added anyway), and attach the NOT query to any one of the subqueries that are grouped via AND.

I think the second option is much tidier and would recommend this approach.

Other reminders

  1. Make sure you are targeting a v2 search node (they are listed in the Eluvio config API under the search_v2 field).
  2. Make sure your search index has been crawled by a v2 search node.
  3. If you are unsure which version your index has been crawled with, check the metadata value at /indexer/version. If the value does not exist, it’s v1; if it’s “2.0”, we’re golden.