
Introducing a generic dynamic mapping template for ElasticSearch

Configuring a mapping for ElasticSearch is not required.
By design, and unlike Solr, ElasticSearch is schemaless.
If no mapping is defined, ElasticSearch creates one for the type on the fly, based on the first document that is indexed.
If a later document contains fields the mapping does not know yet, the mapping is extended dynamically, unless this has been disabled.
Dynamic mapping updates can slow down indexing and cost extra resources such as CPU.

So it’s highly recommended to create a custom mapping before indexing.
Dynamic mappings are not only slow, they also turn every misspelled field name into an additional field.
An unconfigured mapping can also cause unwanted side effects:
by default, each string field is analyzed with the standard analyzer and added to the _all field.
That is not a good idea for URLs, for example. Usually you don’t want to run full-text searches on URLs.
So it’s better to leave those fields unanalyzed.
This also keeps your index from growing unnecessarily (index size ~ RAM ~ $$$).

For some fields you might want an unanalyzed field AND an analyzed field.
Especially if you want to run term facets over those fields AND also search them.
This is extremely useful for person names, keywords or tags.
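To see why both variants are useful, here is a small Python sketch. It is not ElasticSearch code: the tokenizer is only a very rough stand-in for the standard analyzer (lowercase word tokens), and the author values are made up for illustration.

```python
import re
from collections import Counter


def rough_standard_tokens(text):
    """Very rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


# Hypothetical values of an author_name field in three documents.
authors = ["John Smith", "John Smith", "Jane Smith"]

# A term facet on the analyzed field counts the individual tokens.
analyzed_counts = Counter(tok for a in authors for tok in rough_standard_tokens(a))

# A term facet on the unanalyzed (.raw) field counts the whole values.
raw_counts = Counter(authors)

print(analyzed_counts.most_common())
print(raw_counts.most_common())
```

The analyzed variant lets a search for "smith" find all three documents, while the unanalyzed variant is the one you want for a facet that lists complete names.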

In some cases you need language-specific treatment.
It’s really easy and helpful to use one of the more than 30 language analyzers that the ElasticSearch team has already implemented for you.

http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

Even though it’s a good idea to create a static mapping and seal it by setting dynamic to false or strict (*), you’re losing flexibility if you have to know all your data upfront.

(* dynamic:false ignores new fields: they are not added to the mapping and are not indexed, but they remain in _source. dynamic:strict rejects any document that contains an unknown field.)
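As a rough illustration of the three dynamic settings as the ElasticSearch documentation describes them (true adds new fields, false ignores them, strict rejects the document), here is a Python sketch that only simulates the effect on the set of mapped fields. The function name and the sample document are hypothetical, and nothing here talks to a cluster.

```python
def fields_after_indexing(mapped_fields, document, dynamic="true"):
    """Simulate which fields are in the mapping after indexing one document."""
    mapped = set(mapped_fields)
    unknown = sorted(f for f in document if f not in mapped)
    if dynamic == "strict" and unknown:
        # strict: the whole document is rejected
        raise ValueError("strict mapping rejects unknown fields: " + ", ".join(unknown))
    if dynamic == "true":
        # true: unknown fields are added to the mapping
        mapped.update(unknown)
    # false: unknown fields are simply ignored by the mapping
    return mapped


# "titel" is a misspelled field name that sneaks into a document.
doc = {"title": "Katze", "titel": "typo field"}

for setting in ("true", "false"):
    print(setting, sorted(fields_after_indexing(["title"], doc, setting)))

try:
    fields_after_indexing(["title"], doc, "strict")
except ValueError as err:
    print("strict", err)
```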

To get the best of both worlds, a specific mapping AND flexible fields, I’d recommend using dynamic templates.
And to make things nice and easy, I’ve created one for you that should work for many real-world use cases.

http://www.elastic.co/guide/en/elasticsearch/guide/master/custom-dynamic-mapping.html#dynamic-templates

The only thing you then have to do is append format-specific suffixes to your field names, which trigger ElasticSearch to use the appropriate mapping.

The following suffixes are configured here:

*_url
*_name (multifields: *_name, *_name.raw)
*_de
*_it
*_en
*_fr
*_es

So you just need to follow these naming conventions in your JSON documents before indexing, and the generic dynamic template will do the magic for you.
If a string field has none of these suffixes, the standard analyzer is used.
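Dynamic templates are checked in order and the first match wins, which is why the catch-all has to come last. The following Python sketch only imitates that lookup with fnmatch so you can check which treatment a field name would get. The pattern table mirrors the templates in this post, assuming the Spanish one is meant to match *_es. The resolve function is a hypothetical helper, not part of any ElasticSearch client.

```python
from fnmatch import fnmatchcase

# (pattern, treatment) pairs in the same order as the templates.
TEMPLATES = [
    ("*_url", "not_analyzed"),
    ("*_name", "standard + .raw (not_analyzed)"),
    ("*_de", "german"),
    ("*_it", "italian"),
    ("*_en", "english"),
    ("*_fr", "french"),
    ("*_es", "spanish"),  # assumption: Spanish fields end in _es
    ("*", "standard"),    # catch-all, must stay last
]


def resolve(field_name):
    """Return the treatment of the first template whose pattern matches."""
    for pattern, treatment in TEMPLATES:
        if fnmatchcase(field_name, pattern):
            return treatment
    return None


for field in ["wiki_url", "article_name", "name_de", "title"]:
    print(field, "->", resolve(field))
```

Note that name_de ends in _de, not in _name, so it gets the german analyzer and no .raw sub-field.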

After creating your index, just enable the generic dynamic mapping by running this command:

TODO: Replace INDEX_NAME, TYPE_NAME and the host with your own values.

curl -XPUT 'localhost:9200/INDEX_NAME/TYPE_NAME/_mapping' -d '
{
   "TYPE_NAME": {
      "dynamic_templates": [
         {
            "url": {
               "match": "*_url",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "index": "not_analyzed"
               }
            }
         },
         {
            "name": {
               "match": "*_name",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "standard",
                  "fields": {
                     "raw": {
                        "type": "string",
                        "index": "not_analyzed"
                     }
                  }
               }
            }
         },
         {
            "de": {
               "match": "*_de",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "german"
               }
            }
         },
         {
            "it": {
               "match": "*_it",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "italian"
               }
            }
         },
         {
            "en": {
               "match": "*_en",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "english"
               }
            }
         },
         {
            "fr": {
               "match": "*_fr",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "french"
               }
            }
         },
         {
            "es": {
               "match": "*_es",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "spanish"
               }
            }
         },
         {
            "text": {
               "match": "*",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "standard"
               }
            }
         }
      ]
   }
}
'
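Since the templates are very repetitive, a copy-and-paste slip in a match pattern is easy to make. One way to avoid that is to generate the request body instead of typing it. The following is a minimal Python sketch under that assumption. The helper names are hypothetical, it only builds and prints the JSON, and sending it to your cluster is still up to you (for example with the curl command above).

```python
import json

# Suffix -> built-in language analyzer, as used in this post.
LANGUAGE_ANALYZERS = {
    "de": "german",
    "it": "italian",
    "en": "english",
    "fr": "french",
    "es": "spanish",
}


def string_template(name, match, mapping):
    """Build one dynamic template for string fields matching a name pattern."""
    full_mapping = {"type": "string"}
    full_mapping.update(mapping)
    return {name: {"match": match, "match_mapping_type": "string", "mapping": full_mapping}}


def build_mapping(type_name):
    """Build the mapping body: specific templates first, catch-all last."""
    raw_field = {"raw": {"type": "string", "index": "not_analyzed"}}
    templates = [
        string_template("url", "*_url", {"index": "not_analyzed"}),
        string_template("name", "*_name", {"analyzer": "standard", "fields": raw_field}),
    ]
    for suffix, analyzer in LANGUAGE_ANALYZERS.items():
        templates.append(string_template(suffix, "*_" + suffix, {"analyzer": analyzer}))
    templates.append(string_template("text", "*", {"analyzer": "standard"}))
    return {type_name: {"dynamic_templates": templates}}


body = build_mapping("TYPE_NAME")
print(json.dumps(body, indent=3))
```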

Test with Sense

# Create index
POST /test

# Create generic-dynamic-template
PUT /test/test/_mapping
{
   "test": {
      "dynamic_templates": [
         {
            "url": {
               "match": "*_url",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "index": "not_analyzed"
               }
            }
         },
         {
            "name": {
               "match": "*_name",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "standard",
                  "fields": {
                     "raw": {
                        "type": "string",
                        "index": "not_analyzed"
                     }
                  }
               }
            }
         },
         {
            "de": {
               "match": "*_de",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "german"
               }
            }
         },
         {
            "it": {
               "match": "*_it",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "italian"
               }
            }
         },
         {
            "en": {
               "match": "*_en",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "english"
               }
            }
         },
         {
            "fr": {
               "match": "*_fr",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "french"
               }
            }
         },
         {
            "es": {
               "match": "*_es",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "spanish"
               }
            }
         },
         {
            "text": {
               "match": "*",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "analyzer": "standard"
               }
            }
         }
      ]
   }
}

# Create test-document
POST /test/test/1
{
    "wiki_url" : "http://de.wikipedia.org/Katze",
    "name_de" : "Katze",
    "name_en" : "cat",
    "name_it" : "gatto",
    "name_fr" : "chat",
    "article_name" : "Die kleine Katze"
}

# Search fields
# Those should match
GET /test/test/_search?q=name_de:Katzen
GET /test/test/_search?q=name_fr:chats

# This should not match, because wiki_url is not analyzed
GET /test/test/_search?q=wiki_url:Katze

NB: The *_name fields are mapped as multi-fields: the main field uses the standard analyzer and the .raw sub-field is unanalyzed. You may want to switch from standard to a specific language analyzer if you’re working with non-English text.
