
Named Entity Annotations in Elasticsearch

This blog post shows how to use Elasticsearch to extract named entities from text and store them as annotations.

There is a really nice plugin written by Alexander Reelsen, one of the main Elasticsearch developers:

https://github.com/spinscale/elasticsearch-ingest-opennlp

This plugin wraps the OpenNLP library and lets you extract named entities from text.

If you want to follow the tutorial, please install the plugin according to the instructions in the repository.

Elasticsearch 6.5 introduced a feature called Annotated Text. It is provided as a separate plugin (mapper-annotated-text), so please follow its installation instructions as well.

Annotated Text injects annotations into the token stream at the same position as the text they annotate. This is a crucial feature for text mining, because it lets you query for previously detected and labelled entities.
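To make the markup more concrete, here is a minimal Python sketch that splits a string in the `[text](value1&value2)` format into the plain text and a list of annotations with character offsets. It is only an illustration of the format, not the parser the plugin uses; the function name `parse_annotated` and the offset representation are assumptions made for this example.

```python
import re
from urllib.parse import unquote_plus

# Matches the markdown-like markup [visible text](value1&value2).
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")


def parse_annotated(text):
    """Return (plain_text, annotations).

    Each annotation is (start, end, values), where start/end are character
    offsets into plain_text. Hypothetical helper for illustration only.
    """
    plain_parts = []
    annotations = []
    pos = 0      # read position in the marked-up input
    offset = 0   # write position in the plain-text output
    for match in ANNOTATION.finditer(text):
        before = text[pos:match.start()]
        plain_parts.append(before)
        offset += len(before)
        covered = match.group(1)
        # Values are separated by "&"; "+" stands for a space.
        values = [unquote_plus(v) for v in match.group(2).split("&") if v]
        annotations.append((offset, offset + len(covered), values))
        plain_parts.append(covered)
        offset += len(covered)
        pos = match.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), annotations


plain, found = parse_annotated(
    "[Kobe Bryant](Kobe+Bryant&persons) was great. [Munich](Munich&locations) too."
)
print(plain)
for start, end, values in found:
    print(start, end, plain[start:end], values)
```

Each annotation covers exactly the span of text it was attached to, which is what makes it possible to search for "persons" and still highlight the name in the original sentence.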

This blog post explains how to build an ingest pipeline that converts the named entities extracted by the Ingest OpenNLP plugin directly into the annotation format.

Here is the ingest pipeline definition that you can use to generate the Annotated Text format:

PUT _ingest/pipeline/annotate-opennlp-pipeline
{
  "description": "A pipeline to do named entity extraction and create annotations",
  "processors": [
    {
      "opennlp": {
        "field": "my_field"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          // ctx.entities maps each entity type (e.g. "persons") to the list of detected values
          Map map = ctx.entities;
          Iterator it = map.keySet().iterator();
          while (it.hasNext()) {
            String annotationType = (String) it.next();
            List annotationValues = (List) map.get(annotationType);
            Iterator annotationIterator = annotationValues.iterator();
            while (annotationIterator.hasNext()) {
              String annotation = annotationIterator.next();
              // spaces inside an annotation value are written as "+"
              String escapedAnnotation = annotation.replace(" ", "+");
              // wrap every occurrence as [entity](entity&type)
              String annotatedText = ctx[params.field].replace(annotation, "["+annotation+"]("+escapedAnnotation+"&"+annotationType+")");
              ctx[params.field] = annotatedText;
            }
          }
        """,
        "params": {
          "field": "my_field"
        }
      }
    },
    {
      "remove": {
        "field": ["entities"]
      }
    }
  ]
}
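If you want to try out the replacement logic without a running cluster, the Painless script above can be mirrored in a few lines of Python. This is only a sketch: the function name `annotate` is made up for this example, and the shape of the `entities` map (entity type mapped to a list of strings) is assumed from the way the script reads `ctx.entities`.

```python
def annotate(text, entities):
    """Wrap each detected entity in [entity](entity&type) markup.

    `entities` is assumed to look like {"persons": ["Kobe Bryant"], ...},
    mirroring what the script reads from ctx.entities.
    """
    for annotation_type, values in entities.items():
        for annotation in values:
            # Spaces inside an annotation value are written as "+".
            escaped = annotation.replace(" ", "+")
            markup = "[" + annotation + "](" + escaped + "&" + annotation_type + ")"
            # Like String.replace in the script, this replaces every occurrence.
            text = text.replace(annotation, markup)
    return text


print(annotate(
    "Kobe Bryant lives in New York.",
    {"persons": ["Kobe Bryant"], "locations": ["New York"]},
))
```

One thing to keep in mind: because this is plain string replacement, it also rewrites matches inside markup that was inserted earlier. If the entity list contained both "New York" and "York", the second replacement would nest brackets inside the first annotation. The same applies to the Painless version, so overlapping entities need extra handling.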

Let’s test this pipeline:

POST _ingest/pipeline/annotate-opennlp-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "my_field": "Kobe Bryant was one of the best basketball players of all times. Not even Michael Jordan has ever scored 81 points in one game. Munich is really an awesome city, but New York is as well. Yesterday has been the hottest day of the year."
      }
    }
  ]
}

>>>>>> Response <<<<<<
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "my_field" : "[Kobe Bryant](Kobe+Bryant&persons) was one of the best basketball players of all times. Not even [Michael Jordan](Michael+Jordan&persons) has ever scored 81 points in one game. [Munich](Munich&locations) is really an awesome city, but [New York](New+York&locations) is as well. [Yesterday](Yesterday&dates) has been the hottest day of the year."
        },
        "_ingest" : {
          "timestamp" : "2019-04-16T10:24:16.352Z"
        }
      }
    }
  ]
}

Looks good! So now you can use this pipeline to index documents into Elasticsearch directly.

Let’s create an index with the appropriate mapping and store the document there:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}

PUT my_index/_doc/1?pipeline=annotate-opennlp-pipeline
{
  "my_field" : "Kobe Bryant was one of the best basketball players of all times. Not even Michael Jordan has ever scored 81 points in one game. Munich is really an awesome city, but New York is as well. Yesterday has been the hottest day of the year."
}

And now we can search for annotation values and highlight them. The terms query matches documents that contain any of the listed values:

GET my_index/_search
{
  "query": {
    "terms": {
      "my_field": [
        "Kobe Bryant",
        "persons"
      ]
    }
  },
  "highlight": {
    "fields": {
      "my_field": {
        "type": "annotated"
      }
    }
  }
}

>>>>>> Response <<<<<<
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "my_field" : "[Kobe Bryant](Kobe+Bryant&persons) was one of the best basketball players of all times."
        },
        "highlight" : {
          "my_field" : [
            "[Kobe Bryant](_hit_term=Kobe+Bryant&_hit_term=persons&Kobe+Bryant&persons) was one of the best basketball players of all times."
          ]
        }
      }
    ]
  }
}
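In the highlight fragment, the matched terms show up as `_hit_term=...` entries in front of the original annotation values. A small Python sketch, assuming exactly the format shown in the response above, can separate the two; the helper name `split_highlight_values` is hypothetical.

```python
from urllib.parse import unquote_plus

HIT_PREFIX = "_hit_term="


def split_highlight_values(values):
    """Split the part between the parentheses of a highlighted annotation
    into (hit_terms, annotation_values). Illustration only."""
    hit_terms = []
    annotation_values = []
    for part in values.split("&"):
        if part.startswith(HIT_PREFIX):
            hit_terms.append(unquote_plus(part[len(HIT_PREFIX):]))
        elif part:
            annotation_values.append(unquote_plus(part))
    return hit_terms, annotation_values


hits, annotations = split_highlight_values(
    "_hit_term=Kobe+Bryant&_hit_term=persons&Kobe+Bryant&persons"
)
print(hits)
print(annotations)
```

This makes it easy to tell in client code which annotations actually matched the query and which were simply present on the same span of text.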

Have fun and let me know how it goes!
