Information Extraction with Elasticsearch – How to enrich your data

We know that data is the new gold, and its quality can make a real difference for any website or online business. Nowadays many online businesses aggregate data from various sources, build a search engine on top of it, and earn money with clicks and referrals.

The sources can vary a lot in content and quality. It’s usually not easy to find a uniform data model for these different sources and normalize them.

Some sources provide more metadata than others. The easiest way to normalize different data sources is to use the fields or keys that are common to all of them. Obviously that means a lot of machine-readable data is discarded, and the filters and aggregations/facets you can provide to your users will only contain the bare necessities.

To add real value to your search engine you should be able to provide many filters, especially if you don’t own the content.

Example: Fashion website

Let’s assume you are building a fashion website and plan to make money with referrals.

You have three different sources, and one of them does not provide the material of the garments in a machine-readable format (XML, JSON, etc.). All you get is the raw description of the product, which often contains information about the material (cotton, polyester, leather, wool, etc.).

It would be great if you could use this information to automatically enrich the products with the material in a machine-readable format.

Guess what: it’s not that hard if you use Elasticsearch and know a bit about the variability of natural language.

Automatic data enrichment process

Prerequisites

First of all you need to analyze your data. I’d recommend using the Kibana Dev Tools for that. Postman or the terminal also work, but are less comfortable.

You need a list of all possible values your field can have. In our case that would be a list of all materials your products can be made of.

For starters it does not need to be a complete list, but that’s the goal.
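
In case you don’t know your data by heart yet, a quick look at the mapping and a few sample documents in the Dev Tools console is a good starting point. The requests below assume the products index with the title and description fields used throughout this post:

# Inspect the field mapping of the index
GET products/_mapping

# Fetch a few sample documents to get a feel for the raw descriptions
GET products/_search
{
  "size": 5,
  "_source": ["title", "description"]
}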

Query Building

We start with the first term, “cotton”. Now you have to build a query that matches cotton in the description and maybe also in the title of your documents.

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "cotton",
      "fields": ["title", "description"]
    }
  },
  "_source": false,
  "highlight": {
    "fields": {
      "title" : {},
      "description" : {}
    }
  },
  "size" : 100
}

We start with a basic query. Since we’re interested in analyzing the context of our query hits, we enable the highlighting feature. We also want to look at a decent sample, so we set size to 100 (or even more).
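
With highlighting enabled, every hit contains the matching terms wrapped in <em> tags (the default highlighter markup), so you can quickly scan the surrounding words. A single hit could look roughly like this (the product text is just an invented example):

{
  "_index": "products",
  "_id": "42",
  "_score": 2.31,
  "highlight": {
    "description": [
      "Classic t-shirt made of 100% organic <em>cotton</em> with a round neck."
    ]
  }
}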

Analyze the results

Then you need to manually check the results of your query. Does “cotton” always refer to the material, or not?

Watch out for negations, ambiguous words, proper names, etc.

If you find examples of a wrong context, like a negation or a phrase such as “goes well with a cotton dress”, you need to write an exception rule for it.
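
Before you turn a suspicious phrase into an exception rule, it can help to query for it in isolation (again with highlighting) and confirm that it really describes a wrong context:

# Look at a suspicious phrase in isolation before writing an exception rule
GET products/_search
{
  "query": {
    "multi_match": {
      "query": "goes well with cotton",
      "fields": ["title", "description"],
      "type": "phrase"
    }
  },
  "_source": false,
  "highlight": {
    "fields": {
      "title": {},
      "description": {}
    }
  },
  "size": 20
}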

Write exception rules

I haven’t used the “must_not” block of bool queries in Elasticsearch very often, but here you really need it.

You now need to extend your query to a bool query and add all exceptions to the “must_not” block.

GET products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "cotton",
            "fields": [
              "title",
              "description"
            ]
          }
        }
      ],
      "must_not": [
        {
          "multi_match": {
            "query": "goes well with cotton",
            "fields": [
              "title",
              "description"
            ],
            "type" : "phrase"
          }
        },
        {
          "multi_match": {
            "query": "no cotton",
            "fields": [
              "title",
              "description"
            ],
            "type" : "phrase"
          }
        },
        {
          "multi_match": {
            "query": "not cotton",
            "fields": [
              "title",
              "description"
            ],
            "type" : "phrase"
          }
        }
      ]
    }
  },
  "_source": false,
  "highlight": {
    "fields": {
      "title": {},
      "description": {}
    }
  }
}

Now you need to re-analyze the results and check if there are more exceptions to add. And yes, it is an ongoing process.
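
A simple way to keep track of this iteration is to run the same query against the _count API after every change and watch how the number of matching documents develops. A trimmed-down sketch with only the first exception:

# Count the remaining matches after adding or changing exception rules
GET products/_count
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "cotton",
            "fields": ["title", "description"]
          }
        }
      ],
      "must_not": [
        {
          "multi_match": {
            "query": "goes well with cotton",
            "fields": ["title", "description"],
            "type": "phrase"
          }
        }
      ]
    }
  }
}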

Once you feel confident about your results you can move on to building the enrichment pipeline.

Build the data enrichment pipeline

To actually enrich your documents you can use an ingest pipeline. At least one of your Elasticsearch nodes needs to have the “ingest” role (which all nodes have by default).

Then you can create an ingest pipeline that will add a field (material) and a value (cotton) to your documents.

The pipeline is stored in the cluster state and stays there as long as your cluster does not get rebuilt, so it’s a good idea to automate its creation at some point.

# Create ingest pipeline
PUT _ingest/pipeline/enrich-cotton
{
  "description": "Enrich documents with material cotton", 
  "processors": [
    {
      "set": {
        "field": "_source.material",
        "value": "cotton"
      }
    }
  ]
}
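
Before running the pipeline against the whole index, you can check that it was stored and dry-run it on a made-up document with the Simulate Pipeline API to confirm that the material field gets set as expected:

# Verify that the pipeline is stored in the cluster
GET _ingest/pipeline/enrich-cotton

# Dry-run the pipeline on a made-up document
POST _ingest/pipeline/enrich-cotton/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Basic T-Shirt",
        "description": "Classic t-shirt made of 100% cotton."
      }
    }
  ]
}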

Enrich your documents

And now you’re ready to actually enrich your documents with the material cotton.

For that we will use the Update By Query API, together with the query you defined earlier, including all the exception rules.

All documents matching that query will be passed through the ingest pipeline “enrich-cotton” and be updated automatically in that same index.

Depending on the size of your index this might take some time.

If it does not terminate, or any other problems arise, you can cancel the update task with the Task Management API (see the example at the end of this post).

# Enrich your documents
POST products/_update_by_query?wait_for_completion=false&pipeline=enrich-cotton
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "cotton",
            "fields": [
              "title",
              "description"
            ]
          }
        }
      ],
      "must_not": [
        {
          "multi_match": {
            "query": "goes well with cotton",
            "fields": [
              "title",
              "description"
            ]
          }
        },
        {
          "regexp": {
            "content.standard": "not? cotton"
          }
        }
      ]
    }
  }
}
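
Because we pass wait_for_completion=false, Elasticsearch returns a task ID right away instead of waiting for the update to finish. With that ID you can check the progress of the update or cancel it via the Task Management API. The task ID below is just a placeholder; use the one from your own response:

# Check the progress of the update-by-query task
GET _tasks/<task_id>

# Cancel the task if something goes wrong
POST _tasks/<task_id>/_cancel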
