
How to build a self-learning search engine with Elasticsearch

This blog post walks you through a demo that shows how you can use Elasticsearch to build a self-learning search engine.

You can apply this technique if you have a user-facing UI and access to the web analytics that track user interactions with your website.

And of course you need a running Elasticsearch cluster where you control most of the backend yourself.

There are so many different web technologies and web analytics solutions out there that I will not make any assumptions about the actual implementation beyond Elasticsearch.

To make things easier to imagine, we’ll use an image search engine as an example. Only the captions that describe what’s in the pictures are searchable. The UI has a simple search bar and shows the matching images below it. We’re a startup, and so far only two documents have been uploaded.

We set up an index with a single primary shard. For simplicity we’ll use the default mappings, as this is just a PoC.

### For testing with relevance always use 1 shard
PUT images
{
  "settings": {
    "number_of_shards": 1
  }
}

### Create a few sample docs:
### Add a counter for views and an empty array for query_terms
PUT images/doc/1
{
  "title" : "This is not a dog",
  "views" : 0,
  "query_terms" : []
}

PUT images/doc/2
{
  "title" : "This is a very big dog",
  "views" : 0,
  "query_terms" : []
}

One image shows a big dog; the other shows something else. Our first user searches for the term “dog”. We use a very simple match query in the background.

### Search for the term dog
### Document 1 gets a high score, but it’s not what we want as a top result. 
GET images/doc/_search
{
 "query": {
   "match": {
     "title": "dog"
   }
  }
}

Response

{  
   "took":8,
   "timed_out":false,
   "_shards":{
      "total":1,
      "successful":1,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":2,
      "max_score":0.18936405,
      "hits":[
         {
            "_index":"images",
            "_type":"doc",
            "_id":"1",
            "_score":0.18936405,
            "_source":{
               "title":"This is not a dog",
               "views":0,
              "query_terms":[ ]
            }
         },
         {
            "_index":"images",
            "_type":"doc",
            "_id":"2",
            "_score":0.17578414,
            "_source":{
               "title":"This is a very big dog",
               "views":0,
              "query_terms":[ ]
            }
         }
      ]
   }
}

Oh, that’s too bad. The image without a dog gets a higher score. How can we avoid that? Let’s assume our user really loved the second hit and clicked on it.

That’s really useful feedback for us. We can assume that the document the user clicked on matches their information need, so this counts as positive feedback. We can use it to teach our search engine that this was a good match.

The training data we use is the correlation between the search term and the user’s engagement with a document, so we collect all information related to it: the query term, the id of the document the user interacted with, and optionally the type of interaction. This information will be stored in Elasticsearch. We will create daily (time-based) indices for this purpose and store all web analytics there.

### Create daily indices for tracking the events on your website
### Store 1 document for each action, such as “view”, “click”, or “buy”
### If possible store the search-keyword that the user entered
### Store the _id of the document so you can reference it later
POST webanalytics_2018-12-01/doc
{
 "query_term" : "dog",
 "action_type" : "click",
 "doc_id" : "2"
}
### Store this event 5 times so we can run a nice test later

The assumption that a user clicked on a perfectly matching item after entering a search term is, of course, only an assumption. There is no guarantee of any correlation between the search term and the documents the user interacted with afterwards.

So we need to apply some techniques to filter out noise. The probability that there is a real correlation between a search term and the documents users click on increases with the number of users who behave the same way. So we can filter with simple heuristics such as a minimum frequency: every correlation that occurred fewer than 3 times will not be considered. Now let’s analyze our daily web analytics index to extract the useful feedback.
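As a quick illustration of that heuristic, here is a minimal sketch in plain Python (the event shape mirrors the web analytics documents above; the function name is made up for this demo):

```python
from collections import Counter

def filter_noise(click_events, min_count=3):
    """Keep only (query_term, doc_id) pairs that occurred at least min_count times."""
    counts = Counter((e["query_term"], e["doc_id"]) for e in click_events)
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Five identical clicks for "dog" on document 2, plus one stray click on document 1
events = [{"query_term": "dog", "doc_id": "2", "action_type": "click"}] * 5
events.append({"query_term": "dog", "doc_id": "1", "action_type": "click"})

print(filter_noise(events))  # {('dog', '2'): 5} — the stray click is filtered out
```

In practice you don’t need to do this client-side: the min_doc_count setting of a terms aggregation applies the same minimum-frequency filter on the server.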

Which search terms can we correlate with which documents?

### Every day you should run a background script that computes, for the most popular query terms, which documents match best
### Use some heuristics to filter out noise, e.g. min_doc_count
GET webanalytics_2018-12-01/_search
{
   "query":{
      "term":{
         "action_type.keyword":"click"
      }
   },
   "size":0,
   "aggs":{
      "query_terms":{
         "terms":{
            "field":"query_term.keyword",
            "size":100,
            "min_doc_count":3
         },
         "aggs":{
            "top_docs":{
               "terms":{
                  "field":"doc_id.keyword",
                  "size":3,
                  "min_doc_count":3
               }
            }
         }
      }
   }
}

Let’s build a nested terms aggregation that lists the 100 most popular query terms. We don’t consider query terms that were entered fewer than 3 times, as we don’t think they are important. For each query term we want a list of the top 3 documents users clicked on after searching for it. To make sure this did not happen by accident, we only consider documents that were clicked at least 3 times.

Response

{  
   "took":15,
   "timed_out":false,
   "_shards":{
      "total":5,
      "successful":5,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":5,
      "max_score":0.0,
      "hits":[

      ]
   },
   "aggregations":{
      "query_terms":{
         "doc_count_error_upper_bound":0,
         "sum_other_doc_count":0,
         "buckets":[
            {
               "key":"dog",
               "doc_count":5,
               "top_docs":{
                  "doc_count_error_upper_bound":0,
                  "sum_other_doc_count":0,
                  "buckets":[
                     {
                        "key":"2",
                        "doc_count":5
                     }
                  ]
               }
            }
         ]
      }
   }
}

OK, the results are no surprise: the top query term was “dog”, and the most popular document for “dog” was #2. Now let’s use that information to enrich the documents in our image index. Use your programming language of choice to automate this job.
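A sketch of what that glue job could look like in Python (the function name and the trimmed-down response are illustrative; a real script would walk all buckets and send the resulting updates, e.g. via the bulk API):

```python
def feedback_from_aggs(response):
    """Extract (doc_id, query_term, click_count) triples from the
    nested terms aggregation response."""
    updates = []
    for term_bucket in response["aggregations"]["query_terms"]["buckets"]:
        term = term_bucket["key"]
        for doc_bucket in term_bucket["top_docs"]["buckets"]:
            updates.append((doc_bucket["key"], term, doc_bucket["doc_count"]))
    return updates

# Trimmed-down version of the aggregation response shown above
response = {
    "aggregations": {
        "query_terms": {
            "buckets": [
                {"key": "dog", "doc_count": 5,
                 "top_docs": {"buckets": [{"key": "2", "doc_count": 5}]}}
            ]
        }
    }
}

print(feedback_from_aggs(response))  # [('2', 'dog', 5)]
```

Each triple then feeds exactly one _update call against the images index.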

### Use the results of that aggregation to update the documents that received some positive user feedback
POST images/doc/2/_update
{
   "script":{
      "source":"""
        ctx._source.query_terms.add(params.term);
        ctx._source.views += params.views
      """,
      "lang":"painless",
      "params":{
         "term":"dog",
         "views":5
      }
   }
}

We’re updating our two meta fields, “views” and “query_terms”. Right now we just append the query term to an array. This works for now, but we might want to extend the Painless script to make sure each term appears only once. For this demo we’re happy with appending.
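As a sketch, such a deduplicating update could guard the append so repeated runs don’t add duplicates (same document and parameters as above):

### Only append the term if it is not already in the array
POST images/doc/2/_update
{
   "script":{
      "source":"""
        if (!ctx._source.query_terms.contains(params.term)) {
          ctx._source.query_terms.add(params.term);
        }
        ctx._source.views += params.views
      """,
      "lang":"painless",
      "params":{
         "term":"dog",
         "views":5
      }
   }
}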

### Check that the document was updated correctly
GET images/doc/2
{
   "_index":"images",
   "_type":"doc",
   "_id":"2",
   "_version":2,
   "found":true,
   "_source":{
      "title":"This is a very big dog",
      "views":5,
      "query_terms":[
         "dog"
      ]
   }
}

Updating the original documents can be done on a regular basis, but please don’t do it too often. Anything from every 15 minutes to once a day or once a week is fine.

To keep the update logic simple, your update interval should match your time-based index creation frequency: if you update every day, use daily indices; if you update every hour, use hourly indices. Then your updates to the documents can always be incremental.

You can also delete the web analytics indices shortly after they have been consumed. Now comes the trick: we will rewrite our query to take the feedback our users provided into account:

### Now you can use a query that boosts the learned query terms if they are matched correctly
### You can combine the query term boosting with an overall popularity boosting 
GET images/doc/_search
{
   "query":{
      "function_score":{
         "query":{
            "bool":{
               "must":[
                  {
                     "match":{
                        "title":"dog"
                     }
                  }
               ],
               "should":[
                  {
                     "term":{
                        "query_terms.keyword":{
                           "value":"dog",
                           "boost":10
                        }
                     }
                  }
               ]
            }
         },
         "functions":[
            {
               "script_score":{
                   "script":"""
                     return _score * (1 + doc["views"].value / 100.0);
                   """
               }
            }
         ]
      }
   }
}

This query is slightly more complex than the one we had before. We’re using a bool query so we can combine different search criteria. Our original query sits inside a must clause, so it is still a mandatory requirement.

On top of that, we add an exact-matching term query on the learned query terms and give it a very high boost. That means: if a document has proven to be a good hit for this exact search term, it receives a large boost to its score.

We also collected information about the overall popularity of a document. To take this into account we use a function_score query with a scripted score. We want even a very popular document to at most double the relevance score calculated by Lucene. To achieve that we need to normalize our popularity counter; an easy way is to divide it by the theoretical maximum the field can reach. When we work with a fixed number like this, we of course need to update the script from time to time. You can modify this formula according to your needs. It is also not mandatory to use popularity as a signal to enhance your search results over time, but in many cases it doesn’t hurt.

When we run that query we see that document #2 now clearly leads the ranking, which is exactly what we wanted.
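To sanity-check the normalization, the scripted score can be mirrored in plain Python (the cap at max_views is an extra guard that the one-line Painless script does not have, for the case where the counter exceeds the assumed maximum of 100):

```python
def popularity_boost(lucene_score, views, max_views=100):
    """Mirror of the script_score: popularity can at most double the score."""
    views = min(views, max_views)  # guard against exceeding the assumed maximum
    return lucene_score * (1 + views / max_views)

print(popularity_boost(1.0, 0))    # 1.0 — no views, score unchanged
print(popularity_boost(1.0, 100))  # 2.0 — at the maximum, score is doubled
```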

Response

{  
   "took":31,
   "timed_out":false,
   "_shards":{
      "total":1,
      "successful":1,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":2,
      "max_score":3.2052352,
      "hits":[
         {
            "_index":"images",
            "_type":"doc",
            "_id":"2",
            "_score":3.2052352,
            "_source":{
               "title":"This is a very big dog",
               "views":5,
               "query_terms":[
                  "dog"
               ]
            }
         },
         {
            "_index":"images",
            "_type":"doc",
            "_id":"1",
            "_score":0.18936405,
            "_source":{
               "title":"This is not a dog",
               "views":0,
               "query_terms":[

               ]
            }
         }
      ]
   }
}

Pretty nice: we just built a self-learning search engine that takes user feedback into account and helps make the search results better over time.
