comment 0

When simple is better: The boolean similarity module

I had a lecture about Information Retrieval at university. That’s the field that studies search engines. In the first few classes we learned about the history and evolution of language models that are used for search engines.

The most basic and simple form of a language model for a search engine is the boolean model.

When you build up an inverted index, you need to collect all words that make up this index. And then you need to store the so called postings list, a list of all document ids where these words appear.

For the boolean model you just store in a table which term appears in which documents. That’s it. No more information needed.

The scoring function for the boolean model is also fairly easy. Either a term appears in a document or not. So you have 2 possible scores: 1 and 0.
There are no dependencies, no fancy math. For each term the scoring function can decide whether a document has a match or not. It becomes super fancy when you use 2 query terms. Because then the maximum score for a query and a document can be – wait for it – 2.

Obviously that didn’t lead to very good results. This was years before something as sophisticated as PageRank or Google has been invented.

People really had to think about how to find the needle in the hay. How to find relevant documents.

So the next generation of language models was born… the probabilistic model – we’re still in the 70s.

Shortly after TF-IDF was born and search was finally somewhat sophisticated. By the way, IDF was invented by a female computer scientist calledKaren Spärck Jones in 1972.

Long story short: The boolean language model for information retrieval sounds like the most basic archaic formula that has long been replaced by more advanced techniques.

Advantages of the boolean model

What could possibly be good about this simple boolean model? There are actually some scenarios where this is exactly what you want.

Let’s assume you want to use Elasticsearch but you really just need to filter by certain terms. Elasticsearch can do all the fancy analysis and stemming but after that you just want a list of all documents matching that particular query.

In that case boolean similarity is exactly what you need.

Second usecase: you want to rank your documents exactly by the number of matched query terms in a document. Nothing more. If you enter 2 query terms, the documents with 2 hits should appear before those with 1 hit but that’s it.

Let’s take the following scenario: you build a job search engine. You have lots of facets on the left side, where you can choose the role, the experience level, the salary, the size of the company, the location and so on.

Which job is a better match? The one that ranks higher because of some weird term statistics in that field or the one that matches most of your search criteria?

Well that was easy to guess.

What if you need to search logs? You want a list of all events that match a specific error message from your application. You usually don’t need a fine-grained score that looks like this: 4.5342487

5 would be fine. Your message has 5 words and they all appear in the document, so let 5 be the top score.

It is often rather cryptic to debug or understand TF-IDF based scoring. Sometimes you really just want to know how many query terms matched.

How to configure boolean similarity in Elasticsearch

PUT test
{
  "mappings": {
    "doc" : {
      "properties" : {
        "content" : {
          "type" : "text",
          "similarity" : "boolean"
        }
      }
    }
  }
}


PUT test/doc/1
{
  "content" : "It's a match!"
}
PUT test/doc/2
{
  "content" : "Great match!"
}

GET test/_search
{
  "query": {
    "match": {
      "content": "great match"
    }
  }
}

>>>>>>> Response <<<<<<<<
{
  "hits": {
    "total": 3,
    "max_score": 2,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 2,
        "_source": {
          "content": "Great match!"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "content": "It's a match!"
        }
      }
    ]
  }
}

Elasticsearch now uses BM25, a TF-IDF based similarity scoring module by default. That works ok for most usecases. But for a few either very simple usecases or those where you want the number of your query terms to be the highest possible score the boolean similarity module actually works better.

I wouldn’t go as far to say the boolean similarity module is more performant, but there is definitely less work to do at index and search time when you don’t need to calculate with floating numbers and you don’t care about term statistics.

Good luck and let me know if you have any questions!

Leave a Reply