
Reimplementing a broken search engine

Enzo Díaz
Nov 30th, 2023

Enabling users to search and explore your data in a way that is relevant to them can be a challenge, especially if the data was never processed or normalized. That was the case when we worked on improving the search features of this marketplace for boats.

Last year we started working with this client to modernize their platform, add new features, and improve its performance. The search engine was one of the pain points we needed to address at some point. It was implemented with Apache Solr and presented several concerns:

  • They were running a self-hosted instance of Solr that, to make things more complicated, ran on the same server as the Ruby on Rails application. More often than not, the Solr process overloaded the server and disconnected. This left the developers in charge of maintaining it and making sure it didn’t prevent users from searching for boats.
  • The Solr version in use was very old, and some of its features had been deprecated a long time ago, exposing the system to security vulnerabilities and errors.
  • The existing implementation was not taking advantage of essential search features like fuzzy matching or full-text search.

We wanted to make sure we addressed these issues while also providing a more modern search experience, so we opted to re-implement the search engine from scratch using Elasticsearch, for several reasons:

  • We needed a managed solution instead of a self-hosted one, and Amazon has offered Elasticsearch as a managed service for many years.
  • Updating Solr to a newer, more stable version meant adapting a large part of the code, if not redoing everything.
  • Part of our team was familiar with Elasticsearch and had used it in other projects.

The starting point

In this project, the information comes from different sources: third-party APIs, partner websites, and direct user submissions. Over time, and compounded by a rather legacy codebase, this led to duplicate records. For instance, the same boat manufacturer may be added multiple times with slight variations in the name.

If you want to show boats for a certain manufacturer, say “Beneteau”, you want boats that belong to misspelled versions of Beneteau to be included in the results. You can’t do this with your regular SQL queries, although there are workarounds.

A potential solution is to merge duplications and associate incoming data with existing records before attempting to create new ones. However, it is important to acknowledge that, given the nature of the project, there will always be a margin of error or duplication.

Why are duplicated manufacturers a problem?

Let’s find out with an example. The application has an extensive catalog of boat manufacturers, and all of its content was either imported from third parties or manually entered. You want to check how many “Sea Ray” boats are available in your listings. But, since the sources differ so much from each other, you discover that instead of a single “Sea Ray”, your table contains these values:

  • Sea Ray
  • sea-ray
  • Sea/Ray
  • SeaRay boats
  • ray sea
  • and so on…

When a visitor types “sea ray” in our site’s search bar, we want them to match the original manufacturer name and any variation of it, so they can retrieve every product for that brand, not just the ones that match exactly.

To achieve that, we can use a combination of indexing and matching features that Elasticsearch provides.

Setting up the index

At index time, we want our records to be normalized and saved as cleanly as possible. We will create a custom analyzer and leverage Elasticsearch filters to tokenize manufacturer names. First, we create the boat_analyzer:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "boat_analyzer": {
            "filter": ["lowercase", "asciifolding", "boat_stop"],
            "char_filter": ["ampersand"],
            "type": "custom",
            "tokenizer": "standard"
          }
        },
        "char_filter": {
          "ampersand": {
            "type": "mapping",
            "mappings": ["& => and"]
          }
        },
        "filter": {
          "boat_stop": {
            "type": "stop",
            "stopwords": ["boats", "new", "offer"]
          }
        }
      }
    }
  }
}

Let’s dive deeper into what’s going on here:

  • We’re creating a boat_stop filter. It will ditch words that we don’t need to match. For instance, in “NEW FooBar boats” we can ignore the “new” and “boats” parts.
  • The analyzer references an ampersand char filter, which must be defined alongside it. A typical definition, shown in the settings above, is a mapping char filter that replaces “&” with “and” before tokenization, so “Sea&Ray” is processed like “Sea and Ray”.
  • We’re relying on the standard tokenizer, which is Elasticsearch’s default tokenizer and is enough for our use case. It ensures that hyphens, parentheses, slashes, and other such characters are removed from the field, e.g. converting “Foo—bar/” to “Foo bar”. It also provides grammar-based tokenization.
  • By applying the lowercase and asciifolding filters, capital letters get down-cased and non-ASCII characters are converted to their ASCII equivalents. By default, the standard analyzer (not to be confused with its tokenizer counterpart) includes down-casing, but since we’re recreating it and adding new things on top of it, we must add lowercase to the list.

Mapping the fields

Now that we have created our analyzer, we can map our fields and make use of it:

{
  "mappings": {
    "properties": {
      "manufacturer": {
        "type": "keyword",
        "fields": {
          "analyzed": {
            "index_phrases": true,
            "type": "text",
            "analyzer": "boat_analyzer"
          }
        }
      }
    }
  }
}

This snippet sets up two fields for our manufacturers: an analyzed one (where we use our custom boat_analyzer) and a keyword one (for exact matches, a.k.a. term matches, in case we ever need it). The text-type field is the one we are going to use for full-text searches.
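For illustration, an exact match against the keyword field would be a simple term query. This is just a sketch, with “Sea Ray” standing in for whatever raw value was indexed:

{
  "query": {
    "term": {
      "manufacturer": "Sea Ray"
    }
  }
}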

To create the index, we simply use the Create Index API. That is, we set up our index and give it a name by sending a curl request to wherever the Elasticsearch instance lives, like this:

curl -X PUT "elasticsearchinstance.com:9200/boats_index?pretty" -H 'Content-Type: application/json' -d '<OUR_SETTINGS_AS_DESCRIBED_ABOVE>'

Once created, we should add our data to the index. The native way to do this is to send a POST request to /boats_index/_create/<_id> for each boat we want to add. If you’re using a wrapper gem like Searchkick, this is a lot easier, since the gem handles it for you: we just run Boat.reindex and all our boats get indexed.
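As a minimal sketch (the id and document body here are hypothetical), indexing a single boat by hand would look like this:

curl -X POST "elasticsearchinstance.com:9200/boats_index/_create/1?pretty" -H 'Content-Type: application/json' -d '{"manufacturer": "Sea Ray"}'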

If we’re curious about the outcome of our settings, we can use the Analyze API to check which tokens are effectively generated. The Analyze API is useful when we want to test strings and make sure our analyzer is working as expected.

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/boats_index/_analyze' -d '{"text": "fOo /-@Bar", "analyzer": "boat_analyzer"}' | json_pp

# [foo, bar]

Here we are asking Elastic to use the boat_analyzer we previously created and to tokenize the “fOo /-@Bar” string. As a result, it returns the normalized tokens “foo” and “bar”. These tokens are generated for debugging purposes and are not saved.

Querying

All our data is now normalized: we no longer have “sea/ray”, “seA-Ray” or “SEA/-@ray” in our index. All these variations were converted to “sea ray”. But now new questions arise:

  • What if the user doesn’t know how to properly write the name? Or worse…
  • What if our database still has misspellings, e.g. “seaa ray”?

Elasticsearch is well-equipped for the task.

The next step is to create a query that can handle cases like the ones mentioned above. This should be a good starting point:

Note: We will not delve into the specific query filters, match parameters, or complete matching strategies in this article. The goal is to briefly show you how to set up brand name matching for one field.

{
  "query": {
    "bool": {
      "must": [
        "mandatory conditions"
      ],
      "should": [
        "optional conditions"
      ],
      "filter": []
    }
  }
}

We will place the query for the manufacturer in the “must” array, so Elastic knows that, for a document to be selected, it has to match that field. We could create our query like this:

{
  "match": {
    "manufacturer.analyzed": {
      "query": manufacturer,
      "analyzer": "boat_analyzer"
    }
  }
}

This query is fine if we want the engine to return a broad set of results where any of the terms in the query matches, but it has some caveats:

  • Documents containing “Bar Foo” and “Foo Bar” will have the same score, hence “Bar Foo” could be wrongly positioned at the top of the list.
  • Searching for “Foo Bar” will also return “One Bar” and “Foo Two”.
  • We still have not solved the initial problem (“What if the user types it wrong?”).

Let’s fix this, one problem at a time.

Documents containing “Bar Foo” and “Foo Bar” will have the same score, hence “Bar Foo” will be wrongly positioned at the top of the list.

To address this, we can improve our query and split it into two, inside a should clause:

{
  "bool": {
    "should": [
      {
        "match_phrase": {
          "manufacturer.analyzed": {
            "query": manufacturer,
            "analyzer": "boat_analyzer"
          }
        }
      },
      {
        "match": {
          "manufacturer.analyzed": {
            "query": manufacturer,
            "analyzer": "boat_analyzer"
          }
        }
      }
    ]
  }
}

“should” acts as a logical OR. Since it has no sibling “must” clauses, at least one of its two child conditions must be met. The match_phrase query matches documents where the terms appear in the same order as in the original text: “Foo Bar” will match “Foo Bar”, but not “Bar Foo”. A document containing “Foo Bar” will also match the second clause of the “should”, since both terms are present.

On the other hand, “Bar Foo” will only be considered a match for the second clause, and therefore receive a lower score. This achieves exactly what we want: “Foo Bar” gets a higher score and is positioned at the top of the list.

Searching for “Foo Bar” will also return “One Bar” and “Foo Two”

A simple solution to this is to set the operator parameter in the match query to “AND”. It is “OR” by default.

{
  "match": {
    "manufacturer.analyzed": {
      "query": manufacturer,
      "analyzer": "boat_analyzer",
      "operator": "AND"
    }
  }
}

This means that all terms must be present in the field for Elastic to consider it a match.

We still have not solved the initial problem (“What if the users type it wrong?”)

To fix this, we can make use of the fuzziness feature, which adds a tolerance of 0-2 spelling errors per term. Elastic calculates the similarity between terms using the Damerau–Levenshtein distance, which counts the number of edits (replacing, removing, inserting, or transposing characters) a term needs to become another string, e.g. “SeaRay” needs 2 edits to turn into “Scarab” (replacing the “e” and the “y”). Allowed values are 0, 1, 2, and AUTO.

{
  "match": {
    "manufacturer.analyzed": {
      "query": manufacturer,
      "analyzer": "boat_analyzer",
      "operator": "AND",
      "fuzziness": "AUTO:3,7"
    }
  }
}

Here, we’re telling Elastic that:

  1. Terms with lengths 0-2 must match exactly.
  2. Terms with lengths 3-6 may have 1 edit.
  3. Terms with lengths equal to or greater than 7 may have 2 edits.

With this configuration, a user searching for “fooe barr” will still get the properly written “Foo Bar” boats.
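Putting it all together, a search request combining the snippets above might look like this (a sketch, with “sea ray” standing in for the user’s input):

curl -X GET "elasticsearchinstance.com:9200/boats_index/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "manufacturer.analyzed": {
              "query": "sea ray",
              "analyzer": "boat_analyzer"
            }
          }
        },
        {
          "match": {
            "manufacturer.analyzed": {
              "query": "sea ray",
              "analyzer": "boat_analyzer",
              "operator": "AND",
              "fuzziness": "AUTO:3,7"
            }
          }
        }
      ]
    }
  }
}'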

Next steps

Tuning your search engine according to your needs is a process that requires patience and a lot of experimenting. There is always something more to do or improve.

I hope this helped you understand the basics of Elasticsearch and gave you some insight into what’s happening behind the scenes of that search bar you use every day.

I also encourage you to try out Elasticsearch, as it is a very powerful tool that can enhance your website’s UX significantly. Hope you enjoyed it!