Catalog Search 2/22/2018

How Search Algorithms work: The Levenshtein Distance

A good product search in online stores is tolerant when it comes to typos, it finds matching products, categories and other contents even if there is no 100% match. To make this happen, there is an algorithm that compares the search term to searchable content or character strings. The difference between the search term and the character string, e.g. a product name, is measured with the Levenshtein distance.

The Levenshtein Distance explained

The Levenshtein distance describes the minimum number of character edits necessary to turn on character string into another. Possible edits are insertions, deletions and substitutions. The Levenshtein distance is named after the Russian mathematician Vladimir Levenshtein (1935-2017), who developed it in 1965.

Examples

The Levenshtein distance between "Levenshtein" and "Lewenstein" is 2.

0. Lewenstein
1. Levenstein (substitution of v for w)
2. Levenshtein (insert h)

Two identical terms, such as the homographs match (game) and match (a candidate for matrimony), have a Levenshtein distance of 0.

Levenshtein Distance and Fuzzy Search

Configurations of product search in the commerce field usually don't use the term Levenshtein distance, but offer options to set up a fuzzy search. Your way of fine-tuning here is to set the fuzzy search's sensitivity. The more sensitive it is, the smaller the tolerated Levenshtein distance that will still be recognized as a match.

For example, in IntegerNet_Solr we provided a configuration option for the fuzzy search's sensitivity. If the entered value is 1, the search's sensitivity is so high that no fuzzy search is performed. The lowest value we allow is 0. In that case the sensitivity is so low that basically any product is regarded as a match for any term. From a customer's point of view in your online store that is not helpful.
Therefore, a sensitivity between 0,7 and 0,9 is usually recommended.

The general rule is: The higher the sensitivity of your search, the lower the tolerated Levenshtein distance.
The lower the sensitivity of your search, the higher the tolerated Levenshtein distance.

Fuzzy Search in Magento

Magento's default search cannot perform a fuzzy search. It searches the product catalog exclusively for a 100% match between the search term and a character string within the product data. If there is no 100% match, an empty search results page is shown. For users of that online store, this empty page is no help.
For that reason extensions and services for an improved product search are a valuable addition to your store. They reduce the potential frustration of an insufficient catalog search, and lead customers to the products they are looking for.

Limitations of Fuzzy Search

If a search term shall return a product that has not even a remote match for the search term in its attributes, you can configure the fuzzy search to be so insensitive that it will eventually show this product as a result, too. However, with all the added irrelevant search results, the search is no longer useful.
For such a case synonyms are the right means. More about this in a separate blog post.