# Data correction with fuzzy string matching

We are all familiar with Google’s “Did you mean” correction for misspelt search terms.

But how does it work? There is a great chapter by Norvig in the book ‘Beautiful Data: The Stories Behind Elegant Data Solutions’ that discusses and implements a basic spelling corrector in only a few lines of Python code. The chapter is available here on Norvig’s site.

At the heart of any solution is a sense of distance, or similarity, between words (strings in computer code). One of the earliest, and most famous, word distance functions is the so-called Levenshtein distance (Vladimir Levenshtein, 1965), where the distance between two words is defined as the minimum number of character edits needed to convert one string into the other. The edits allowed are insertions, deletions and substitutions of single characters. For example:

- The distance between “clrus” and “clarus” is 1: a single insertion of the letter ‘a’ makes the words equal.
- The distance between “clxrus” and “clarus” is also 1: a single substitution of ‘a’ for ‘x’ suffices.
- The distance between “claraus” and “clarus” is also 1: a single deletion of the second ‘a’ suffices.
- The distance between “clraus” and “clarus” is 2: either two substitutions, or one deletion plus one insertion.

The combination of edits achieving the minimum need not be unique; the distance is simply the minimum number of edits required.
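The definition above translates directly into a short dynamic-programming routine. As a minimal sketch (the function name and structure are our own, not from any particular library), each cell holds the distance between a prefix of one word and a prefix of the other:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the current prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute ca -> cb (free if equal)
            ))
        previous = current
    return previous[-1]
```

Running it on the examples above gives `levenshtein("clrus", "clarus") == 1` and `levenshtein("clraus", "clarus") == 2`, as expected.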

Throwing a bone to the math folk: if you have not spotted it already, the Levenshtein distance satisfies the triangle inequality and, together with symmetry and non-negativity, forms a metric on strings.

Programmers may well be reminded of the diff utility often used in source code version control. It relies on a similar idea of the minimum number of inserts and deletes, except the units are not characters in a word but complete lines of text in a file.

We have made use of string similarity functions in several areas of data import at Clarus. One basic example, for illustration, is fuzzy matching of column names when importing data from CSV (and Excel) files.

When CSV data is produced by a program, it tends to be reliable and well formed, but if the CSV data is produced ‘on-the-fly’ by a human via Excel, there is a reasonable chance of mistakes in the column heading names, the order of the columns, and the format of the data (most notably the date format).
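One way to tolerate such header mistakes is to map each incoming heading to the closest name in a list of expected columns. A minimal sketch, using Python’s standard-library `difflib` (which scores similarity with the Ratcliff/Obershelp algorithm rather than Levenshtein distance, but illustrates the same idea); the expected column names here are hypothetical, not Clarus’s actual schema:

```python
from difflib import get_close_matches

# Hypothetical canonical column names an importer might expect.
EXPECTED = ["trade_date", "notional", "currency", "counterparty"]

def match_column(header: str, expected=EXPECTED, cutoff=0.6):
    """Map a possibly misspelt CSV heading to the closest expected
    column name, or None if nothing is similar enough."""
    candidates = get_close_matches(header.strip().lower(), expected,
                                   n=1, cutoff=cutoff)
    return candidates[0] if candidates else None
```

With this, a heading like “Notionl” resolves to `notional`, while a heading bearing no resemblance to any expected column returns `None` and can be flagged back to the user. The `cutoff` parameter controls how aggressive the matching is; too low a value risks silently mapping genuinely wrong columns.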