Data Analysis Toolkit: Fuzzy String Comparison in VBA

So, analyst, you’ve heard your boss tell you – match this client using address, right? Or how about – “…well, we have their names and dates of birth, would that do?”

Ha ha, you know what that means, don’t you? Names could be misspelled, addresses may spell “Road” as “Rd” and all that good stuff? You’re up for another nightmarish data extraction and manual scroll through thousands of records, maniacal attention to every single detail and then saying that “I’m reasonably comfortable with results”, until somebody checks your crosswalk (i.e. map from original address to ‘cleaned-up’ version), only to find out that you missed half of the matches? Sweet ha?

Well, to alleviate your pain, I’ve designed some free and easily available fuzzy string comparison. I don’t know of any other solution to this problem, but I admit – I haven’t looked. If Microsoft or some other gurus have already figured this out and made it freely available in VBA – a quick no-fuss no-payment module, copy-paste the function and use it, then this post is just for your entertainment.

But I doubt such a simple solution is easily available out there. Now it is.

So, how do we go about fuzzy string comparison? Well, first we lower case everything in the string – that’s sounds reasonable enough. Then we remove all non-alpha-numerical characters. This will already take care of double spaces, extra commas, etc. Can we look at character distribution? Let’s say, you count the number each character shows up in the first string and then subtract from that the count of times the same character shows up in the second string. If both strings are equal, your total count is 0. If they’re not equal, there are differences somewhere, and your count may be negative or positive. Then, scroll through the entire array and add up the absolute value of the differences. Divide the total count of these values by the sum of lengths of both strings. Voila!

Let’s say we have two completely disparate strings: “aaaaa”, “bbbbb”. Then the frequency for the first string is (5, 0) while the frequency for the second one is (0, 5). Taking their differences we have (5, -5). Taking absolute differences and adding them up we have 10. Now divide 10 by sum of 5 and 5 to get 1. This stands for 100% dissimilarity, as expected. Subtract this dissimilarity score from 1 to get similarity score.  Neat?

Now, of course there is problem with this approach – what if I have “aaabbb” and “bbbaaa”? These are two completely different strings yet the similarity score produced by the above will be 100%. So we need to incorporate some way of looking at the actual sequence. It should be understood that this example is unlikely to be encountered in the real world data. Chances are, if your strings are actually different, their frequency table will be different as well.

The simplest way I see is to scan both strings left to right and increment the counter if characters are equal. If one string is shorter than another, scan until the end of the shorter string. Then divide the count by the length of the longer string. For example, a = “aaaa” and b = “aaaabbb”. The count is 4 divided by 7 or 0.571428.

Now, I know this is not perfect – but in the business of fuzzy matching – nothing is (hmm, ok Mr. Analyst, so how can we judge your work?). No one can claim a perfect fuzzy matching algorithm, so you’ll have to live (and suffer) through the consequences of using whatever algorithm you decide on. In this case, I’m weighing both scores – the score for frequency match receives a higher weight. This produces “relatively good” results for me.

You can easily use the function fuzzyCompare in the module below for your Excel or Access – the import is trivial. I’ve used it to match clients on id and then picking out the maximal match score >= 0.9 on very imperfect addresses – worked for me, then again, nobody scrutinized my data. As before, the file below is given a doc extension to enable upload – the actual extension is bas for Visual Basic – this is extension recognized by VB importer in Excel and Access. At some point, I plan to work on longest common subsequence algorithm and incorporate it into this fuzzy match.

FuzzyCompareModule.bas

Advertisements

~ by Monsi.Terdex on April 19, 2013.

5 Responses to “Data Analysis Toolkit: Fuzzy String Comparison in VBA”

  1. That worked wonderfully for our in-house client/supplier company creation form. We have duplicates appear regularly. This should prevent that. Credits in my code…

  2. Hi Guys

    I am new to VBA Excel and not that much equipped with the use of the function. I might be the first one who gonna ask you this question. I am looking for a macro with help me compare the fuzzy match from 2 worksheets. I downloaded your code and use it in the module of my macro. I have correct data in sheet1 and closely matched data in sheet2. I want the merged data in sheet 3. The data in sheet 3 should show the fuzzy mataches with the correct match in orderly manner. It will help me to get the clear picture

    Thanks in advance..

  3. Hello Monsi, Thanks for posting your fuzzy compare algorithm. Works great, very helpful!!! Awesome 🙂

  4. Hello, would like to know how to associate your codes with a Access database. Let’s say Table A and Table B. Table A has user name, address, City, and Zip. Table B also has user names, address, city, and zip.

    Thanks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Normal Boy

Nothing out of the ordinary

Data Engineering Blog

Compare different philosophies, approaches and tools for Analytics.

%d bloggers like this: