Address Record Clustering — Fuzzy Comparison with Multithreaded C++

The issue of fuzzy matching business records has been considered in prior posts already — thanks for warm comments from all the readers. This time, will develop a multithreaded C++ program (no GUI, just command prompt), which will scan through a whole list of address records and produce output file containing id of the record and id of the cluster the record belongs to. If the record does not have any cluster assigned to it (i.e. it is a stand alone address) — will not show the record in the output.

Why would such clustering be needed? Suppose you’re tasked with identifying all clients living in a particular address. Will take address to be a tuple of (Street, City, State, Zip Code). You’re given a list of client-address pairs. Simplistic approach would be to count the number of clients by address. But same address can be spelled differently — thus the solution is map each address to an id of the cluster it belongs to, and group on that id when doing the count.

Fuzzy matching is done in the following way. Every address is compared with every other address in the file — if we have n addresses in total, will have n * (n-1) / 2 comparisons. Only street address of the record is considered, no city, postal code etc (easily modified). The address is stripped of punctuation marks and white space. Digits are extracted into one separate part, in the order the appear in the actual address, the rest (i.e. letters) are put into their own part. If numerical parts between two addresses differ, we automatically qualify the addresses as distinct. If the numerical parts are identical but address (string) parts are different, we compare using the following algorithm. Take the length of longest common subsequence (LCS) and divide it by the average of l1 and l2 (lengths of the address strings). The score is thus guaranteed to be between 0 and 1 — the higher the score, the more similarity.

LCS algorithm is pretty demanding in terms of computational resources and it will consume much of program’s runtime. The idea is to use multi-threading capabilities offered by C++11 standard (MinGW Posix compiler for Windows). At a very high level, each address pair can be compared by a designated processor, so the program is embarrassingly parallel. No pair of addresses is compared more than once. Expected speedup is linear in the count of processors utilized — you obviously should not exceed the count of threads your processor can actually support. The highest number for desktop systems would be eight (8).

The difficulties are at the clustering step.

Suppose a1 and a2 are two address records which are determined to be similar (so fuzzy matching score > threshold value). If both a1 and a2 do not have clusters assigned to them, will generate new surrogate id for cluster and assign a1 and a2 to this new cluster. If a1 has been assigned to cluster but a2 was not, will assign a2 to a1’s cluster — the reverse is easy.

The hardest part is having both a1 and a2 already belonging to clusters c1 and c2. Then, we have cluster fusion, i.e. will pick either c1 or c2 and substitute instead of the other. If we pick c1, all of c2 will become c1.

The solution is straightforward for a sequential program. But if you’re trying to parallelize it, it is pretty difficult to do it in the right way — will have to cheat. Here is why it’s difficult. Suppose a1 and a2 are being worked on by processor p1 and a3 and a4 by processor p2. Further, suppose these are the cluster assignments (a1, c1); (a2, c2); (a3, c2); (a4, c3). p1 makes decision to make c2 = c1, so every record assigned to c2 must be modified to be assigned to c1. At at the same time, p2 makes decision to make every record in c2 = c3. The problem is obvious.

One solution is to create a map from address record to cluster id and lock the map while working on it. This is the solution in this implementation. The trick here is that, most of the time, the program will actually be doing LCS, and will be locking the map only if the pair of records under comparison is qualifying for a match.

 

The code can be easily compiled on MinGW build for C++11 standard (posix version for Windows). Here is the compilation command:

g++ EntityFuzzyComparator.cpp -lpthread -std=c++11

By default, the program looks for file input_address.txt which should contain two tab delimited columns, first column is integer serving as id of the address and second column is the address itself. The output is recorded into out.txt file and composed of two columns — first is id of the address (taken from the input file), second is the id of the cluster it belongs to. Cluster ids serve no purpose other than to group the addresses.

As before, the extension on the file is ‘doc’ only for upload purposes — the actual file is C++ code.

EntityFuzzyComparator.cpp

 

Advertisements

~ by Monsi.Terdex on May 1, 2014.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Normal Boy

Nothing out of the ordinary

Data Engineering Blog

Compare different philosophies, approaches and tools for Analytics.

%d bloggers like this: