Data Analysis: Anisotropic Measurement of Unique Counts – Simulation in R

Counting unique values is a ubiquitous and important task in business analysis. Most of the time, you’ll be given a database to query, so deriving a count of unique values for a particular attribute (e.g. client id) is a matter of technique rather than measurement precision: in the end, you wind up with an exact value, not an approximation.

But sometimes that’s not the case, and you may have to use some kind of proxy to derive unique counts. Generally, unique counts are not additive: if I give you a summary containing a valid count of unique clients by geographic unit for a period of time, you can’t add those counts to derive the correct total count of unique clients, because one client could have been counted in two or more distinct geographic units (they moved at some point). The sum can only overshoot the true count, never undershoot it, and when few clients show up in more than one unit, adding unique counts obtained that way gets you relatively close to the true count of unique values.
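To make the non-additivity concrete, here is a minimal sketch on made-up data. The data frame `visits`, its columns, and the region names are hypothetical and not from any real report; client 2 is the one who "moved".

```r
# Hypothetical toy data: client 2 appears in both regions
visits <- data.frame(
  client = c(1, 2, 2, 3),
  region = c("east", "east", "west", "west")
)

# Unique clients per region: east has clients 1 and 2, west has 2 and 3
per_region <- aggregate(client ~ region, data = visits,
                        FUN = function(x) length(unique(x)))

sum_of_counts <- sum(per_region$client)          # 2 + 2
true_count    <- length(unique(visits$client))   # clients 1, 2, 3

print(per_region)
cat("sum of per-region counts:", sum_of_counts, "\n")
cat("true unique clients:     ", true_count, "\n")
```

The sum overshoots by exactly one, the single extra region in which client 2 was counted.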

Why jump through all these hoops to ‘measure’ the true unique count? Sometimes deriving the true unique count is not an option because the state of the databases has changed and historical data is not reproducible. Suppose some division of the business uses the breakout of unique client count by geographic unit I described above, they don’t provide the total true count of distinct clients, and you’re given a report that is 3 months old. There is no way for you to go back in time and derive the accurate count of unique clients, because the databases have all changed and historical data is not available.

Suppose, though, that there are two dimensions, each with 10 possible attribute values. You’re given the count of true unique members by value in each dimension, but not overall. Your task is to derive a close measurement of the total unique count. The idea then is to add up the unique counts from the dimension in which each member is spread across fewer distinct values (so the same member is counted again less often), to lower the error.

Why ‘anisotropic’? In science, anisotropic implies dependence on the direction of measurement. Loosely speaking, we can call the addition of unique counts to derive a total unique count an estimation, or a measurement, and the dimension of the unique count values a direction. Then, depending on which dimension I choose for the unique counts, I will have an anisotropic measurement, because my result will depend on the direction (dimension).
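The dependence on direction can be written down as a short identity. This is a sketch; the notation ($N$, $U_v$, $d_D(a)$) is introduced here only for illustration and assumes every member appears under at least one value of the chosen dimension $D$. Let $N$ be the true number of unique members, $U_v$ the set of members seen under value $v$ of $D$, and $d_D(a)$ the number of distinct values of $D$ under which member $a$ appears.

```latex
\sum_{v} |U_v| \;=\; \sum_{a} d_D(a) \;=\; N + \sum_{a} \bigl(d_D(a) - 1\bigr)
```

Each member contributes one to the sum for every value it appears under, so the overcount is $\sum_a (d_D(a) - 1) \ge 0$. It is zero only when every member sits under exactly one value, and it is smaller for whichever dimension spreads members over fewer values.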

The R simulation below illustrates what happens. There is a total of 10 unique objects, each with dimensions B and C (A is the object id). Each object gets 4 to 6 distinct values in dimension B and, for each of those, 1 to 3 values in dimension C drawn independently, so over all of its rows an object typically ends up under more distinct C values than B values. Adding up the unique counts obtained by grouping on dimension B will therefore usually yield a closer estimate of the true value of 10 than adding up the unique counts obtained by grouping on C.

r = NULL;
for (i in 1:10) {
  # how many distinct B values this object will have (4 to 6)
  j = 3 + sample(1:3, 1);
  js = sample(1:10, j, replace=FALSE);
  for (k in 1:j) {
    # how many distinct C values go with this B value (1 to 3)
    m = sample(1:3, 1);
    ms = sample(1:10, m, replace=FALSE);
    for (l in 1:m) {
      r = rbind(r, c(i, js[k], ms[l]));
    }
  }
}
# rbind() builds a matrix; aggregate() with a formula needs a named data frame
r = as.data.frame(r);
names(r) = c('A', 'B', 'C');
# writing to 'clipboard' works in R on Windows; each call overwrites the
# previous contents, so paste the result somewhere before running the next line
write.table(r, 'clipboard', sep='\t');
write.table(aggregate(A ~ B + C, data=r, FUN=function(x) length(unique(x))), 'clipboard', sep='\t');
write.table(aggregate(A ~ B, data=r, FUN=function(x) length(unique(x))), 'clipboard', sep='\t');
write.table(aggregate(A ~ C, data=r, FUN=function(x) length(unique(x))), 'clipboard', sep='\t');
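To compare the two directions without pasting tables into a spreadsheet, the sums can be computed directly in R. The sketch below re-implements the same sampling scheme in a self-contained form. The helper names `simulate_objects` and `count_unique` and the seed are assumptions made for illustration, not part of the original script.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable

# Same sampling scheme as the simulation above, wrapped in a function
# (simulate_objects is a hypothetical helper name)
simulate_objects <- function(n = 10) {
  rows <- list()
  for (i in seq_len(n)) {
    js <- sample(1:10, 3 + sample(1:3, 1))   # 4 to 6 distinct B values
    for (b in js) {
      ms <- sample(1:10, sample(1:3, 1))     # 1 to 3 distinct C values
      rows[[length(rows) + 1]] <- data.frame(A = i, B = b, C = ms)
    }
  }
  do.call(rbind, rows)
}

count_unique <- function(x) length(unique(x))

r <- simulate_objects()

# True number of unique objects, and the two additive estimates
true_count <- count_unique(r$A)
sum_by_b <- sum(aggregate(A ~ B, data = r, FUN = count_unique)$A)
sum_by_c <- sum(aggregate(A ~ C, data = r, FUN = count_unique)$A)

cat("true unique objects:", true_count, "\n")
cat("sum of counts by B: ", sum_by_b, "\n")
cat("sum of counts by C: ", sum_by_c, "\n")
```

Both sums overshoot the true value of 10. The sum by B always lands between 40 and 60, because each object has 4 to 6 B values, while the sum by C depends on how many distinct C values each object happened to collect across its rows. Rerunning with different seeds shows how often one direction beats the other.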

~ by Monsi.Terdex on April 1, 2014.
