Bioinformatics to Data Mining: Adapting Protein Interactome Java Software for General Graph Visualization

Traditional financial data reporting relies heavily on spreadsheets and charts, the latter being of the bar or line type, never a scatterplot (that’s a tacit taboo: what, are you smarter than everybody else?). As data mining becomes more scientifically oriented (see Jessica Kirkpatrick’s transition from astronomy to data science, http://womeninastronomy.blogspot.com/2013/01/datascience.html), we’re seeing a departure from simplistic ways of visualizing data, i.e. tables and primitive charts, toward more sophisticated approaches (scatter plots, box-and-whisker charts, graphs).

Recently Facebook released its Graph Search, which reflects a larger trend: the graph data structure is making it into mainstream consciousness. Graphs (sets of vertices and edges) are traditionally associated with scientific and mathematical data models. You can represent a network of computer nodes or other kinds of routing as a graph (too nerdy and complicated). Rarely would you see an accountant or a financial analyst need to work with graphs. But it seems that, under the growing pressures of globalization and competition for jobs, there is increasing interest in using graphs to present data more illustratively (Edward Tufte, aha). And sometimes a graph is the only way to represent the information adequately.

Unfortunately, none of the mainstream software packages yet offers a robust graph plotting feature. Certainly Excel doesn’t, and I haven’t seen anything like this in R yet. Scientists, however, have been using graphs for a long time. So, since I come from bioinformatics, I figured: why not take some of the software developed by the bioinformatics community (for free, mind you) and adapt it to the business world (where everything has a price tag)?

In bioinformatics, there is the study of the proteome: all the proteins in the cytosol (and nucleus) form one large set with multiple interactions (the notion of interaction varies; it could be activation of one protein by another, etc.). This lends itself naturally to a graph representation. Thus bioinformaticians (computer scientists converted to biologists) like to visualize proteins as vertices and interactions as edges. Some proteins will form clusters, and that’s the value added by graphing them in this form (clustering could be discerned from the raw data, but not in a straightforward or convincing way).

So, onto the real business world then? Well, let’s say you have a set of clients and you’re looking at their weekly spend for a particular year. You sort them in descending order of total volume for the year (Excel pivot) and identify the top 90% (or whatever the natural break in the data is). You then create a correlogram to see which time series correspond to which: the weekly spend of some customers can correlate strongly (positively or negatively) with the weekly spend of other customers, and these findings may be totally unintuitive. This way you can select pairs of customers whose correlation is above a certain threshold (two vertices, one for each customer, and an edge for the strong correlation). You then build out your customer network and its clusters. Easy?
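The pair-selection step can be sketched in a few lines of Java. This is only an illustration, not part of the Interactome project: the class name `CorrelationEdges`, the method names, the sample numbers, and the tab separator are all my assumptions. It also writes the absolute value of the correlation as the strength, since the input format expects a number between 0 and 1, so the sign of a negative correlation is dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the Interactome source.
class CorrelationEdges {

    // Pearson correlation coefficient of two equal-length series.
    // Returns NaN when either series is constant (zero variance).
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // One line per customer pair whose |r| reaches the threshold, in the
    // four-column layout: vertex, vertex, interaction type (1), strength.
    // The tab separator is an assumption about what the viewer accepts.
    static List<String> edges(Map<String, double[]> spend, double threshold) {
        List<String> names = new ArrayList<>(spend.keySet());
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                double r = pearson(spend.get(names.get(i)), spend.get(names.get(j)));
                // A NaN correlation fails this comparison and is skipped.
                if (Math.abs(r) >= threshold) {
                    lines.add(String.format(Locale.ROOT, "%s\t%s\t1\t%.4f",
                            names.get(i), names.get(j), Math.abs(r)));
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up weekly spend for three customers, four weeks each.
        Map<String, double[]> spend = new LinkedHashMap<>();
        spend.put("A", new double[] {1, 2, 3, 4});
        spend.put("B", new double[] {2, 4, 6, 8});
        spend.put("C", new double[] {1, -1, -1, 1});
        for (String line : edges(spend, 0.9)) {
            System.out.println(line);
        }
    }
}
```

Redirect the output to a file and you have a candidate input for the viewer. If the sign of the correlation matters to you, one option is to encode it in the interaction type column, but I have not checked how the viewer treats values other than 1.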

The question is, what software will let you do this quickly, with no fees, installation hiccups, technical manuals, all that gunk? Below is a zip file (I gave it a non-zip extension to upload it) which contains the source code for the Java project. You’ll need the Java JDK installed, that’s for sure. Compile the source code. The input file “out.txt” should be ASCII (text) with four columns:

Name of one vertex

Name of another vertex

Interaction type (just put 1 for now)

Interaction strength (between 0 and 1)

 

Here is an example:

out.txt

A    B    1     0.9964
A    C    1     0.9896
A    D    1     0.9878
E    D    1     0.9848
F    G    1     0.9838
G    H    1     0.9836
H    F    1     0.9800
F    I    1     0.9795
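If you want to check which clusters to expect before looking at the picture, the same four-column lines can be read into an adjacency map and split into connected components. Again, this is a minimal sketch under my own assumptions (the class `EdgeListClusters` is hypothetical and not part of the zip file, and I assume the columns are separated by whitespace); the viewer itself may parse the file differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the Interactome source.
class EdgeListClusters {

    // Parse lines of the form "vertex vertex type strength" into an
    // undirected adjacency map. Whitespace separation is an assumption.
    static Map<String, Set<String>> adjacency(List<String> lines) {
        Map<String, Set<String>> adj = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] f = trimmed.split("\\s+");
            if (f.length != 4) {
                throw new IllegalArgumentException("expected 4 columns: " + line);
            }
            double strength = Double.parseDouble(f[3]);
            if (strength < 0 || strength > 1) {
                throw new IllegalArgumentException("strength outside [0, 1]: " + line);
            }
            adj.computeIfAbsent(f[0], k -> new TreeSet<>()).add(f[1]);
            adj.computeIfAbsent(f[1], k -> new TreeSet<>()).add(f[0]);
        }
        return adj;
    }

    // Connected components found by breadth-first search.
    static List<Set<String>> clusters(Map<String, Set<String>> adj) {
        List<Set<String>> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String start : adj.keySet()) {
            if (!seen.add(start)) {
                continue; // already assigned to an earlier cluster
            }
            Set<String> component = new TreeSet<>();
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                String v = queue.poll();
                component.add(v);
                for (String w : adj.get(v)) {
                    if (seen.add(w)) {
                        queue.add(w);
                    }
                }
            }
            result.add(component);
        }
        return result;
    }

    public static void main(String[] args) {
        // The example rows from out.txt above.
        List<String> lines = Arrays.asList(
                "A B 1 0.9964", "A C 1 0.9896", "A D 1 0.9878", "E D 1 0.9848",
                "F G 1 0.9838", "G H 1 0.9836", "H F 1 0.9800", "F I 1 0.9795");
        for (Set<String> cluster : clusters(adjacency(lines))) {
            System.out.println(cluster);
        }
        // prints [A, B, C, D, E] and then [F, G, H, I]
    }
}
```

For the example file this gives two clusters, one around A and D and one around F, which is what you should then see as two separate groups of vertices.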

 

That’s that. Simply compile and start the program, then enjoy the view:

javac Interactome\ViewerApplet.java

java Interactome.ViewerApplet

[Figure: Interactome_example. Using bioinformatics graph software (interactome protein visualization) for data mining]

The graph above is interactive: you can grab vertices, drag them, move entire clusters, highlight specific clusters, etc. Real fun with Java graphs at no cost. The source code is so simple that you can probably translate it to C# or VB.NET in no time. I’ll send you $10 if you can tell me where else this software appears.

Nice, huh? Let me know your reaction.

Interactome.zip


~ by Monsi.Terdex on March 29, 2013.
