The Truth is Out There: A Prelude (I) – Analysis of NYC Temperature to ‘See’ Global Warming Through the Data

This blog has not featured many real-life datasets so far, and for a good reason (see my lyrical digression posted before). Basically, I’m reluctant to put out actual data recorded from real-life phenomena because the question is always: so what? Who really knows whether such a dataset conceals meaningful patterns at all? And why should you, dear reader, spend hours and hours of your leisure hoping to find a pattern where none exists?

Ultimately, though, the arsenal developed on mock problems should be used to dissect real-life data. So this time I’ll make an exception, suspend disbelief, and accept a new maxim: ‘The truth is out there’. Specifically, it’s buried somewhere in the attached file, which is a compilation of atmospheric readings made by New York City’s Central Park weather station. I compiled the free data from year 2000 (URL below) to 2012.

First impressions? It’s cyclical. As you would expect, the temperature varies in a familiar fashion corresponding to the seasons. My naive hope is to detect signs of global climate change (bias)… Is it even reasonable? Well, fitting a linear regression to it gives a mildly positive slope, perhaps just enough to keep the hope of a ‘looming discovery’ alive? He he…
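For anyone who wants to redo the slope check outside a spreadsheet, here is a minimal sketch of an ordinary least-squares fit in plain Python. The function names (`ols_fit`, `trend_per_year`) and the three sample readings are made up for illustration and are not taken from the Central Park file. The idea is to regress temperature on elapsed time in years, so the slope reads as degrees F per year.

```python
from datetime import date


def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trend_per_year(days, temps):
    """Slope of temperature against time, in degrees F per year."""
    origin = min(days)
    # Elapsed time in years, using 365.25 days as the average year length.
    years = [(d - origin).days / 365.25 for d in days]
    slope, _intercept = ols_fit(years, temps)
    return slope


# Made-up readings for illustration only (NOT the Central Park data).
sample_days = [date(2001, 1, 1), date(2002, 1, 1), date(2003, 1, 1)]
sample_temps = [50.0, 51.0, 52.0]
sample_slope = trend_per_year(sample_days, sample_temps)
print(f"{sample_slope:.2f} F per year")
```

On the real series the seasonal swing is far larger than any plausible trend, so a small positive slope on its own says very little.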

Well, let’s fantasize for a little while, shall we?


My second favorite quote by Einstein. What is my first? You tell me…

What data do we have? Here is the list of columns:

EST, Max TemperatureF, Mean TemperatureF, Min TemperatureF, Max Dew PointF, MeanDew PointF, Min DewpointF, Max Humidity,  Mean Humidity,  Min Humidity,  Max Sea Level PressureIn,  Mean Sea Level PressureIn,  Min Sea Level PressureIn,  Max VisibilityMiles,  Mean VisibilityMiles,  Min VisibilityMiles,  Max Wind SpeedMPH,  Mean Wind SpeedMPH,  Max Gust SpeedMPH, PrecipitationIn,  CloudCover,  Events,  WindDirDegrees.

EST is the calendar day on which the readings were recorded, so it acts as the key of the dataset: every other column is functionally dependent on it. The ‘mean’ columns are actually arithmetic averages of the corresponding min and max columns. Obviously we’re interested in looking at minimum and maximum temperatures, right? We would expect both to rise with time… So would the mean temperature, but it is a derived average, so why don’t we try working with the raw data first?

Ok, here is an idea: split the dataset into two equal parts (as far as the number of years is concerned) and take a look at the minimum and maximum temperature by month. The result is a scatter plot in Excel which shows points (months) in two colors: one color for the first 6 years (2001 to 2006) and the other for the next 6 years (2007 to 2012). The abscissa is the minimum temperature in a month and the ordinate is the maximum. Your reaction…
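The same split can be sketched in plain Python. Two things here are assumptions on my part: that ‘minimal temperature in a month’ means the lowest daily minimum (and likewise the highest daily maximum), and that the EST column is formatted as year-month-day. The function names and the sample rows are hypothetical; only the column names come from the list above.

```python
def monthly_extremes(rows):
    """Map (year, month) -> (lowest daily minimum, highest daily maximum)."""
    extremes = {}
    for row in rows:
        # Assumed date format: year-month-day, e.g. "2001-1-1".
        year, month, _day = (int(part) for part in row["EST"].split("-"))
        low = float(row["Min TemperatureF"])
        high = float(row["Max TemperatureF"])
        prev_low, prev_high = extremes.get((year, month), (low, high))
        extremes[(year, month)] = (min(prev_low, low), max(prev_high, high))
    return extremes


def split_by_year_group(extremes):
    """Bucket monthly (min, max) points into the two six-year groups."""
    groups = {"2001-2006": [], "2007-2012": []}
    for (year, _month), point in sorted(extremes.items()):
        if 2001 <= year <= 2006:
            groups["2001-2006"].append(point)
        elif 2007 <= year <= 2012:
            groups["2007-2012"].append(point)
    return groups


# Made-up rows for illustration only (NOT the Central Park data).
sample_rows = [
    {"EST": "2001-1-1", "Min TemperatureF": "20", "Max TemperatureF": "35"},
    {"EST": "2001-1-2", "Min TemperatureF": "25", "Max TemperatureF": "40"},
    {"EST": "2007-7-4", "Min TemperatureF": "65", "Max TemperatureF": "88"},
]
sample_groups = split_by_year_group(monthly_extremes(sample_rows))
print(sample_groups)
```

With the real file, the rows could come from `csv.DictReader(f, skipinitialspace=True)`; each group then becomes one colored series of (min, max) points on the scatter plot.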

There seem to be two natural breaks in the data: one at around 32F and the other at around 47F. While there probably is a physical basis for the first one, I’m not sure about the second. As far as the physics of it goes, I’m pretty sure that the structural (phase) change which water undergoes around 32F is the reason why minimum and maximum temperatures just don’t seem to stick around that point. If you know of a good physics-based explanation, please send it to me by e-mail (see ‘About’ page) and I will post it here. Moving on…

Here is a somewhat imaginative approach: split the months into two groups, those with minimum temperature below 32F and those above 32F. Since the number of months in each year group is the same (72), you would expect the counts to be comparable across the year groups, right? Not so… The count of months with minimum temperature below 32F is 59 in total: 31 for years 2001 to 2006 and 28 for years 2007 to 2012. The proportions are 53% and 47% respectively. On the other hand, the count of months with minimum temperature above 32F is 85: 41 for the 2001-2006 group and 44 for the 2007-2012 group. Now, looking within each year group, we see that 2001 to 2006 has a 57% to 43% split between the above-32F and below-32F months, and this split is more pronounced for the 2007-2012 group: 61% to 39%. So, on the basis of all this, it seems like the months with higher minimum temperature are sort of more likely in the 2007-2012 group? Ok, bias bias, moving on… (to be continued in the next post.)
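The percentages above are easy to mix up because they are taken along two different directions of the same 2x2 table. As a small sketch, the arithmetic can be reproduced from the four counts quoted in the text; the dictionary layout and function names are mine, chosen only for illustration.

```python
# Month counts quoted in the text: year group -> temperature class -> count.
counts = {
    "2001-2006": {"below_32F": 31, "above_32F": 41},
    "2007-2012": {"below_32F": 28, "above_32F": 44},
}


def shares_within_year_group(table):
    """Share of each temperature class inside each year group (rows sum to 1)."""
    shares = {}
    for group, classes in table.items():
        total = sum(classes.values())
        shares[group] = {name: count / total for name, count in classes.items()}
    return shares


def shares_within_class(table):
    """Share of each year group inside each temperature class."""
    class_totals = {}
    for classes in table.values():
        for name, count in classes.items():
            class_totals[name] = class_totals.get(name, 0) + count
    return {
        group: {name: count / class_totals[name] for name, count in classes.items()}
        for group, classes in table.items()
    }


for group, shares in shares_within_year_group(counts).items():
    print(group, {name: f"{share:.0%}" for name, share in shares.items()})
for group, shares in shares_within_class(counts).items():
    print(group, {name: f"{share:.0%}" for name, share in shares.items()})
```

Keep in mind that the whole shift amounts to three months moving from one cell to the other, and nothing here tests whether that difference is statistically significant.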


The joyful longitudinal view of NYC daily temperature from 2000 to 2012 (year end): may we hope to prove ‘global warming’ from this dataset? It would be so rapturous…


Trying to discern insight? Hoping for a ‘wow’ moment? You’re close…

Counts of months by two factors: year group and minimal temperature (below 32F or above)


~ by Monsi.Terdex on February 22, 2013.
