## Just how big IS big data?

In our industry we generate lots and lots of data, but the casual way in which we refer to quite huge data volumes often shows that many people don’t fully appreciate what those volumes really mean.

As humans we aren’t actually that good at conceptualising large numbers. Most people intuitively know that a million is pretty big but don’t really understand just how big it is. As we start talking about billions or trillions then the labels become more and more abstract. Add tech terms like terabyte and petabyte and it can start to feel like arcana.

A frequent technique to help here is to relate the data size to something more comprehensible. So, for example, we hear that a certain size of data is so many times larger than every book stored in the U.S. Library of Congress. Or we are told how many miles high the stack would be if the data were written to DVDs. Unfortunately most of us don’t own a Library of Congress or an exceptionally comprehensive DVD catalogue with which to compare, so these analogies remain a little abstract.

So let’s look at some big numbers related to data and see if we can express them in terms more applicable to the daily life of most people in our industry.

Spreadsheets: everyone loves spreadsheets, or at least everyone uses them and most don’t constantly scream from the pain. So let’s say someone asked for a data set comprising 1 billion rows of data. Just how manageable would that be?

Looking at our impressions logs I see that 1 billion records would amount to approximately one terabyte of storage, or roughly 1,000 bytes per record. So the first problem would be just storing the data on, say, your laptop hard drive; most are still around the 0.5-0.75 terabyte mark. But it isn’t that unusual to request a larger drive in a new laptop, so we can get a laptop with a 1.5 terabyte hard drive that can hold our data. For this one report. And not much else.

We will have a problem in that our laptop will likely only have around 8 gigabytes of memory, so at any single point in time it will only be able to work with around 0.8% of the data. Let’s hope the reports we want to run aren’t any more complicated than taking simple sums across the data set.
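For anyone who wants to check that 0.8% figure, here is a minimal sketch of the arithmetic. It assumes decimal units (1 terabyte = 10^12 bytes, 1 gigabyte = 10^9 bytes) and, generously, that every byte of memory is free for our data; the function name is just for illustration.

```python
# Assumptions: decimal units, and all 8 GB of memory available for data
# (in practice the operating system and the spreadsheet need some too).
DATASET_BYTES = 1 * 10**12  # approximately one terabyte of impression logs
MEMORY_BYTES = 8 * 10**9    # around 8 gigabytes of laptop memory


def fraction_in_memory(dataset_bytes: int, memory_bytes: int) -> float:
    """Return the share of the data set that fits in memory at once (capped at 1.0)."""
    return min(1.0, memory_bytes / dataset_bytes)


fraction = fraction_in_memory(DATASET_BYTES, MEMORY_BYTES)
print(f"{fraction:.1%} of the data fits in memory at any one time")
```

Swap in your own laptop's numbers and the picture doesn't change much: even 32 gigabytes of memory only gets you to about 3%.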

But we need to get the data into memory in the first place, so let’s read the spreadsheet from the hard drive. Because we went for a larger hard drive it’s probably a little slow, but let’s be (very) charitable and assume we can read data off the disk at 100 megabytes per second, which means we’ll read the data in about 3 hours. But this is an important report, so we can be patient.
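The read time is simple division, sketched below under the same charitable assumptions: a sustained 100 megabytes per second, decimal units, and no other overhead at all.

```python
# Assumptions: decimal units and a constant, sustained read speed.
DATASET_BYTES = 1 * 10**12         # approximately one terabyte
READ_BYTES_PER_SECOND = 100 * 10**6  # 100 megabytes per second

SECONDS_PER_HOUR = 3600


def read_time_hours(dataset_bytes: int, bytes_per_second: int) -> float:
    """Return how long one full sequential pass over the data takes, in hours."""
    return dataset_bytes / bytes_per_second / SECONDS_PER_HOUR


hours = read_time_hours(DATASET_BYTES, READ_BYTES_PER_SECOND)
print(f"One pass over the data takes about {hours:.1f} hours")
```

Note that this is the cost of a single pass. Any report that needs to look at the data twice pays it twice.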

Plus maybe we don’t want to just have our computer scan through the data. It’s a spreadsheet after all. Let’s think about it in terms of the rows. We have 1 billion rows of data and let’s say we can see 50 rows on our screen at any time. Let’s also assume that we can click through 2 screens of data each second. That means that if we don’t take any breaks and plough ahead we can scan through the data set in just over 115 days. Yes, days. I’m not sure we can be quite that patient.
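If 115 days sounds implausible, the sum is easy to reproduce. This sketch uses only the figures above (50 rows per screen, 2 screens per second, no breaks); the function name is purely illustrative.

```python
ROWS = 1_000_000_000    # 1 billion rows
ROWS_PER_SCREEN = 50    # rows visible at any time
SCREENS_PER_SECOND = 2  # how fast we can click through

SECONDS_PER_DAY = 24 * 60 * 60


def scroll_time_days(rows: int, rows_per_screen: int, screens_per_second: float) -> float:
    """Return the days needed to page through every row without stopping."""
    screens = rows / rows_per_screen
    seconds = screens / screens_per_second
    return seconds / SECONDS_PER_DAY


days = scroll_time_days(ROWS, ROWS_PER_SCREEN, SCREENS_PER_SECOND)
print(f"Paging through every row takes about {days:.1f} days")
```

That is 20 million screens and 10 million seconds of clicking, and it assumes we never sleep.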

This example is of course contrived (and a 1 terabyte spreadsheet would never be practical), but it highlights a few points. With smaller data sets, just asking for data that can be sucked into a spreadsheet makes sense. At large scales, the important thing is to express the need in terms of the problem you are trying to solve or the question into which you wish to gain insight.

If you ask your data team (you do have one, right?) for a dataset that equates to a billion rows (or any other unreasonably large number), you are asking the wrong question. If they give it to you then you have the wrong data team. Instead it needs to be a collaborative effort where you work together to determine the things you are trying to learn, and the data team then work to use the smartest techniques to produce the smallest data set that gives you that answer. Sometimes that will require simple aggregation; other times it will get a little fuzzier as you need to start making decisions based not on the absolute sum of a huge data set but instead on a statistically selected sample.
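To make the sampling idea a little more concrete, here is a rough sketch using nothing but the Python standard library. Everything in it is an assumption for illustration: the "log" is 200,000 made-up numbers rather than real impression data, and a simple random sample is only one of several ways a data team might choose to sample.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for a large log: 200,000 made-up values from 0 to 100.
population = [rng.randint(0, 100) for _ in range(200_000)]

# The "absolute" answer: scan every single row.
true_mean = statistics.fmean(population)

# The sampled answer: look at a simple random sample of 5% of the rows.
sample = rng.sample(population, 10_000)
sample_mean = statistics.fmean(sample)

relative_error = abs(sample_mean - true_mean) / true_mean
print(f"Full scan mean:   {true_mean:.2f}")
print(f"Sample mean:      {sample_mean:.2f}")
print(f"Relative error:   {relative_error:.2%}")
```

The point is not the exact numbers but the trade: the sample touches a twentieth of the rows and, for well-behaved data like this, lands very close to the full answer. Whether that is close enough depends on the decision you are trying to make, which is exactly the conversation to have with your data team.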

Data is an asset but, unlike other assets, simply collecting more of it does not automatically add proportionally more value. As the volumes grow, so must the sophistication of the approaches and tools used to extract meaningful business insights from the vast quantities of numbers and letters.