Big Data Vs. Sampling

Big data is not the first tool we've used to predict how large groups of people will behave. After all, here in the United States we have a presidential election every four years, and we love to predict that event to death. We update our predictions after every speech and practically every campaign ad. "Candidate A pulled ahead in the polls today by 3 points, indicating that Candidate B is on the ropes!"

Researchers aren't taking the time to poll every single American of voting age whenever these events occur. They're polling only 1,100 randomly selected people. With that (relatively) small set of data, they're extrapolating how the rest of the public feels and is likely to vote.

At a glance, those numbers seem ridiculous. How can anyone predict what millions of people are going to do based on 1,100 people's answers? The answer is that, if your sample is truly random, you land within roughly 3 percentage points of what the larger group would choose. Surveying more people beyond that point does little to increase the accuracy of the results.
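If you'd like to check that 3% figure yourself, here is a minimal sketch of the standard margin-of-error formula for a simple random sample. It is my own illustration, not something from the pollsters: the function name is made up, and it assumes a yes/no question, a worst-case 50/50 split, and the conventional 95% confidence level (z = 1.96).

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a simple random sample.

    Assumptions (mine, for illustration): a yes/no question,
    a worst-case 50/50 split, and a 95% confidence level (z = 1.96).
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Compare the classic 1,100-person poll with much larger samples.
for n in (1100, 4400, 11000):
    print(f"{n:>6} respondents: +/- {margin_of_error(n) * 100:.2f} points")
```

With 1,100 respondents the result comes out just under 3 points. Because the sample size sits under a square root, you have to poll four times as many people just to cut the margin in half, which is why adding respondents pays off so slowly.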

Viktor Mayer-Schönberger and Kenneth Cukier, the authors of Big Data, are careful to add caveats, of course. You have to make sure the survey questions for your sample group can be answered with a simple "yes" or "no", "true" or "false". You can't have any essay questions on there. Also, using a non-random sample of individuals can lead to greatly distorted numbers.
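To see why the random part matters so much, here is a small simulation. Everything in it is invented for illustration: a made-up population of 200,000 people split into two equal groups, "A" and "B", that answer "yes" at different rates. It compares a truly random sample of 1,100 with a sample drawn from group A alone.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: two equally sized groups with different "yes" rates.
population = []
for _ in range(100_000):
    population.append(("A", random.random() < 0.70))  # group A: ~70% say yes
for _ in range(100_000):
    population.append(("B", random.random() < 0.30))  # group B: ~30% say yes

def yes_share(people):
    """Fraction of the given (group, says_yes) records that answered yes."""
    return sum(1 for _, says_yes in people if says_yes) / len(people)

true_share = yes_share(population)

# A truly random sample draws from everyone.
random_estimate = yes_share(random.sample(population, 1100))

# A non-random sample: we only ever reach people in group A.
group_a_only = [person for person in population if person[0] == "A"]
biased_estimate = yes_share(random.sample(group_a_only, 1100))

print(f"true share of yes:      {true_share:.3f}")
print(f"random sample estimate: {random_estimate:.3f}")
print(f"biased sample estimate: {biased_estimate:.3f}")
```

The random sample should land close to the true share, while the group-A-only sample stays far off no matter how many people you add to it. A bigger sample doesn't fix a biased one.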

Nielsen Ratings

The first time I heard about sampling the habits and opinions of a few people to represent a larger group was back in college during a media class. The teacher said that Nielsen relied on this method of gathering information for all of its projections of how popular a show might be. So, if Nielsen says 5 million people watched your favorite show last night, that number is probably close to correct.

My wife and I participated in a Nielsen survey several years ago, and it was quite interesting to think how our viewing habits would be projected to such a large viewer base during analysis. It certainly gives you a lot to think about. Should I watch Archer or PBS? Well, if I want to appear more cultured, then I'm going to try for PBS.

For the week that we participated, we were very conscious of our decisions, and we even changed our behavior on occasion to tell a better story about our viewing habits.

When we collect big data, we can be less self-conscious about each point of information. For better or worse, we have the capacity to create a much more complete picture by capturing close to all of the data.

Depth of Data

Perhaps one of big data's biggest benefits over sampling is the depth of information available.

Consider the advertising process on Facebook. If you want, your business can advertise to women who have the following characteristics:

  • are in their 20s
  • are fans of Rihanna and Taylor Swift
  • live in Atlanta, GA
  • are engaged to be married

From these few selections, you know that you're targeting women who are probably involved in planning a wedding. You know their tastes in entertainment, and you have a rough idea of where they might have their wedding -- or at least where they could purchase supplies for the wedding.
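For the programmers in the audience, the selections above boil down to a filter over profile data. The sketch below is purely hypothetical: the Profile fields, the sample people, and the matches_audience function are all invented for illustration, and none of it reflects Facebook's actual advertising tools or API.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """A made-up profile record; the field names are assumptions."""
    name: str
    gender: str
    age: int
    city: str
    relationship_status: str
    likes: set = field(default_factory=set)

def matches_audience(profile):
    """True if the profile meets every targeting criterion from the list."""
    return (
        profile.gender == "female"
        and 20 <= profile.age <= 29
        and profile.city == "Atlanta, GA"
        and profile.relationship_status == "engaged"
        and {"Rihanna", "Taylor Swift"} <= profile.likes  # fan of both
    )

# Invented sample data.
profiles = [
    Profile("Dana", "female", 26, "Atlanta, GA", "engaged",
            {"Rihanna", "Taylor Swift", "Archer"}),
    Profile("Erin", "female", 24, "Atlanta, GA", "single",
            {"Rihanna", "Taylor Swift"}),
    Profile("Sam", "male", 27, "Atlanta, GA", "engaged",
            {"Taylor Swift"}),
]

audience = [p for p in profiles if matches_audience(p)]
print([p.name for p in audience])
```

Each extra criterion narrows the audience further, and that only works because the underlying data is deep enough to answer all of those questions about each person.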

That sort of information doesn't come from a snapshot of data. That is the sort of in-depth information gathered over time from people interested in having conversations with friends online.

Amazingly enough, any business or individual with a message to spread to Facebook users can target groups of people this easily. After all, this isn't a terribly expensive process.

What Do You Think?

Now that big data provides more opportunities for individuals and companies to sort through data, do you like the thought of receiving uniquely targeted messages? Does it feel like an invasion of privacy, or is it a welcome shift from mass marketing?

photo credit: Vox Efx via photopin cc