You don't need big data, FOOL!

Ever since journalists started writing the words "big" and "data" together, in large letters and quotation marks, everyone who owns a computer and a couple of Excel spreadsheets thinks they have big data, or that they are using big data in some way.

Working as a consultant means that my main role is to advise clients and help them understand their needs. It also means that I should guide them, and explain and simplify the complex algorithms and technologies that they hear about on the web.



I am writing this post out of the frustration that builds up every time the following conversation takes place:
Person: Hi, my name is ********, I am a ********. (usually a tech guy)
Me: Nice to meet you. What are you currently working on?
Person: I am working on a very innovative project, it is actually a big data project.
Me: Hmmm, and what kind of big data are you working with exactly?
Person: Well... I am dealing with millions and millions of rows in my database and I am using big data to analyze it.
Me: ..........
This conversation usually ends with my ears shutting down and my brain searching for a "polite" excuse to walk away.

These conversations have become very frequent, and after thinking about it for a while, I decided to write something for you. So before you start talking about big data and annoying the heck out of me and the people around you, here are some points I'd like you to fully understand:


So what is big data?

For those of you still unfamiliar with the concept, big data is a technological concept related to data, but not exclusively to its volume (the term "big" does not refer only to size).




Big data is usually associated with the size of data sets, but people tend to forget that it is also about the types and variety of the data, and its velocity.


Having "millions and millions" of records in a SQL table stored on a banged-up server bought in the 1990s is definitely going to be a very difficult thing to work with.

The volume of data is very important. Having a very large number of records in your database is a good way to track your customers or to analyze the stock market. But data redundancy is something to avoid.

Having a strategic way of eliminating duplicates and capturing only the data you actually need and know how to analyze is way better than having everything, but not knowing what to do with it.
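To make the duplicates point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The events table, its columns and the sample rows are all made up for illustration, and I am assuming that "duplicate" means a row identical on every column; your own definition may well differ.

```python
import sqlite3

# Hypothetical table and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, action TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "login", "2015-01-01"),
        (1, "login", "2015-01-01"),     # exact duplicate
        (2, "purchase", "2015-01-01"),
        (2, "purchase", "2015-01-01"),  # exact duplicate
        (3, "login", "2015-01-02"),
    ],
)


def remove_duplicates(conn):
    """Keep the first copy of each identical row, delete the rest,
    and return how many rows were removed."""
    before = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.execute(
        """
        DELETE FROM events
        WHERE rowid NOT IN (
            SELECT MIN(rowid) FROM events
            GROUP BY customer_id, action, day
        )
        """
    )
    after = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return before - after


removed = remove_duplicates(conn)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(f"removed {removed} duplicates, {remaining} rows left")
```

In a real system you would probably rather stop duplicates from getting in at all (a UNIQUE constraint, for example) than clean them up afterwards, but the idea is the same.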




Server logs, instant messages, real-time web-app interactions, etc. are sources of really fast streams of data that need to be stored and analyzed almost instantly.

How well you cope with that depends on how well you wrote your application and how well your server is running.
Using Hadoop, or whatever other big data thing everyone is talking about, won't make your slow FORTRAN code any faster. (Sorry, FORTRAN users, I couldn't think of any older programming language.)

"Companies brag about the size of their data-sets the same way fishermen brag about the size of their fish"
So here are some things you should think about doing before jumping into your next multi-node Hadoop cluster on Amazon Web Services:

Upgrade your hardware: 

  • More RAM
  • Better CPU
  • More storage space
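Before buying anything, a back-of-the-envelope estimate already tells you a lot. The sketch below is only that: the row count, the bytes per row and the 50% headroom are numbers I picked for the example, not measurements, and real tables carry index and storage overhead on top.

```python
def estimated_size_gb(rows, bytes_per_row):
    """Rough size of a table in gigabytes (1 GB = 1e9 bytes)."""
    return rows * bytes_per_row / 1e9


def fits_in_ram(rows, bytes_per_row, ram_gb, headroom=0.5):
    """True if the table fits in the given fraction of RAM.

    headroom is an assumed safety margin: with 0.5, only half of the
    machine's memory is considered available for the data itself.
    """
    return estimated_size_gb(rows, bytes_per_row) <= ram_gb * headroom


# Hypothetical workload: 50 million rows of about 200 bytes each.
size = estimated_size_gb(50_000_000, 200)
print(f"estimated size: {size:.1f} GB")
print("fits on a 32 GB machine:", fits_in_ram(50_000_000, 200, ram_gb=32))
print("fits on a 16 GB machine:", fits_in_ram(50_000_000, 200, ram_gb=16))
```

If your "millions and millions of rows" come out at a few gigabytes, a memory upgrade is probably a cheaper answer than a cluster.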



Upgrade your software: 

  • Look for high-performing open source software that does a better job than the commercial alternatives
  • Upgrade your existing commercial software
  • Optimize your own software, or the code you use to query and manage your data
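As one example of what optimizing your queries can look like, here is a small sketch, again with Python's built-in sqlite3 module, that asks the database how it plans to run a query before and after adding an index. The orders table, its columns and the index name are hypothetical, and the exact wording of the plan depends on your SQLite version.

```python
import sqlite3

# Hypothetical table filled with generated rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(100_000)],
)

QUERY = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"


def query_plan(conn, sql, params):
    """Return SQLite's plan for a query as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn, QUERY, (42,))

# Index the column used in the WHERE clause, then ask again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Without the index the plan should report a scan of the whole table; with it, a search through idx_orders_customer. Most SQL databases have an equivalent of EXPLAIN, and reading it is usually a better first step than shopping for new infrastructure.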



Upgrade your team:


  • Hire better talent that can optimize the workflow and write better code
  • Consult with companies that offer data management services and advice for structuring your current data storage


"Good data is better than big data"

Make a plan:


  • Make a strategic plan for your company's data growth over the next 5 years.
  • Start an in-house R&D project to try out the trendy software you hear about, and run some benchmarks so you can migrate if you find better alternatives








The final advice is a general one:

Whenever you hear someone say they use big data, automatically translate that in your mind into "Bullshit"

So, to say it out loud: you are not Facebook or Google or Twitter or... whatever. You will probably be better off with a good SQL database, well maintained and highly optimized, with a top team that knows what kind of data to store, how to analyze it, and what relevant information they can get from it.

Finally, I want to share this picture that sums up this whole debate in a funny way:


