Avoid Hadoop: a beginner's checklist for big data reporting

A lot of companies are trying to get value out of "big data". Most go through a period of panic, flailing around with ill-adapted tools. In consulting engagements or during the sales process, the same themes keep coming back.

You might suspect something's wrong when your web logs are filling up your database, reports patched together from SQL queries take minutes to generate, and your team is now considering a web-scale NoSQL data store.

It's common for great developers to waste weeks or months on a flawed approach. I'm hoping this list helps some of my readers avoid that fate.

(If all of this already sounds like old hat to you, go read Memory Architecture Hacks, from which I've taken the Hadoop criteria below.)

0: Avoid or delay using Hadoop

Hadoop is only useful when you don't need real-time results, the data is too big to ever fit in memory, and the map-reduce model is well suited to the task (e.g. the output of the map step is no bigger than its input). Unfortunately, many people spend weeks learning this new framework without realizing it forces them to solve their problems in unnatural ways.

One special case that must be mentioned is search. Just because MySQL or Postgres aren't up to the task shouldn't have you reaching for Hadoop if Lucene will do the trick. And by "do the trick" I mean it will sip resources on a single machine, returning results faster than a dozen machines painstakingly configured with HDFS and Hadoop.
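
To make that concrete, here's a minimal sketch of single-machine full-text search. I'm using Whoosh, a pure-Python Lucene-style library, as a stand-in so the snippet stays short; the file names and log contents are made up.

```python
# A minimal sketch of single-machine full-text search, using Whoosh
# (a pure-Python, Lucene-style library) as a stand-in for Lucene itself.
import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

schema = Schema(path=ID(stored=True, unique=True), body=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

# Index a batch of documents (hypothetical log files).
writer = ix.writer()
writer.add_document(path="logs/2012-05-01.log", body="GET /pricing 200 ...")
writer.add_document(path="logs/2012-05-02.log", body="POST /signup 500 timeout ...")
writer.commit()

# Query it: this runs comfortably on one machine, no cluster required.
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("timeout")
    for hit in searcher.search(query, limit=10):
        print(hit["path"])
```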

There's a similar rush towards various NoSQL data stores even when flat files are perfectly adequate.

1: Shrink it, Cache it, Fudge it

Ask how accurate the reports need to be.

Sample: sometimes 5-10% of the data is enough to get a good approximate result.
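
As a rough sketch, assuming a plain access log and a made-up parse step, sampling can be as simple as this: only process ~5% of the lines and scale the aggregate back up.

```python
# A rough sketch of sampling: process only ~5% of log lines and scale the
# aggregate back up. The file name, format, and parse() are hypothetical;
# parse() stands in for whatever per-record work is expensive.
import random

SAMPLE_RATE = 0.05

def parse(line):
    # Hypothetical layout: the requested path is the second whitespace field.
    return line.split()[1]

pricing_hits = 0
with open("access.log") as f:
    for line in f:
        if random.random() >= SAMPLE_RATE:
            continue
        if parse(line) == "/pricing":
            pricing_hits += 1

# Scale the sampled count back up to an estimate of the true total.
print(int(pricing_hits / SAMPLE_RATE))
```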

Save computations: one of my first reports was total visits to a website. An extra table saved totals for arbitrary ranges, replacing an expensive sum with a handful of selects and a bit of code overhead.
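
Here's a minimal sketch of that extra table, assuming daily buckets and made-up table names (sqlite just for brevity): keep a small daily_totals table up to date and answer arbitrary ranges from it instead of re-summing raw rows.

```python
# A minimal sketch of the "extra table" idea, assuming daily buckets and
# hypothetical table names: keep a small daily_totals table up to date, and
# answer arbitrary date ranges from it instead of summing raw visit rows.
import sqlite3

db = sqlite3.connect("reports.db")
db.execute("""CREATE TABLE IF NOT EXISTS daily_totals (
                  day TEXT PRIMARY KEY,
                  visits INTEGER NOT NULL)""")

def record_day(day, visits):
    # Run once per day (e.g. from a cron job) after the raw logs are counted.
    db.execute("INSERT OR REPLACE INTO daily_totals VALUES (?, ?)", (day, visits))
    db.commit()

def visits_between(start_day, end_day):
    # A cheap lookup over a few hundred pre-aggregated rows, instead of an
    # expensive SUM over millions of raw visit records.
    row = db.execute("SELECT COALESCE(SUM(visits), 0) FROM daily_totals "
                     "WHERE day BETWEEN ? AND ?", (start_day, end_day)).fetchone()
    return row[0]

record_day("2012-05-01", 18231)
record_day("2012-05-02", 20417)
print(visits_between("2012-05-01", "2012-05-31"))
```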

Fudge it: while visits are easy to sum, uniques seemed to be a harder problem. Fortunately, Bloom filters are useful for counting set intersections, and you can tune the trade-off between accuracy and size.
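
For illustration, here's a toy Bloom filter used to approximate a unique-visitor count; the filter size, hash count and visitor stream are made up, and a real report would size the filter from the expected number of uniques and the error you can live with. Filters built with the same parameters can also be bitwise-ANDed to approximate the intersection of two sets, say yesterday's and today's visitors.

```python
# A toy sketch of approximate unique counting with a Bloom filter. The filter
# sizes and the visitor_id stream are hypothetical; a real report would pick
# the bit count and hash count from the expected uniques and error rate.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k_hashes=5):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from slices of a single SHA-1 digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Count a visitor only the first time the filter misses. False positives make
# this a slight undercount; a bigger filter shrinks the error.
seen = BloomFilter()
uniques = 0
for visitor_id in ["a", "b", "a", "c", "b", "d"]:  # stand-in for a log stream
    if visitor_id not in seen:
        uniques += 1
        seen.add(visitor_id)
print(uniques)
```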

2: Buy more RAM

It's often cheaper to buy more RAM than to spend engineering cycles. $200-1000 of memory can sometimes speed up your reports dramatically by keeping your entire dataset in RAM instead of doing disk IO. In-memory databases are much faster than their on-disk counterparts.
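
As an illustration (the database, table and column names are hypothetical), sqlite makes the in-memory option almost free to try: copy the table you report on into a ":memory:" database once, then run every report query against RAM.

```python
# A small sketch of the in-memory option: copy a table from the on-disk
# database into a sqlite ':memory:' database once, then run the report
# queries against RAM. Table and column names are hypothetical.
import sqlite3

disk = sqlite3.connect("analytics.db")
mem = sqlite3.connect(":memory:")

# One pass of disk IO to load, then every report query is memory-speed.
mem.execute("CREATE TABLE visits (day TEXT, path TEXT, visitor_id TEXT)")
rows = disk.execute("SELECT day, path, visitor_id FROM visits")
mem.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
mem.commit()

top_pages = mem.execute("SELECT path, COUNT(*) FROM visits "
                        "GROUP BY path ORDER BY COUNT(*) DESC LIMIT 10").fetchall()
print(top_pages)
```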

3: Avoid or speed up IO

Store a memory-mapped representation of your problem. Create smaller pre-processed records by stripping fields you won't use. Gzip the files. Save indexes.
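
Here's a rough sketch of the first two ideas combined, with a made-up log layout: boil each line down to the two fields the report needs, pack them as fixed-width binary records, then mmap the result and scan it without any text parsing.

```python
# A rough sketch of pre-processing plus memory mapping. The file names and
# the raw log layout are hypothetical: each line is assumed to start with a
# unix timestamp and an HTTP status code.
import mmap
import struct

RECORD = struct.Struct("<Ih")  # 4-byte timestamp, 2-byte status code

def preprocess(raw_path, packed_path):
    # Strip each line down to the two fields we use and pack them as
    # fixed-width binary records.
    with open(raw_path) as raw, open(packed_path, "wb") as out:
        for line in raw:
            fields = line.split()
            timestamp, status = int(fields[0]), int(fields[1])
            out.write(RECORD.pack(timestamp, status))

def count_errors(packed_path):
    # Scan the packed file through mmap: no text parsing, no copies.
    with open(packed_path, "rb") as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        errors = sum(1 for _, status in RECORD.iter_unpack(data) if status >= 500)
        data.close()
        return errors
```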

On the hardware side, there are SSDs, RAID arrays, and networked RAM.

4: Cram it into memory

Remove unused fields, replace large strings with integer indexes, or use sparse matrices: whatever it takes to fit everything into RAM.
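
A small sketch of the "replace large strings with indexes" trick, with a made-up visit stream: map each distinct page path to a small integer once and keep counters in a compact array instead of a dict keyed by long URLs.

```python
# A small sketch of dictionary-encoding strings to save memory. The visit
# stream is hypothetical; a real job would read it from logs or a database.
from array import array

path_ids = {}          # path string -> small integer id
counts = array("I")    # per-page counters indexed by id, 4 bytes each

def page_id(path):
    # Assign each distinct path an id the first time we see it.
    if path not in path_ids:
        path_ids[path] = len(path_ids)
        counts.append(0)
    return path_ids[path]

for path in ["/pricing", "/signup", "/pricing", "/", "/pricing"]:
    counts[page_id(path)] += 1

# Each visit now costs one int in the array rather than another copy of the URL.
print(counts[path_ids["/pricing"]])
```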


None of the above is universally applicable, but together these steps should solve most of the pain teams hit when first venturing into the world of big data.