Move over cloud. Big Data is all the rage these days.
Marketers are talking about Big Data. CEOs are telling their product teams to go after Big Data. Reporters are trying to get the scoop about Big Data. Analysts have entire practices devoted to Big Data market research. It seems that nothing, not even our beloved “cloud”, is as shiny.
As a marketer for the OEM Solutions team, I found myself enjoying the discussion over the last year as “Big Data” slowly crept into my vernacular. I read some whitepapers on MapReduce, and attended presentations on Cassandra, Pig and Mahout. I found myself feeling like an expert, much like most people probably feel after watching an inspiring TED talk.
Yet, as a computer engineer, I found myself questioning my inch-deep understanding and the more worrisome confidence it inspired. Two-hour business meetings would feature a customer’s CEO talking about Big Data (sometimes thinly veiled by the more tempered term “analytics”) as the next big market for his company to go chase, and the room would nod their heads in approval because . . . everyone knew that “analytics is a 10^43 (the precise value of a gazillion) dollar market”, regardless of exactly what was being analyzed, how, or for whom. No product was safe from this feature creep.
One thing in particular stuck out: almost no one I met at trade shows, conferences or even customer meetings had even basic experience gathering, mining, or refining data in massive quantities. The dichotomy between those talking about Big Data and those doing Big Data was striking, and I was a talker.
The marketer in me saw the potential to be two inches deep and give tantalizing elevator pitches. The geek in me saw the chance to gain much needed expertise and help customers navigate this emerging subject.
The challenge: How do you learn about Big Data?
Big Data is shiny and new; it’s not like they teach this stuff in schools.
Actually, it turns out that they do.
An acquaintance referred me to Stanford’s Mining Massive Data Sets graduate program. The course descriptions sounded perfect:
- Mining Massive Data Sets
- Social and Information Network Analysis
- Machine Learning
- Information Retrieval and Web Search
I signed up, and soon I spent every waking moment studying eigenvectors, hashing, machine learning and Hadoop.
The content was incredible, and the potential for the techniques across my OEM customer base was almost limitless. But it was far from easy. In fact, the program took over my life . . . weekends became purely conceptual. It turns out that learning cutting edge skills from one of the best institutions in the world requires dedication and commitment.
As the program progressed, the shine wore off.
I found myself questioning my abilities. Could I do this? I hadn’t studied algorithms or linear algebra for more than a decade. After a terrible first month, I found stable footing and was keeping up with my peers.
No longer drowning, I felt as though I was treading water during a hurricane. Did I want to do this? The workload crushed my expectations, and I had to drastically realign my priorities at home and at work. Quickly, Big Data had moved from an enticing goal to a substantial resource commitment.
As I approached the end of the first class, the final question became, “now what?” It turns out Big Data is pretty broad, and figuring out where to focus can be challenging when there is so much opportunity lying at your feet.
What to know before starting your flagship Big Data project
Why does my experience matter to you? Any organization, whether one employee or 100,000, will have to cross the same barriers if it wants to tackle a Big Data problem.
1. Big Data requires a big investment in skills
This is a field of science, not a marketing term. In fact, Big Data should be considered shorthand for “Applying data mining techniques to data sets which are so large they break most data mining techniques.” The papers are still being written on how to solve various types of problems, so don’t expect to just start fiddling.
A scalpel does not make a surgeon. The creation of tools like Hadoop has made the technology very accessible, but the skills to apply this technology appropriately are specialized. Anyone can download Hadoop, import a data set and write a simple word counter. A few in the world have experience running large Hadoop clusters, and fewer still have experience utilizing those clusters to produce anything truly useful.
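That “simple word counter” really is simple. Here is the whole MapReduce pattern behind it, sketched in plain Python — my own illustration of the map, shuffle, and reduce phases, not Hadoop’s actual Java API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data mining is science"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Twenty lines, and it works. The hard part — and the reason the skills are scarce — is making the same pattern perform when the documents number in the billions and the shuffle has to cross a network.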
Unless you have room for a data scientist, a mathematician, a top-caliber programmer, and a cluster administrator, plus a budget for months of training, you’d be better off looking to a partner to help scope and deliver the work.
2. Big Data requires a big investment in time
When you pick these crazy data mining techniques to solve a problem, it is probably because that problem has proven unsolvable with traditional methods. Google had to create MapReduce because no other tool could do the job for them.
The implication is that performance is a concern with big data sets, and every detail from networking topology to the way you design your combiners can have a drastic effect on performance. To make matters worse, there is no compiler that can look across your particular hardware and software resources and then magically produce highly optimized settings and code that will make good use of them. Profiling and debugging are extremely tricky.
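To make the combiner point concrete, here is a toy Python sketch (again, an illustration of the concept rather than Hadoop’s API): a combiner runs the reducer’s aggregation locally on each mapper’s output, so fewer pairs have to cross the network during the shuffle.

```python
from collections import Counter

def map_output(doc):
    """Map phase output for one document: one (word, 1) pair per word."""
    return [(w.lower(), 1) for w in doc.split()]

def combine(pairs):
    """Combiner: the reducer's aggregation logic, run locally per mapper."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

raw = map_output("to be or not to be")
combined = combine(raw)

print(len(raw))       # 6 pairs would cross the network without a combiner
print(len(combined))  # 4 pairs cross with one -- less shuffle traffic
```

On a toy input the savings are trivial; on terabytes of skewed real-world text, whether and how you combine can be the difference between a job that finishes overnight and one that never finishes at all.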
Unless you stumble upon a team that has already implemented a collaborative filtering recommendation system that incorporates latent factors (assuming that is what you happen to be trying to do) and runs in the cloud provider of your choice or your own private cloud, then you have a lot of experimentation in your future. If you don’t yet have cloud experience, you might be in for double the challenge.
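For a flavor of what “latent factors” means, here is a minimal sketch of a latent-factor recommender trained with stochastic gradient descent. Every name, number, and rating below is invented for illustration — a real system would distribute this work across a cluster:

```python
import random

random.seed(0)

# (user, item, rating) observations -- made-up toy data
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2          # k latent factors per user/item

# Small random initialization for user (P) and item (Q) factor matrices
P = [[random.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.02, 0.02                   # learning rate, L2 regularization
for _ in range(1000):
    for u, i, r in ratings:
        err = r - sum(P[u][f] * Q[i][f] for f in range(k))
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

def predict(u, i):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))
```

The payoff is that the unobserved cells — say, user 2’s rating of item 0 — now get estimates from the learned factors. Scaling that idea to millions of users and items is exactly the kind of experimentation the paragraph above warns about.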
3. Big Data requires knowing the limitations of Big Data
Hadoop is a baby; in December of 2011, Hadoop had its 1.0 release. That was three months ago, as of this writing. It includes eight bolt-on extensions that help fill in gaps where the more generic MapReduce pattern of Hadoop fell short.
Think about that for a moment.
There have been eight gaps in the basic Apache Hadoop project that were each general and significant enough to mandate full-blown Apache subprojects as of the 1.0 release of Hadoop.
Do more gaps exist? Of course, and they will continue to be filled by the community.
So, if you are a product marketing person trying to figure out what Big Data means to you, how can you explore the possibilities of your new product without knowing whether those possibilities are grounded in reality? Maybe adding speech recognition to your newest car stereo product is going to require years of extra work, but creating an ever-refining profile of the owner’s driving habits is not so hard using Mahout (a Hadoop-related machine learning project).
Toolsets aside, you also need to consider what data sets are available and if they are appropriate for your chosen application.
If this sounds painful, it will be.
The secret to success is simple: how bad do you want it?
A big data project will be challenging, but before dismissing the idea entirely, let’s come back to an important conclusion from above: Big Data is making things possible that were previously impossible. When your goal is to be an online retailer who competes by steering customers toward the long tail of your inventory, which no brick-and-mortar can match, big data enables you to create a recommendation system that facilitates your strategy. When you want to decode the genome, big data tools allow you to get the job done before the next millennium. When you want to do some simple forecasting, big data approaches are going to be overkill.
That’s not a bad requirement, really. You must be trying to do something significantly new or challenging to require a big data solution. Hopefully we all have a few of these types of projects, for our own sake.