Experience with big analytics is costly

Move over cloud. Big Data is all the rage these days.

Marketers are talking about Big Data. CEOs are telling their product teams to go after Big Data. Reporters are trying to get the scoop about Big Data. Analysts have entire practices devoted to Big Data market research. It seems that nothing, not even our beloved “cloud”, is as shiny.

As a marketer for the OEM Solutions team, I found myself enjoying the discussion over the last year as “Big Data” slowly crept into my vernacular. I read whitepapers on MapReduce and attended presentations on Cassandra, Pig and Mahout. I found myself feeling like an expert, much like most people probably feel after watching an inspiring TED talk.

Yet, as a computer engineer, I found myself questioning my inch-deep understanding and the more worrisome confidence it inspired. Two-hour business meetings would feature a customer’s CEO talking about Big Data (sometimes thinly veiled by the more tempered term “analytics”) as the next big market for his company to chase, and everyone in the room would nod in approval because . . . everyone knew that “analytics is a 10^43 (the precise value of a gazillion) dollar market”, regardless of exactly what was being analyzed, how or for whom. No product was safe from this feature creep.

One thing in particular stuck out: almost no one I met at trade shows, conferences or even customer meetings had even basic experience gathering, mining, or refining data in massive quantities. The dichotomy between those talking about Big Data and those doing Big Data was striking, and I was a talker.

The marketer in me saw the potential to be two inches deep and give tantalizing elevator pitches. The geek in me saw the chance to gain much needed expertise and help customers navigate this emerging subject.

The challenge: How do you learn about Big Data?

Big Data is shiny and new; it’s not like they teach this stuff in schools.

Actually, it turns out that they do.

An acquaintance referred me to Stanford’s Mining Massive Data Sets graduate program. The course descriptions sounded perfect:

  • Mining Massive Data Sets
  • Social and Information Network Analysis
  • Machine Learning
  • Information Retrieval and Web Search

I signed up, and soon I spent every waking moment studying eigenvectors, hashing, machine learning and Hadoop.

The content was incredible, and the potential for the techniques across my OEM customer base was almost limitless. But it was far from easy. In fact, the program took over my life . . . weekends became purely conceptual. It turns out that learning cutting-edge skills from one of the best institutions in the world requires dedication and commitment.

As the program progressed, the shine wore off.

I found myself questioning my abilities. Could I do this? I hadn’t studied algorithms or linear algebra for more than a decade. After a terrible first month, I found stable footing and was keeping up with my peers.

No longer drowning, I felt as though I was treading water during a hurricane. Did I want to do this? The workload far exceeded my expectations, and I had to drastically realign my priorities at home and at work. Quickly, Big Data had moved from an enticing goal to a substantial resource commitment.

As I approached the end of the first class, the final question became, “now what?” It turns out Big Data is pretty broad, and figuring out where to focus can be challenging when there is so much opportunity lying at your feet.

What to know before starting your flagship Big Data project

Why does my experience matter to you? Any organization, from one employee to 100,000, will have to cross the same barriers if it wants to tackle a big data problem.

1. Big Data requires a big investment in skills

This is a field of science, not a marketing term. In fact, Big Data should be considered shorthand for “Applying data mining techniques to data sets which are so large they break most data mining techniques.” The papers are still being written on how to solve various types of problems, so don’t expect to just start fiddling.

A scalpel does not make a surgeon. The creation of tools like Hadoop has made the technology very accessible, but the skills to apply it appropriately are specialized. Anyone can download Hadoop, import a data set and write a simple word counter. Few people in the world have experience running large Hadoop clusters, and fewer still have experience using those clusters to produce anything truly useful.
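That “simple word counter” is the hello world of MapReduce, and it is worth seeing why it is simple. Here is a rough sketch in plain Python of the map/shuffle/reduce pattern — an illustration of the idea, not an actual Hadoop job, and the function names are my own:

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reducer: sum all the counts emitted for one word.
    return word, sum(counts)

def mapreduce_word_count(documents):
    # Shuffle: group intermediate pairs by key, as the framework
    # would between the map and reduce phases.
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            groups[word].append(count)
    return dict(reduce_phase(w, c) for w, c in groups.items())

print(mapreduce_word_count(["big data is big", "data is everywhere"]))
# → {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The logic itself is trivial; the specialized skill is making the shuffle and reduce steps behave across hundreds of machines, failed disks and skewed data.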

Unless you have room for a data scientist, a mathematician, a top-caliber programmer and a cluster administrator, plus a budget for months of training, you’d be better off looking to a partner to help scope and deliver the work.

2. Big Data requires a big investment in time

When you pick these crazy data mining techniques to solve a problem, it is probably because that problem has proven unsolvable with traditional methods. Google had to create MapReduce because no other tool could do the job for them.

The implication is that performance is a concern with big data sets, and every detail from network topology to the way you design your combiners can have a drastic effect on performance. To make matters worse, there is no compiler that can look across your particular hardware and software resources and magically produce highly optimized settings and code that make good use of them. Profiling and debugging are extremely tricky.
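To make the combiner point concrete, here is a toy illustration (plain Python with made-up documents) of why combiner design matters: a combiner pre-aggregates each mapper’s output locally, shrinking the number of records that must be shuffled across the network to the reducers.

```python
from collections import Counter

# One "document" per mapper — purely illustrative data.
docs = ["to be or not to be", "to do is to be", "do be do be do"]

# Without a combiner, every (word, 1) pair a mapper emits is shuffled
# across the network to the reducers.
records_without = sum(len(d.split()) for d in docs)

# With a combiner, each mapper first sums its own output, so only one
# (word, partial_count) pair per distinct word per mapper is shuffled.
records_with = sum(len(Counter(d.split())) for d in docs)

print(records_without, records_with)  # → 16 10
```

On three tiny documents the savings are modest; on a terabyte of text with a skewed vocabulary, the same local pre-aggregation can be the difference between a job that finishes and one that saturates the network.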

Unless you stumble upon a team that has already implemented a collaborative filtering recommendation system incorporating latent factors (assuming that is what you happen to be trying to do) and running on the cloud provider of your choice or your own private cloud, you have a lot of experimentation in your future. If you don’t yet have cloud experience, you might be in for double the challenge.
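For the curious, the “latent factors” idea can be sketched in a few dozen lines: learn a small vector of hidden features for each user and each item so that their dot product approximates the known ratings, then use those vectors to predict the unknown ones. This is a single-machine toy with made-up ratings and hand-picked hyperparameters, not a cluster-scale implementation:

```python
import random

# Toy (user, item, rating) triples — purely illustrative data.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2  # k = number of latent factors

random.seed(0)
# Latent factor vectors for users (P) and items (Q), initialized small.
P = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    # Predicted rating is the dot product of the two latent vectors.
    return sum(P[u][f] * Q[i][f] for f in range(k))

lr, reg = 0.05, 0.02
for epoch in range(500):
    # Stochastic gradient descent over the known ratings,
    # with a small regularization term to limit overfitting.
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

# Predict a rating the model never saw, e.g. user 0 on item 2.
print(round(predict(0, 2), 2))
```

Scaling exactly this computation to millions of users — partitioning the factor matrices, running the updates in parallel, and doing it inside someone’s cloud — is where the experimentation lives.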

3. Big Data requires knowing the limitations of Big Data

Hadoop is a baby; in December of 2011, Hadoop had its 1.0 release. That was three months ago, as of this writing. It includes eight bolt-on extensions that help fill in gaps where the more generic MapReduce pattern of Hadoop fell short.

Think about that for a moment.

There have been eight gaps in the basic Apache Hadoop project that were each general and significant enough to mandate full blown Apache subprojects as of the 1.0 release of Hadoop.

Do more gaps exist? Of course, and they will continue to be filled by the community.

So, if you are a product marketing person trying to figure out what Big Data means to you, how can you explore the possibilities of your new product without knowing whether those possibilities are grounded in reality? Maybe adding speech recognition to your newest car stereo will require years of extra work, while creating an ever-refining profile of the owner’s driving habits is not so hard using Mahout (a machine learning project built on Hadoop).

Toolsets aside, you also need to consider what data sets are available and if they are appropriate for your chosen application.

If this sounds painful, that’s because it will be.

The secret to success is simple: how bad do you want it?

A big data project will be challenging, but before dismissing the idea entirely, let’s come back to an important conclusion from above: big data is making things possible that were previously impossible. When your goal is to be an online retailer who competes by steering customers toward the long tail of an inventory no brick-and-mortar store can match, big data enables you to create a recommendation system that facilitates your strategy. When you want to decode the genome, big data tools allow you to get the job done before the next millennium. When you want to do some simple forecasting, big data approaches are going to be overkill.

That’s not a bad requirement, really. You must be trying to do something significantly new or challenging to require a big data solution. Hopefully we all have a few of these types of projects, for our own sake.

Josh Neland


4 Comments

  1. Nice article Josh!

    By: DataH on April 11th, 2012 at 9:35 am
  2. Beautiful. Definitely not the first time this has happened either. Glad to hear the real parts – how hard it is to start, the theory behind it. Fantastic.

    By: Matt Beran on April 11th, 2012 at 10:53 am
  3. Josh, I was so glad to find your article on the Stanford program. I am a marketing MBA who is returning to the workforce after time with kids. I have been reading about predictive analytics and have been drawn to the promise and growth. (I have also been reviewing the Stanford course.) Did you continue after the first course? Do you feel you are learning what you need to work in the field — or do you think you are more likely to manage the process and call in experts? Thanks!

    By: Lee McComb on April 30th, 2012 at 11:28 am
  4. Lee,
    Thanks for your comment and questions.

    I would recommend this program to anyone looking to drink from a fire hose and learn about some of the coolest things happening in computer science at the moment.

    You didn’t state it in your description, so I want to be clear that a computer science (or engineering) degree is an absolute prerequisite as this is a computer science master’s program. You need to be comfortable creating proofs that involve series, serious matrix algebra and big-O concepts. Additionally, you need to be comfortable with an array of programming languages.

I have continued on after the first course (CS246: Mining Massive Data Sets) and am now auditing CS276: Information Retrieval and Web Search to allow me to catch up at work. The classes are only offered once every couple of semesters, so it will be a two-year journey. The other students reported spending between 20 and 30 hours outside of class each week on homework assignments, and I spent more because of the steep learning curve; expect your first class to be very hard and find study partners immediately.

    The skillset taught by the program’s instructors is outstanding. I can identify applications, select algorithms and write my own systems. These same skills help me to identify a good consultant/partner when necessary. It’s really hard to separate practical knowledge from the strategic function at this point, as this is very rapidly evolving stuff constrained by an understanding of the science and current tools.

    Dell’s OEM business helps our customers find emerging technologies and apply them appropriately to their business. After talking with customers in security and surveillance, healthcare and point of sale, it is obvious that the opportunities are almost limitless to apply this skillset in very transformative ways even in seemingly mature industries.

    Another way of saying this: the market potential of these technologies is huge, limited only by the understanding of the technologies themselves and the time it takes for them to creep into existing industries.

    Please let me know if there is anything else I can help answer.

    Congratulations on spending some valuable and precious time with your kids, and I wish you good luck with whatever you choose next.

    Josh

    By: Josh Neland on April 30th, 2012 at 1:55 pm
