It may not be obvious, but social media (SM) has numerous applications that go beyond simple socialization. Besides the voyeuristic and self-promoting aspects, SM data is brimming with fresh, cheap, and accurate targeting information. This includes age, demographics, purchasing habits, buying power, education, brand loyalty, influence, and income, just to name a few.
This is powerful stuff: the insight that can be gleaned from millions of users posting in near real time could revolutionize the way products are launched and marketing decisions are made. It's no longer necessary to guess which buzzwords will resonate with users in your next campaign – users are already using those words in their public conversations. There's no longer a reason to take a spray-and-pray advertising approach in the hope that an ad will be seen by a fraction of the right buyers. Now you can easily determine where your target population hangs out and pursue it directly.
So, with such promise to disrupt the market, why haven't the big software vendors moved into this space yet? Where are products like the Microsoft Social Media Analytics Server or the IBM Social Network BI Aggregator? After all, large-volume data analytics has been around for quite some time. Over the past 20 years, giants like Microsoft, IBM, and Oracle have invested billions of dollars in developing enterprise analytics and decision support solutions. Why not adapt their existing platforms to harvest the Internet through the cloud?
The answer to these questions has a lot to do with low data quality and inconsistency. A close examination of blog, forum, Twitter, or Facebook data reveals a hodgepodge of tidbits of personal information, non-threaded conversations, and poorly typed, spelled, and formatted communications, which renders it virtually useless for structured analytics engines.
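To make this concrete, here is a minimal sketch of what happens when you try to force raw SM posts into the rigid shape a conventional analytics engine expects. The records, field names, and schema below are made up for illustration:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical raw records, typical of what a scraper returns: missing fields,
# free-form dates, and spammy or garbled text.
raw_posts = [
    {"user": "jdoe", "ts": "2010-06-01T14:22:00", "text": "lovin the new phone!!1"},
    {"user": None,   "ts": "yesterday lol",        "text": "OMG same"},
    {"user": "acme_deals", "ts": "",               "text": "BUY NOW >>> http://spam.example"},
]

@dataclass
class StructuredPost:
    user: str
    timestamp: datetime
    text: str

def to_structured(raw):
    """Try to coerce a raw record into the rigid shape a BI engine expects."""
    if not raw.get("user"):
        raise ValueError("missing user")
    ts = datetime.strptime(raw["ts"], "%Y-%m-%dT%H:%M:%S")  # fails on free-form dates
    return StructuredPost(raw["user"], ts, raw["text"])

usable = 0
for post in raw_posts:
    try:
        to_structured(post)
        usable += 1
    except (ValueError, KeyError):
        pass

print(f"{usable}/{len(raw_posts)} records survived normalization")  # 1/3 here
```

Only one of the three toy records survives, and that loss rate is before any analysis has even started.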
You may argue that at least some of the SM analytics companies must be doing something right. That may be so, but there is no quantifiable way to gauge how much of their analytics is based on real math and how much is snake-oil salesmanship. Many SM analytics providers claim to have developed patented technology to sort through volume, noise, and poor data quality. Others insist that their "secret sauce" algorithms allow them to calculate engagement, find patterns, and even accurately track memetic propagation. Most of these claims are dubious at best. There are many reasons for this, but the operative ones are:
Most SM analytics providers don't harvest their own data, and those that do certainly don't do so in real time. Rather, they subscribe to data scraping services like Compete, comScore, Hitwise, Nielsen, and Quantcast. These data harvesters collect data from only a small fraction of the relevant websites, blogs, and forums, and they do so on a schedule that can be as long as two weeks. Obviously, password-protected and membership-only sites are off limits. What you get, then, is a tiny sliver of a weighted sample population that could be weeks old.
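If you want to sanity-check a purchased feed yourself, a few lines of Python against the feed's own metadata go a long way. The site names, crawl dates, and two-week staleness threshold below are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical feed metadata: when each source was last crawled, and which of
# your target sites the feed actually covers.
last_crawled = {
    "forum.example.com":   datetime(2010, 5, 18),
    "blogs.example.net":   datetime(2010, 5, 30),
    "reviews.example.org": datetime(2010, 5, 10),
}
target_sites = {"forum.example.com", "blogs.example.net",
                "reviews.example.org", "members-only.example.com"}

today = datetime(2010, 6, 1)
stale = {site: (today - ts).days for site, ts in last_crawled.items()
         if today - ts > timedelta(days=14)}
coverage = len(last_crawled) / len(target_sites)

print(f"Coverage: {coverage:.0%} of target sites")   # 75% – the gated site is missing
print(f"Sources older than two weeks: {stale}")      # reviews.example.org at 22 days
```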
Companies that scrape platforms like Facebook or Twitter do so via the native API. Due to system performance concerns, the social networks throttle the amount of data they expose through these APIs. If you are looking at a real-time monitoring solution for any of the social networks, be prepared for very large data gaps and timeouts in your dashboard.
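The ceiling this imposes is easy to see with a simulation. In the sketch below, the rate limit, page size, and throttle behavior are invented stand-ins, not any platform's published numbers:

```python
import random

CALLS_PER_HOUR = 150        # assumed API cap
POSTS_PER_CALL = 20         # assumed page size

def fetch_page(cursor):
    """Stand-in for the real API call; randomly simulates a throttle response."""
    if random.random() < 0.2:
        return {"error": "rate_limited", "retry_after_s": 60}
    return {"posts": [f"post-{cursor}-{i}" for i in range(POSTS_PER_CALL)],
            "next": cursor + 1}

def harvest(pages):
    cursor, collected, simulated_wait = 0, [], 0.0
    while cursor < pages:
        result = fetch_page(cursor)
        if "error" in result:
            simulated_wait += result["retry_after_s"]   # anything posted now is missed
            continue
        collected.extend(result["posts"])
        cursor = result["next"]
        simulated_wait += 3600 / CALLS_PER_HOUR         # pace calls to stay under the cap
    return collected, simulated_wait

posts, wait = harvest(pages=10)
print(f"{len(posts)} posts collected after {wait / 60:.1f} minutes of waiting; "
      f"hard ceiling of {CALLS_PER_HOUR * POSTS_PER_CALL} posts/hour regardless of traffic")
```

No matter how fast users are posting, the harvester can never collect more than the cap allows, and every throttled call opens another gap in the timeline.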
Algorithms for determining text sentiment, theme, and the writer's gender, age, and education are only effective on large, well-formatted compositions. They were designed to work on structured essays that run around 1,000 words. The likelihood of accurately determining any of these characteristics from a 140-character tweet or a blog post riddled with expressions like LOL or OMG is about as good as a coin toss.
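You can see the problem with a toy lexicon-based scorer. The lexicon and sample texts below are made up, but the ratio is the point: a tweet simply doesn't contain enough scoreable evidence.

```python
# Tiny, hypothetical sentiment lexicon; real systems are larger but face the same issue.
LEXICON = {"great": 1, "love": 1, "excellent": 1, "awful": -1, "hate": -1, "broken": -1}

def score(text):
    tokens = text.lower().split()
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits), len(hits), len(tokens)

tweet = "OMG new fone LOL gr8 battery tho #fail"
essay = ("The build quality is excellent and I love the screen, "
         "but the battery life is awful and the charger arrived broken. ") * 10

for label, text in [("tweet", tweet), ("essay", essay)]:
    total, hits, tokens = score(text)
    print(f"{label}: {hits} scoring tokens out of {tokens} "
          f"({hits / tokens:.0%} evidence), score={total}")
```

The tweet yields zero scoring tokens, so any sentiment the tool reports for it is essentially a guess.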
Even the largest data providers scrape less than 1 percent of the relevant Internet data. The analytics you are viewing probably represent information found across no more than a handful of sites, blogs, or forums. Making multi-million-dollar advertising decisions based on such small, low-quality data sets could be risky.
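One way to put numbers on that risk is a plain margin-of-error calculation, which is itself optimistic because it assumes the posts were a random sample (scraped data rarely is). The dashboard figure and sample sizes below are hypothetical:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an observed proportion p over n sampled posts."""
    return z * math.sqrt(p * (1 - p) / n)

observed_positive = 0.62          # e.g., "62% positive sentiment" on a dashboard
for n in (50, 500, 5000):
    moe = margin_of_error(observed_positive, n)
    print(f"n={n:>5}: 62% positive, +/- {moe:.1%}")
```

At 50 posts the reported 62 percent is really "somewhere between 48 and 76 percent," and that is before accounting for sampling bias.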
Due to the growing availability of automated tools for creating blogs, websites, and posts, we are starting to see a significant amount of machine-generated content designed to pump up SEO visibility for adware sites. Data scrapers are unable to distinguish between machine-generated and human-typed content, which can skew the analytics.
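Detecting templated content is possible but far from free. Here is a crude heuristic sketch (the sample posts and masking rules are invented) that flags posts sharing the same text skeleton once links and names are masked out:

```python
import re
from collections import Counter

posts = [
    "Best deal on Acme widgets, click http://a.example/1 now!",
    "Best deal on Zenith widgets, click http://b.example/7 now!",
    "Best deal on Orion widgets, click http://c.example/9 now!",
    "Just got back from the lake, the kids loved it.",
]

def template(text):
    text = re.sub(r"https?://\S+", "<url>", text)       # mask links
    text = re.sub(r"\b[A-Z][a-z]+\b", "<name>", text)   # mask capitalized names
    return text

counts = Counter(template(p) for p in posts)
for p in posts:
    flag = "machine?" if counts[template(p)] > 2 else "human?"
    print(f"[{flag}] {p}")
```

A spammer only has to vary the template slightly to defeat a heuristic like this, which is why most scrapers don't bother.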
Data feeds frequently go through secondary processing before they are presented to users. This additional refinement may include the removal of partial records (e.g., missing dates or user names) or offensive content such as cursing, pornography, or spam. All this massaging further reduces the population size and the accuracy of the results.
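Here is a sketch of what that secondary processing can look like; the filters, blocklist, and records are invented for illustration:

```python
BLOCKLIST = {"viagra", "xxx"}

records = [
    {"date": "2010-05-30", "user": "amy",   "text": "switched to the new plan, happy so far"},
    {"date": None,         "user": "bob",   "text": "coverage is terrible downtown"},
    {"date": "2010-05-31", "user": None,    "text": "customer service was great"},
    {"date": "2010-05-31", "user": "spam1", "text": "cheap viagra here"},
]

def keep(rec):
    if not rec["date"] or not rec["user"]:
        return False                      # partial record dropped
    return not any(w in rec["text"].lower() for w in BLOCKLIST)

cleaned = [r for r in records if keep(r)]
print(f"{len(cleaned)} of {len(records)} records survive "
      f"({len(cleaned) / len(records):.0%} of the original population)")
```

Note that two of the three dropped records carried legitimate opinions, so the filtering doesn't just shrink the sample, it also changes the sentiment mix.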
So what is the moral of the story? If you are on a quest for the SM analytics holy grail, you won't find it, because it all depends on how much YOU are willing to compromise in terms of data sample size, quality, and accuracy.
If you are in the market for an SM analytics tool, don't take any chances by committing to a specific solution before doing your homework. Ask the vendor to explain to you (in 8th-grade English) how they address the six items mentioned above. Arrange for a trial period with at least three vendors, then compare their analytics against each other, a benchmark, and a ground truth known to you. This should give you a sense of how accurate each tool is and the margin of error you can expect going forward.
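If you want to make that bake-off concrete, something as simple as the following will do; the vendor names, reported numbers, and ground truth are placeholders for your own trial data:

```python
# Ground truth you control, e.g. sentiment you hand-labeled on a sample of your own posts.
ground_truth = {"positive": 0.40, "neutral": 0.35, "negative": 0.25}

vendor_reports = {
    "Vendor A": {"positive": 0.55, "neutral": 0.30, "negative": 0.15},
    "Vendor B": {"positive": 0.42, "neutral": 0.33, "negative": 0.25},
    "Vendor C": {"positive": 0.70, "neutral": 0.20, "negative": 0.10},
}

def mean_abs_error(report):
    return sum(abs(report[k] - ground_truth[k]) for k in ground_truth) / len(ground_truth)

for name, report in sorted(vendor_reports.items(), key=lambda kv: mean_abs_error(kv[1])):
    print(f"{name}: mean absolute error {mean_abs_error(report):.1%}")
```

Whichever vendor lands closest to your ground truth, the size of that error is the margin you should carry into any decision built on their dashboards.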
© Copyright 2010 Yaacov Apelbaum All Rights Reserved