The API Train Wreck Part-2


In The API Train Wreck Part-1, I discussed API design factors such as KPIs, performance measurement, monitoring, runtime stats, and usage counters. In this post, I'll discuss the design factors that will influence your API's load and traffic management architecture.

Load and Traffic Management
One of the key issues you are going to face when opening your API to the world is the adverse effect on your own front-end and back-end systems. If you are using a mixed model, where the same API is used for both internal line-of-business applications and external applications, uncontrolled bursts of traffic and spikes in data requests can cause your service to run red hot.

In a mixed model, as the number of external API users increases, it's only a matter of time before you start experiencing gremlin effects such as brownouts, timeouts, and periodic back-end shutdowns. Due to the stateless nature of SOA and the multiple points of possible failure, troubleshooting the root cause of these problems can be a nightmare.

This is why traffic management is one of the first and most important capabilities you should build into your API. A good example of this type of implementation can be found in the Twitter REST API's rate limiting. Twitter provides a public API feed of their stream for secondary processing. In fact, there is an entire industry out there that consumes, mines, enriches, and resells Twitter data, but this feed is separate from their internal API, and it contains numerous traffic management features, such as caching, prioritizing active users, adapting to search patterns, and blacklisting, to ensure that their private feed will not be adversely impacted by the public API.

Bandwidth Management
Two of the most common ways to achieve efficient bandwidth management and regulate the flow are (1) throttling and (2) rate limiting. When you are considering either option, make the following assumptions:

  • Your API users will write inefficient code and only superficially read your documentation. You need to handle the complexity and efficiency issues internally as part of your design.
  • Users will submit complex queries with maximum ranges. You can't tell the user not to be greedy; rather, you need to handle these eventualities in your design.
  • Users will often try to download all the data in your system multiple times a day via a scheduler. They may have legitimate reasons for doing so, so you can't just deny their requests because it's inconvenient for you. You need to find creative methods to deal with this through solutions like off-line batch processing or subscriptions to updates.
  • Users will stress test your system in order to verify its performance and stability. You can't just treat them as malicious attackers and simply cut off their access. Take this type of access into account and accommodate it through facilities like separate test and certification regions.

In order to be able to handle these challenges, you need to build a safety fuse into your circuitry.  Think of Transaction Throttling and Rate Limiting as a breaker switch that will protect your API service and backend.

Another reason to implement transaction throttling is that you may need to measure data consumption against a time-based quota. Once such a mechanism is in place, you can relatively easily segment your customers by various patterns of consumption (e.g., date and time, volume, frequency, etc.). This is even more relevant if you are building a tier-based service where you charge premium prices per volume, query, and speed of response.
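
To make the breaker-switch analogy concrete, here is a minimal token-bucket throttle sketch in Python. The class and the capacity/refill numbers are my own illustration of the technique, not a prescribed implementation:

```python
import time

class TokenBucket:
    """A simple token-bucket throttle: capacity bounds the burst size,
    refill_rate bounds the sustained requests per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        now = time.monotonic()
        # Refill tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: allow bursts of up to 20 requests, sustained 5 requests per second.
bucket = TokenBucket(capacity=20, refill_rate=5)
if not bucket.allow():
    pass  # reject or queue the request, e.g. respond with HTTP 429
```

The point of the bucket is that it acts exactly like a breaker: short bursts pass through, but sustained overload trips it before your back-end runs red hot.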

Rationing, Rate Limitations, and Quotas
OK, so now we are ready to implement some quota mechanisms that will provide rate limits. But just measuring data consumption against daily quotas can still leave you vulnerable to short bursts of high-volume traffic, and if enforcement is done via 'per second' rate limits, it can be perceived as non-business-friendly.

If your customers are running a high volume SaaS solution and are consuming your API as part of their data supply chain, they will probably find it objectionable when they discover that you are effectively introducing pre-defined bottlenecks into their processing requests.

So even if you consider all data requests equal, you may find that some requests are more equal than others (due to the importance of the calling client), or that they contain high-rate transactions or large messages, so just limiting 'X requests per second' might not be sufficient.

Here are my 13 core considerations for designing the throttling architecture; a sketch combining several of them follows the list.

  1. Will you constrain the rate of a particular client's access by API key or IP address?
  2. Will you limit your data flow rate by user, application, or customer?
  3. Will you control the size of messages or the number of records per message?
  4. Will you throttle your own internal apps differently from public API traffic?
  5. Will you support buffered or queued requests?
  6. Will you offer business KPIs on API usage (for billing, SLA, etc.)?
  7. Will you keep track of daily or monthly usage data to measure performance?
  8. Will you define different consumption levels for different service tiers?
  9. How will you handle over-quota conditions (e.g., deny the request, double-bill, etc.)?
  10. Will you measure data flow for billing and pricing?
  11. Will you monitor and respond to traffic issues (security, abuse, etc.)?
  12. Will you support dynamic actions (e.g., drop or ignore the request, send an alert, etc.)?
  13. Will you provide users with usage metadata so they can help by metering themselves?
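
As a sketch of how several of these considerations might combine (per-key limits from #1, message size from #3, daily usage from #7, tiers from #8, and over-quota handling from #9), consider the following. The tier names, limits, and status codes are hypothetical:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Tier:
    requests_per_day: int
    max_records_per_message: int

# Hypothetical service tiers (consideration 8); real limits would come from your pricing model.
TIERS = {
    "free":    Tier(requests_per_day=1_000,   max_records_per_message=100),
    "premium": Tier(requests_per_day=100_000, max_records_per_message=10_000),
}

usage = defaultdict(int)  # per-key daily request counter (consideration 7)

def check_request(api_key: str, tier_name: str, records_requested: int):
    """Decide whether a request may proceed; returns (allowed, http_status, reason)."""
    tier = TIERS[tier_name]
    if records_requested > tier.max_records_per_message:
        return False, 413, "message size exceeds tier limit"  # consideration 3
    if usage[api_key] >= tier.requests_per_day:
        return False, 429, "daily quota exhausted"  # consideration 9: deny, don't double-bill
    usage[api_key] += 1
    return True, 200, "OK"
```

In a real deployment the counters would live in shared storage rather than process memory, but the shape of the decision stays the same.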

Obviously, no one consideration in the list above will singlehandedly control your design. During your deliberation process, dedicate some time to go over each of the 13 points. List the pros and cons for each and try to estimate the impact on development scope and the overall project timeline.

For example, evaluating consideration 1, constraining the rate of a particular client's access by API key, will yield the following decision table:

| Pros for Key Management | Cons for Key Management | Impact |
| --- | --- | --- |
| Conducive to per-seat and volume licensing models and can support both enterprise and individual developers. | Need to invest in the setup and development of a key lifecycle management system. | Effort is estimated at x man-months and will push the project delivery date by y months. |
| Tighter control over user access and activity. | Need to provide customer support for key lifecycle management. | Effort will require hiring/allocating x dedicated operational resources to support provisioning, audit, and reporting. |
| Conducive to a tiered licensing and billing model, including public, government, and evaluate-before-you-buy promotions. | Managing international users will require specialty provisioning and multi-lingual capabilities. | Will complement and support the company's market and revenue strategy. |
| Will scale well to millions of users. | Initial estimate of the number of customers is relatively small. | |

Resist the temptation to go at it 'quick and dirty'. Collect as much input as possible from your front-end, middle-tier, back-end, operations, and business stakeholders. Developing robust and effective throttling takes careful planning, and if not done correctly, it can become the Achilles' heel of your entire service.

The above pointers should cover the most critical aspects of planning for robust API traffic management.

© Copyright 2013 Yaacov Apelbaum. All Rights Reserved.

The API Train Wreck Part-1


Developing a commercial-grade API wrapper around your application is not a trivial undertaking. In fact, in terms of effort, the development of an outbound-facing API can dwarf all other application development tasks combined by an order of magnitude.

I’ve seen some large development organizations attempt to open their internal platform to the world just to watch in horror as their core systems came to a screeching halt before ever reaching production.

In this and a follow-up posting, I’ll try to provide you with some hands-on tips about how you can improve the quality of your API and more efficiently ‘productize’, ‘commercialize’, and ‘operationalize’ your data service.

It’s 9 PM, do you know who’s using your API?
Do you have accurate and up-to-date KPIs for your API traffic and usage? Are these counters easily accessible or do you need to troll through numerous web server and application logs to obtain this data? Believe it or not, most APIs start as orphaned PoCs that were developed with the approach of ‘do first, ask permission later’.  Things like usage metrics are often put on the back burner until project funding and sponsorship roll in.

Paradoxically, due to their integrative effect, application APIs can provide the biggest boost to traffic and revenue, but at the same time they can become your biggest bottleneck.

So my first tip is to build usage information into your core functions. Before you embark on an API writing adventure, your product management and development teams should have clear answers to these questions:

  • How many clients, applications, and individual developers will use your API?
  • Will you offer public and private versions of it?
  • What will the distribution of your clients, apps, and developers by type and location be?
  • Who will the top-tier customers be? Developers? Partners?
  • What parts of the API will be used most heavily?
  • What will the traffic breakdown between your own services and 3rd party services be?
  • What will the aggregate per-application and per-customer transaction volumes be?
  • How fast and reliable will your service be (i.e., response, latency, downtime, etc.)?
  • What are the architectural and technical limiting factors of your design (e.g., volume)?
  • How will the API be monitored and SLA'd?
  • What type of real-time performance dashboards will you have?
  • How will public-facing API usage affect the rest of your platform (i.e., throttling)?
  • Will you handle error routines generically or identify and debug specific messages?
  • Do you need to support auditing capabilities for compliance (e.g., SOX, OFAC, etc.)?
  • What will the volume of processed/delivered data be (e.g., from third parties)?

Asking these questions—and getting good answers to them—is critical if you plan to develop a brand new SaaS platform to expose your internal platforms to the world.

This information is also critical to your ability to establish contractual relationships with your internal users and external customers. Depending on the type of information you serve, it may also be mandatory for compliance with state and Federal regulations.

Collecting the correct performance metrics and mapping them to your users and business functions will help you stay one step ahead of your users' growing demands.
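
To illustrate the first tip, here is a minimal sketch of building usage counters directly into the request path rather than trawling logs after the fact. The class and function names are my own illustration, not a prescribed design:

```python
import threading
from collections import Counter

class UsageStats:
    """Thread-safe per-key, per-endpoint call counters that a dashboard can poll."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = Counter()

    def record(self, api_key: str, endpoint: str):
        with self._lock:
            self.calls[(api_key, endpoint)] += 1

    def snapshot(self) -> dict:
        """Return a point-in-time copy for KPI dashboards or billing jobs."""
        with self._lock:
            return dict(self.calls)

stats = UsageStats()

def handle_request(api_key: str, endpoint: str):
    stats.record(api_key, endpoint)  # metering happens on every call, not after the fact
    ...  # normal request processing
```

With something like this in the core path, answering "who's using your API at 9 PM" is a dictionary lookup instead of a log-mining expedition.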

© Copyright 2013 Yaacov Apelbaum. All Rights Reserved.

The Startup Leap to Success


One of the most challenging periods for any startup is passing through the “Valley of Death”. During this delicate phase, the organization’s burn rate is high and it has to rapidly achieve the following three goals:

  1. Move from a proof of concept (POC) to a functional commercial product
  2. Reach cash-flow break-even
  3. Transition from seed/angel funding to venture capital funding

For startups focusing on the development of SaaS products, this phase also marks an important milestone in the maturity of their product. With an increased volume of production users come stricter SLAs and the need to implement more advanced operational capabilities in areas such as change control, build automation, configuration management, monitoring, and data security.

[Image: Startup financing cycle]

If you are managing the technology organization in an early-stage startup, you have every reason to be concerned. To the outsider, the success and failure of startups often seem to be shrouded in mystery: part luck, part black magic. But ask a seasoned professional who has successfully gone through the startup meat grinder and he will tell you that success has nothing to do with luck, spells, or incantations.

Having worked with a number of startups, I have come to conclude that the most common reason for product failure (beyond just not being able to build a viable POC) is the inability to control your product's stability and scalability.

In the words of Ecclesiastes, there is a time and purpose for everything under heaven. In the early stages of a startup's life cycle, process is negotiable. Too much process may hinder the speed at which you can build a functional POC. In later stages, reliable processes and procedures (e.g., requirements, QA, unit testing, documentation, build automation, etc.) are critical. They are the very foundations of any commercial-grade product. Poor quality and performance are self-evident, and no matter how much marketing spin and management coercion you use, if you are trying to secure an early-stage VC funding round, your problems will rapidly surface during the due diligence process.

To avoid the startup blues, keep your eyes on the following areas. Factoring them into your deployment will help you deliver on time and on budget, with the proper scalability and highest quality possible.

Design Artifacts
Before converting your POC to a functional product, take the time to design your core components (e.g., CRM, CMS, DB access, security, API, etc.). Create a high-level design that identifies all major subsystems. Once you know what they are, zoom into each subsystem and provide a low-level design for each of these as well.

  • Resist the temptation to code core functionality before you have a solid and approved scalable architecture (and the documentation for it).
  • Let your team review and freely comment on the proposed platform architecture and deployment topology. Just because a vocal team member has religious technology preferences doesn't mean that everyone should convert.
  • No matter how good your technical staff is, when it comes to building complex core functionality (transaction engine, web services API, etc.), don't give any single individual carte blanche to work in isolation without presenting their design to the entire team.
  • Document the product as you develop it. Building a complex piece of software without accurate documentation is akin to trying to operate a commercial jet without its flight manual.
  • To help spread information and knowledge, establish a company-wide document repository (like a wiki or SharePoint) and store all your development and design documents under version control. Discourage anyone from keeping independent, runaway copies of system documents.
  • Maintain an official (and versioned) folder for the platform documentation showing product structure and components, development roadmaps, and technical marketing materials.

Testing and QA
If you are not writing unit tests, you have no way to verify your product's quality. Relying on QA to find your bugs means that by the time you do (if ever!), it will be too late and too expensive to fix them. Spend a little extra time and write unit tests for every line of code you deploy in production. When refactoring old code, update the original unit tests as well.
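
As a minimal illustration of pairing production code with its unit test (both the function and the test are hypothetical examples, not from a real product):

```python
def normalize_zip(raw: str) -> str:
    """Normalize a US ZIP code to its 5-digit form; raise ValueError on bad input."""
    digits = raw.strip().replace("-", "")
    if not digits.isdigit() or len(digits) not in (5, 9):
        raise ValueError(f"invalid ZIP code: {raw!r}")
    return digits[:5]

# The matching unit test lives next to the code and is updated with every refactor.
def test_normalize_zip():
    assert normalize_zip(" 10001 ") == "10001"
    assert normalize_zip("10001-1234") == "10001"
    try:
        normalize_zip("1000")
    except ValueError:
        pass
    else:
        assert False, "expected ValueError for a 4-digit ZIP"
```

The discipline matters more than the framework: every behavior change to `normalize_zip` forces a matching change to `test_normalize_zip`, so regressions surface in the build, not in QA.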

Just like most things in life, bugs have a lifecycle: they are born, they live, and they die. Effectively tracking them as part of your build and QA process is a prerequisite for their timely resolution.

If you are discovering a high critical-bug count in your "code complete" release (half a percent of source code, e.g., 500 bugs for a 100,000-line code base), you may not be production ready. Stop further deployment and conduct a thorough root cause analysis to understand why you have so many issues.

If your bug opening/closure rate remains steady (i.e., QA is opening bugs at the same rate development is closing them) and you have recurring bug bounces, you may need to reassess the competency of your development resources. This would also be a good time to have a serious heart-to-heart conversation with the developers responsible for the bugs. Be prepared for some tough HR decisions.

Monitoring and Verification
Just like you wouldn't drive a car without a functional dashboard, you can't run quality commercial software without real-time visibility into its moving parts. Implement a monitoring dashboard to track items such as daily builds (and breaks), server performance, user transactions, DB table space, etc.

Seeing is believing. Products like Splunk can help you aggregate your operational data.  Once you have this information, show it to your entire team. I personally like to pump it onto a large screen monitor in the development areas so everyone can get a glimpse.

[Image 1: Splunk Dashboard in Action]

Security, Scalability and Operations
Unless you are in the snake-oil sales business, build your production environment from the get-go for scalability, security, and redundancy. Don't look for "bargains" on these technologies; leverage commercial-grade load balancers, firewalls, and backup solutions.

Your production environment is critical to your success, so don’t treat it as a second class citizen or try to manage it with part time resources. As you will quickly discover, a dedicated sys admin and a DBA who know your platform intimately are worth their weight in gold.

You must achieve operational capabilities in build automation, release management, bug tracking, and configuration management before going live.  If you don’t, be prepared to spend most of your productive time fixing boo-boos in the wee hours of the night.

Implementing many of the above mentioned measures will give you a significant tactical advantage as well as a strategic boost when negotiating with potential VCs.  Having these capabilities on your utility belt will also help you calmly face any future issues as your startup matures.

 

© Copyright 2011 Yaacov Apelbaum All Rights Reserved.

Ode to The Code Monkey

The Code Monkey (inspired by A Dream Within A Dream by Edgar Allan Poe)

Take another slap upon the cheek,
While slaving on this project, week by week. 
You have been wrong to work so hard,
Expecting riches and managerial regard.
Grinding out functions awake and in a dream,
Will not fetch rewards or professional esteem.

What you lack are not more lines of code, 
Rather it’s architecture and a road.
To substitute quality with speed,
Is the motto of the code monkey creed. 
You who seek salvation in RAD extreme
Will find, alas, a dream within a dream.

If you examine your latest stable build,
You will notice many bugs that haven’t been killed.
Strangely, they seem to grow in relation,
To your oversized code base inflation.
So many new features! How did they creep? 
Through scope expansion, they trickle deep.

Building good software is hard to manifest,
If you fail the requirements to first digest.
The lesson to learn from this development ditty,
Is that no matter how clever you are or witty,
If you fudge the schedule and estimation phase,
There is but one reward for you. The death march malaise!

© Copyright 2010 Yaacov Apelbaum All Rights Reserved.

To Make Errors is Human, to Handle Them is Divine

[Image: "We Have Bugs" advertisement]

Reading this advertisement made me realize just how clever the software industry has become. Why bother fixing your bugs prior to shipment when you can sell your product on the premise that you will fix the bugs "free of charge" once the users find them for you? Interestingly, anyone who bothers to read the licensing guide will find the following sobering caveat:

“…From an engineering point of view, it is impossible to fix bugs in multiple source code branches. If we would have to do this, we would never be able to implement a major redesign. Major redesigns are required now and then to be able to fix bugs and add features fast.” 

Nothing communicates your attitude towards your users better than the way you handle exceptions and errors. As soon as something goes wrong with your application, the user is at a heightened emotional state and is the most impressionable. Some software products, including some leading market applications, have developed a bad reputation for cryptic error messages that are impossible to resolve, leaving the user feeling helpless and outraged.

The worst offenders include fortune-teller-style messages that inform you (not without irony) that you are about to lose all of your work because the application has encountered an unknown problem and needs to shut down.

[Image: A useless error message]

This is even more pronounced in the session-less environment of the internet. It seems that when it comes to web application reliability and robustness, we’ve been steadily taking a step backward in the way we treat our users.

[Images: Useless error messages]

How Engineering Handles Failure
A civil engineer designing a bridge will invest a significant amount of time and resources in predicting potential structural failure scenarios. Failure analysis and safety factoring (i.e. redundancy) are two important cornerstones of the engineering discipline. In the physical world of machines and structures, the ability to identify a potential design flaw and remedy it is a given. Similarly, we should strive to achieve the same in the virtual software world by accounting for critical error conditions and developing robust application code capable of handling those cases.

Software engineering does have certain nuances that differ from classical engineering, which make the prioritization of work more arbitrary and less straightforward. For example, a memory leak in a server component may be considered by the development team to be a critical bug, but a relatively small data validation problem that forces the user to retype a lengthy application form could have a bigger user impact and rank higher in the bug-fix priority.

A 12 Step Program for Error Rehabilitation
Making your application more agile in handling failure and enabling it to degrade gracefully is not a single-step process, and there is no silver-bullet technology out there that will fix this problem. If you want to break the cycle of application instability and user frustration, you will have to dedicate time and your best technical talent to solving it. I have found that a phased approach works best. In this approach, you first handle the low-hanging fruit (addressing the mechanics of error handling), and then gradually move to higher ground (addressing automated problem resolution and preemptive countermeasures).

The following is my 4-phase program for solving your application errors. The classification is inclusive, so the 4th phase (the highest level of reliability) also includes the properties of the preceding levels:

Phase-1: Create Unique and Traceable Errors and a Way to Record Them
If you are under the gun and don’t have time for any other remedy, at least make sure that your error cases are unique. Telling your users that an error has occurred in the application without providing details is a sign of an immature product. When your technical support team receives an error report, they should be able to determine precisely what is causing the problem.

Generic error handling (the same message for all errors), or different error causes that return identical messages, is easy to implement, but when it comes to debugging it is useless. Unique error IDs allow you to track bugs more efficiently and translate them into a more stable product.

Error codes should be visible in the error messages but should not be the focal point of the message. You should develop a library of descriptive text that provides a human-readable explanation of what each error means. Provide a simple mechanism to either log the message directly from your app or send it to you via email. Nothing is more annoying to the user than being asked to type in the error message manually.

Establish an issue tracking system that allows quick data entry and reporting. At a minimum, record the error code, the error description, the steps to reproduce it, the affected environments, and its frequency.
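
Here is a minimal sketch of what Phase-1 might look like in code. The error-ID scheme, class, and logger names are illustrative assumptions, not a prescribed design:

```python
import logging
import uuid

logger = logging.getLogger("app.errors")

class AppError(Exception):
    """Application error with a unique, traceable ID and a human-readable description."""

    def __init__(self, code: str, description: str):
        self.code = code                         # stable and unique per error cause, e.g. "PAY-0007"
        self.incident_id = uuid.uuid4().hex[:8]  # unique per occurrence, shown to the user
        self.description = description
        super().__init__(f"[{code}:{self.incident_id}] {description}")

def fail_payment_validation():
    err = AppError("PAY-0007", "The card expiration date is in the past.")
    logger.error("%s", err)  # recorded automatically; the user never retypes it
    raise err
```

When support receives "[PAY-0007:3fa1b2c4]", they know both the exact cause and the exact occurrence without asking the user for anything else.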

Phase-2: Keep the User Calm and his Data Safe
Error messages should always carry a mature and responsible tone. Always use supportive, polite language, like a good teacher would when instructing a pupil.

If the user opts to leave a mandatory field empty, or mistypes the data type (CC#, zip, etc.), don't go ballistic. Non-critical errors deserve non-critical messages. Instead, indicate on the entry form where the problem was, place the cursor in the relevant field, and leave the rest of the data intact. This is especially important for long entry forms that require a lot of effort to complete.

Don’t force the user to duplicate entry of some previously supplied data for verification purposes (such as billing and shipping information) as it may introduce human error and trigger him to abandon the application altogether.
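
A minimal sketch of non-critical, field-level validation that reports every problem at once and keeps the rest of the data intact (the field names and messages are hypothetical):

```python
def validate_order_form(form: dict) -> dict:
    """Validate a form and return a field->message map instead of failing on the first error.
    The submitted values are never discarded; the UI re-renders them with inline messages."""
    errors = {}
    if not form.get("email"):
        errors["email"] = "Please enter your email address."
    if not form.get("zip", "").isdigit():
        errors["zip"] = "The ZIP code should contain digits only."
    return errors  # an empty dict means the form is valid

form = {"email": "", "zip": "1000a"}
problems = validate_order_form(form)  # two inline messages; the data in `form` stays intact
```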

Phase-3: Good Error Messages Are Clear and Provide Remedies
The way the user perceives the error is much different from the way you do. He thinks in business terms and knows nothing about the inner workings of your application, nor does he care. That’s why you should always design the error UI from the user’s perspective.

Here are the seven golden attributes of error messages:

  1. Describe the error in user terms and language
  2. Instruct the user as to how to complete the task and resolve the error
  3. Explain how to prevent the problem in the future
  4. Avoid technical mumbo jumbo and acronyms
  5. Avoid modal pop-up error messages and instead write errors directly to the page
  6. Provide help links that better explain the nature of the error
  7. Keep the text formatting simple and avoid bright colors and animations

When providing a solution, give clear step-by-step instructions for how to fix the problem. Be specific and do not assume any previous user knowledge. If there is a relevant tutorial or a specific solution in your on-line help, link directly to it. If it's a critical problem—for example, the Website is not accessible—provide a mechanism for the user to report the problem to you, immediately acknowledge receipt of his complaint, and provide an explanation and an estimate of the time before the problem will be resolved.

Phase-4: Handle Errors Internally
Write code to robustly handle all errors. This will eliminate the most severe and common errors (like missing data or validation). You can achieve this by automating data entry components of the user interaction (e.g., deriving the city name from the zip code).
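
As a minimal sketch of that idea (the lookup table is a stand-in for a real postal-code database or address service):

```python
from typing import Optional

# Stand-in for a real postal-code database or address service.
ZIP_TO_CITY = {"10001": "New York", "60601": "Chicago", "94105": "San Francisco"}

def city_for_zip(zip_code: str) -> Optional[str]:
    """Return the city for a ZIP code, or None so the form can fall back to manual entry."""
    return ZIP_TO_CITY.get(zip_code.strip())

city = city_for_zip("10001")  # "New York"; a miss degrades gracefully to a blank field
```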

To the extent possible, take corrective action before an error occurs. For example, if the user is in the middle of a lengthy entry form, save the contents as he moves between fields; this will allow you to restore the information if he inadvertently navigates off the page or even closes his browser session.

It’s often expensive to identify and address all possible failure cases, but if you have been tracking your top bugs, you can start with the biggest offenders first.

The way you handle and communicate application errors directly reflects on your team's and your company's reputation. When building new functionality or reworking existing functionality, don't assume that the old error messages apply to your new logic and boundaries. Building test cases around various error scenarios (missing data, wrong data, bad data, etc.) and dedicating a test cycle to generating all known error messages is also an excellent strategy.

Error handling and messages should be thought of as a required phase of any feature development, and adequate engineering time for it should be budgeted into all SDLC estimates.

Real quality of service goes beyond just acknowledging your application’s faults. My rule of thumb is that there is no such thing as an “informative error message”. A good error is one that has been eliminated through error-handling code and through superior product design.

© Copyright 2010 Yaacov Apelbaum All Rights Reserved.

It’s Good Enough for Me


I commute frequently, so I tend to have some down time at the airport while sitting at the gate and waiting for my ship to come in. I usually use this window to catch up on my technical reading, but recently I decided to take a break and venture into one of the book stores in the concourse. After skimming the offerings, I discovered a bookshelf filled with titles of the "How I Became the Best In ___, and How You Too Can By Simply Following My Easy Three-Step Program" genre. These books, mind you, are not cheap paperbacks. I was looking at thick hardbacks, generously illustrated and accordingly priced. Apparently, the "How to Become the Best" series industry is booming.

This got me thinking: statistically speaking, the best of any kind occupies only a tiny outlier region of the bell curve. So why the hype? Clearly, if this industry is thriving, there are enough literate people out there willing to buy into the idea that being the "best" is worth their time and money.

Then a few weeks ago, I found myself confronted with this concept again. I was having lunch with a colleague and he raised the argument that the only way to win in today’s lean software economy is to develop the “best” features and functionality. He expressed his strong conviction by recounting his recent experience at a trendy “how to become the best” seminar. “I am a new man,” he said, “This event has changed my entire outlook on product development”. “How’s that?” I asked, curious. He leaned forward, squinted, and in a lower and somewhat more mysterious voice, he summarized his newly acquired philosophy. He said that according to the presenters, Trump, Robbins, and Kiyosaki, success hinges on one’s ability to tap into one’s inner best. Either you’re Napoleon or you’re out of the game.

At this point, I was done with my burrito and so I seized the opportunity to respond in kind with a rival French maxim. I quoted Voltaire: “Le mieux est l’ennemi du bien” (The Best is the Enemy of the Good). Wellington, I pointed out, was by no means the best, but he certainly outlasted Napoleon in the game.

My companion was startled and said he didn’t understand what I meant. I offered an explanation: “It’s not that I am a proponent of mediocrity; to the contrary,” I said, “I pride myself on my attention to quality. I have absolutely no problem with the concept of pursuing excellence. What prevents me from realizing perfection are mundane details such as looming deadlines, shrinking budgets, and a chronic shortage of resources.”

Of course it's easy to invoke demagoguery and claim that it's either "best" or "bust". Many development managers adopt this mistaken philosophy, assuming that it has a positive motivational value. The average corporate culture doesn't help dispel this myth either, creating unattainable criteria for personal performance and compensation plans. Regardless of how fond of the cliché you may be, preaching "the best" when it comes to delivering software under time, quality, and budgetary constraints is one thing; actually being able to deliver on such promises is quite another. If we learn anything from human endeavors, it is that "good enough" is more than acceptable. As far as I know, most of us don't drive the best car on the market, live in the best-built house, or exclusively buy the best clothes or appliances. Compromise is the order of the day.

My favorite story that illustrates this concept is the World War II race to develop radar. Both the British and German teams were aware of the tremendous operational and strategic advantage this new technology could offer. The German development team had the more advanced science and superior technology. Their radar was more accurate, had a longer range, and produced fewer false positives. The German team—true to their cultural heritage—was striving to develop the best apparatus possible. The British team was smaller, less experienced, and had inferior technology. But from the outset, it adopted the motto "Second Best Tomorrow". This philosophy eventually allowed them to release an inferior but working radar earlier than the Germans, thus winning the race and ultimately tipping the balance of power.

Cheap (often free) and simple software free of stringent SLAs is popping up everywhere. Most of us now get our breaking news from Google and personal blogs (case in point). We make free long-distance calls on Skype (and don't mind the low QoS), watch video on tiny iPod screens rather than high-definition TVs, and more and more of us are using low-power cell phones that are just good enough to meet our surfing and emailing needs. For many leading companies, the distinction between good-enough "beta" versions and commercially "best" products has blurred beyond recognition. (Gmail finally came out of beta after more than 5 years.)

To be successful in commercial software development, one must fight the urge to gold-plate by adding late-stage functionality. One must also learn how to be firm in the face of ad nauseam pressure for application re-writes, all in the name of making it the best.

Contrary to what the motivational posters profess, when it comes to shipping on time, the pursuit of perfection can become your worst enemy. The same also applies to excessive QA and testing. In the end, even the most comprehensive white-, gray-, or black-box tests can only provide a projection of how your application will perform. The ultimate gauge of usefulness is real users. The earlier you release your product into the wild, the faster you'll discover whether it adequately fills a need.

As I have discovered on many occasions, building a good enough product and releasing it early enough is good enough for most customers—which is good enough for me.

© Copyright 2009 Yaacov Apelbaum All Rights Reserved.

The Death March


Most software professionals view themselves as the masters of their own destiny, analytical and calculated, wisely exercising free choice in all matters of importance.

This was certainly my mental image of myself as well until several years ago, when I gradually came to realize that given enough time on the job, even the most experienced development manager will eventually have to venture into that dark and irrational world of the death march project.

For those unfamiliar with the term, a death march is not a walk through Ezekiel's valley of dry bones. Rather, it is a reference to a development project where requirements exceed the realistic deliverables by at least 50 percent, or where critical resources are cut in half without adjusting functionality and delivery schedule accordingly.

Contrary to common misconception, death march projects are not limited to naïve and overambitious startups. They are quite common in large and mature organizations that should know better, yet for some poorly understood reason continue to practice every form of anti-pattern known to man. How do I know this? Well, to confess my sins, over the years I've participated in several of these projects.

Perhaps you are wondering why any rational person would choose to participate in or initiate a project that from its onset is clearly doomed to fail. The answer has to do with the adaptive strategies we use in order to survive in highly competitive and schedule driven corporate environments.

In a post mortem, most death march software projects exhibit the same pathology. The prominent finding is that the team has worked twice as hard and/or twice as long as would be expected in a "normal" project. So, for example, if a normal work week is 45 hours, then a death march project team works 15-hour days, six days a week, for a 90-hour week. Of course, thanks to a steady diet of caffeine and management coercion, the pressure within the team eventually escalates beyond control and leads to project meltdown.

The psychological drivers behind the willingness of individuals to join what is clearly a long and drawn-out sadomasochistic exercise stem from the strong disdain that many of us have for organizational politics and our refusal to take any part in it. Unfortunately, by not participating in the political horse trading, we sacrifice our ability to effectively influence these irrational projects and leave all the decisions to corporate politicians who have little stake in the actual development effort.

Scott Adams commented on this form of irrational behavior:

“Nothing defines humans better than their willingness to do irrational things in the pursuit of phenomenally unlikely payoffs. This is the principle behind lotteries and dating…”

Having reached this realization myself, I eventually started wondering if there were any early signs or warnings that could help identify an imminent death march. After some introspection and reexamination of previous projects, I have come to conclude that any of the following three (individual or combined) project scenarios will almost guarantee the formation of a death march:

  1. Naivety of Youth—The schedule has been compressed to less than half the estimate based on previous deliveries; so, for example, a project that would normally be expected to take six months will be set to be delivered in three months or less. This form of death march is most common in rah-rah speech environments and "Internet time" startups that naively believe that when it comes to their ability to deliver, the "sky is the limit".
  2. The Senility of Old Age—The development team has been reduced to operating at half capacity. This may have come about as a result of management's belief that a new development language, framework, or technology will double the team's productivity. This is often seen in older companies that are downsizing while at the same time transitioning from one technology to another.
  3. Offshoring Hell—The budget for the project has been cut in half because the business believes that offshoring it is a cheaper alternative. In this scenario, the development manager is informed by the business unit sponsoring the project that it's a "take it or leave it" deal, and if he doesn't accept the budgetary constraints, the business unit will offshore the entire project for less. Thus, in an attempt to save his team from the chopping block, the development manager accepts an impossible challenge.

    Another interesting side effect of this type of project is that as soon as management finally realizes that the project is going nowhere fast, they try to salvage it by throwing additional offshore resources at it, which leads to further delays (Brooks’s law).

It is true that many of the contributing factors to a death march may be beyond your control, but if you find yourself involved in one of these coveted assignments, don't panic; take notice. Contrary to the advice dispensed by some purists (i.e., transfer to another team), being assigned to such a project doesn't mean that you should abandon it or quit your job. My advice is to keep your ethics and personal priorities separate from the politics of the project. Do your best to contribute to the success of the development effort, but in so doing, be sure to set your manager's expectations to realistic levels.

State your concerns in a non-argumentative and level headed manner and clearly communicate your conditions for participating in the project in terms of exactly how much overtime you will agree to and your willingness to work weekends and holidays.

Without advocating or orchestrating a mutiny, encourage your team members to speak their minds as well. In these ways, although you may not be able to cancel the project, you will likely succeed in regaining some control over it and reducing the amount of stress everyone on your team incurs.

Happy coding!

© Copyright 2009 Yaacov Apelbaum All Rights Reserved.