The Cost of Downtime
Everyone understands that downtime for a website or online service is not a good thing. A simple example is an online shop that goes down and cannot process new customer orders. The cost of downtime in this example is the potential lost revenue during the time of the outage.
Lost revenue for the duration of an outage is easy to understand and calculate. However, the costs of an outage often do not end there. Depending on the type of business or service, downtime can have lasting detrimental effects on the future of your operation. Being aware of the additional costs of downtime will help you prepare for outages and build a more robust product that customers can trust. This article aims to raise awareness of these lesser-known costs so that readers are better prepared to mitigate the effects of an inevitable outage.
Direct revenue and sales
Customers and users cannot use your online services if they cannot reach them. Revenue and sales lost during an outage may not be recoverable if customers are likely to move on to competitors after witnessing an outage.
According to Gartner, the average cost of downtime is $5,600 per minute. Of course, large billion-dollar corporations inflate this average quite a bit. However, for small businesses the average cost is still $137 to $427 per minute. If you are unaware of an outage and it lasts longer than an hour, the financial costs can be devastating.
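To make those per-minute figures concrete, here is a minimal sketch of the arithmetic. It assumes the cost per minute stays constant for the whole outage, which is a simplification, and the function name `downtime_cost` and the 60-minute duration are illustrative choices rather than anything from the figures quoted above.

```python
def downtime_cost(cost_per_minute: float, outage_minutes: float) -> float:
    """Direct revenue lost during an outage, assuming a constant per-minute cost."""
    if cost_per_minute < 0 or outage_minutes < 0:
        raise ValueError("cost and duration must be non-negative")
    return cost_per_minute * outage_minutes


# Per-minute rates are the figures quoted in the text;
# the one-hour duration is an assumption made for illustration.
RATES = [
    ("small business, low end", 137),
    ("small business, high end", 427),
    ("overall average", 5600),
]

for label, rate in RATES:
    print(f"{label}: ${downtime_cost(rate, 60):,.0f} per hour of downtime")
```

Under these assumptions, a single unnoticed hour works out to roughly $8,220 to $25,620 for a small business, and $336,000 at the overall average rate.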
Customer loyalty and trust
Your customers' and users' time is valuable, and when they are met with a 500 Internal Server Error from your website, their trust is breached. If you offer a SaaS web application and users are unable to access it, your app’s reputation as a reliable product is hurt. An outage gives users a reason to seek out competing products and churn, leading to lost revenue. On top of that, if you rely on good reviews and word-of-mouth to market your business, users who leave because of outages are unlikely to recommend you to their peers, capping your growth.
In fact, customer churn and reputation damage are the 1st and 3rd largest downtime costs, respectively. Yet, it is difficult to place a dollar value on damaged reputation because the effects on revenue can linger well after the outage depending on the business model. Regardless of the product or service, reputation for being reliable is important for users.
SEO
Although only a problem if downtime lasts days or weeks, web crawlers will notice that your website is down. As a result, your website will drop in rankings, leading to loss of traffic and sales. Building up good SEO is a long and strategic process, and a severe outage can wipe out those SEO gains easily.
Although the extent to which downtime can affect SEO rankings is unclear, there is an effect that should be taken into account when calculating the costs of downtime.
Internal productivity
Internal productivity slows every time there is an outage. Resources and time must be directed away from core revenue-generating projects and towards diagnosis and incident management. If there are no established processes for quickly diagnosing and fixing downtime, more and more time is spent away from core business tasks.
Context switching and thrashing between fixing downtime and normal work are terrible for morale and lead to burnout faster than expected. The health of the product depends on the health of those working on it. Frequent outages and context switching indicate a need for a significant change in how downtime is handled.
Minimizing downtime costs
No server, API, or website is perfect. Mistakes happen, network errors occur, downtime is inevitable. If AWS can experience outages, so can you. It is in everyone’s best interest to be prepared for an outage event.
Detection and response
The costs of downtime are estimated in dollars lost per minute, so early detection and response are critical in minimizing them. Investing in an effective uptime monitoring and alerting solution is crucial to detecting issues quickly. Effective logging, incident management, and communication tools also help you respond effectively.
A proactive approach to monitoring and alerting will pay off many times over when that first major outage occurs.
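As a rough illustration of what the detection step involves, here is a minimal sketch of an uptime check using only the Python standard library. It is not how any particular monitoring product works. The function names `is_up` and `should_alert`, the 5-second timeout, and the rule of alerting after three consecutive failures are all assumptions chosen for the example.

```python
import http.client
import urllib.request


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL responds with a 2xx or 3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError (4xx/5xx responses), timeouts,
        # and connection failures; ValueError covers malformed URLs.
        return False


def should_alert(recent_checks: list[bool], failure_threshold: int = 3) -> bool:
    """Alert only after `failure_threshold` consecutive failed checks.

    Requiring several failures in a row is an assumed policy that trades a
    little detection speed for fewer false alarms from one-off network blips.
    """
    if len(recent_checks) < failure_threshold:
        return False
    return not any(recent_checks[-failure_threshold:])
```

In practice, something like this would run on a schedule from a machine outside your own infrastructure, append each `is_up` result to a history list, and notify whoever is on call once `should_alert` returns True. A check that runs on the same server it monitors goes down with it.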
Communication
Because downtime is inevitable, clearly communicating with end-users should always be part of an effective downtime response plan. If end-users do not believe you are adequately handling an outage because they are not being informed of any updates, that is on you. During an outage, customer support channels and email inboxes will fill up with inquiries about what is going on. At a minimum, an embedded status message on your website should let users know that there is an ongoing incident that is being worked on.
Other communication methods may be more appropriate on a case-by-case basis:
- dedicated status pages and public incident report pages
- social media
- public Slack or Discord support channels
Communication should include regular updates, including a final message to users when service is restored. Communicating effectively will reduce the reputation damage and possibly even instill customer confidence in your business.
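One way to keep an embedded status message and its regular updates consistent is to treat an incident as an ordered list of updates and always display the latest one. The sketch below is only an illustration: the `IncidentUpdate` class, the `banner_text` function, and the status labels ("investigating", "identified", "resolved") are hypothetical names, not part of any specific status page tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentUpdate:
    status: str   # illustrative labels: "investigating", "identified", "resolved"
    message: str
    posted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def banner_text(updates: list[IncidentUpdate]) -> str:
    """Return the text for an embedded status banner, or "" if there is no incident."""
    if not updates:
        return ""
    latest = updates[-1]
    if latest.status == "resolved":
        # The final message tells users that service has been restored.
        return f"Resolved: {latest.message}"
    return f"Ongoing incident ({latest.status}): {latest.message}"


# Example timeline for a hypothetical checkout outage.
timeline = [
    IncidentUpdate("investigating", "We are looking into failed checkouts."),
    IncidentUpdate("identified", "A database failover is in progress."),
]
print(banner_text(timeline))
```

Keeping the full timeline rather than a single message also gives you a timestamped record to draw on when writing up the incident afterwards.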
Postmortems
“Those who don’t know history are destined to repeat it”
or something like that.
Two separate outages should not share the same root cause. Outages are an opportunity to learn about the weaknesses in your online services. Properly documenting each outage and implementing new safeguards against future ones should be a habit baked into your company culture. There are several tools out there that help with documenting and tracking action items; however, it is ultimately up to you and your team to follow through and reduce the risk of another outage.
Conclusions
Downtime can cost anywhere from a couple hundred to several thousand dollars per minute. Some of the costs, such as direct loss of revenue and sales, are easy to measure and analyze. Others, such as reputation damage, are less apparent yet are probably larger and more lasting than the more visible costs.
Downtime is an eventual certainty. The best we can do as developers, product owners, SREs, and business people is to ensure we are prepared for it. Minimizing downtime costs requires a multifaceted approach, starting with fast detection, diagnostic tools, and internal and external communication, and ending with postmortem analysis and executing action items to prevent another incident.
To help detect downtime, I built Komonitor, a website and API monitoring SaaS. Check it out if you are looking for a monitoring and alerting solution!
Thanks for reading this far. If you are interested in more tech content like this, following on Medium is much appreciated, and be sure to check out my Twitter profile!