Skype broke.
This should serve as a lesson to us all. Sometimes the old ways are the best, and we ignore them at our peril.
The folks at Skype said:
“On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.”
Yep, that’s right. Microsoft sent out a patch, and it brought down Skype.
TCP is a great example of a simple, elegant design. Admittedly, TCP is coming apart at the seams: it doesn’t support enough ports; it’s a jack-of-all-trades transport that isn’t particularly efficient; it requires a lot of computation; and it’s redundant in a lot of encryption and compression systems. Companies like Netli (acquired by Akamai) built businesses on the inefficiency of TCP. Making TCP efficient is a major factor in how Application Front End products (like Citrix’s NetScaler) speed up sites and reduce the load on servers.
But TCP is elegant. One of the things it does best is recover from problems. Wikipedia tells us:
“Modern implementations of TCP contain four intertwined algorithms: Slow-start, congestion avoidance, fast retransmit, and fast recovery (RFC2581).”
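To make the first two of those algorithms concrete, here is a rough sketch of how a sender's congestion window might evolve over successive round trips. It is a toy model, not a real TCP stack: the function name `simulate_cwnd`, the per-round-trip granularity, and the way a loss is handled (halve the threshold and resume from it, a simplified stand-in for fast retransmit and fast recovery) are all assumptions made for illustration.

```python
def simulate_cwnd(rounds, ssthresh=16, loss_rounds=()):
    """Return the congestion window (in segments) at the start of each round trip.

    Toy model: one window adjustment per round trip, and losses happen only
    in the rounds listed in loss_rounds (a hypothetical input).
    """
    cwnd = 1
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Loss detected: halve the threshold (never below 2 segments)
            # and continue from there, roughly what fast recovery does.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: the window doubles every round trip,
            # capped at the threshold in this simplified model.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: grow by one segment per round trip.
            cwnd += 1
    return history


if __name__ == "__main__":
    print(simulate_cwnd(10, ssthresh=8, loss_rounds={6}))
    # prints [1, 2, 4, 8, 9, 10, 11, 5, 6, 7]
```

The shape is the point: fast growth while the path looks clear, cautious growth near the last known limit, and a sharp cut as soon as trouble shows up.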
Ethernet does this well, too. When a collision occurs, senders keep transmitting long enough to make sure every station has detected the collision, then back off for a random length of time. From Wikipedia, again:
“This can be likened to what happens at a dinner party, where all the guests talk to each other through a common medium (the air). Before speaking, each guest politely waits for the current speaker to finish. If two guests start speaking at the same time, both stop and wait for short, random periods of time (in Ethernet, this time is generally measured in microseconds). The hope is that by each choosing a random period of time, both guests will not choose the same time to try to speak again, thus avoiding another collision. Exponentially increasing back-off times (determined using the truncated binary exponential backoff algorithm) are used when there is more than one failed attempt to transmit.”
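The back-off rule in that last sentence fits in a few lines. The sketch below assumes classic 10 Mbit/s Ethernet numbers (a 51.2 microsecond slot time, an exponent capped at 10, and a frame abandoned at the 16th attempt); the function name `backoff_delay_us` and the optional `rng` parameter are made up for this example.

```python
import random

SLOT_TIME_US = 51.2   # slot time for 10 Mbit/s Ethernet, in microseconds
MAX_EXPONENT = 10     # the exponent stops growing here (the "truncated" part)
MAX_ATTEMPTS = 16     # the frame is abandoned at this many attempts


def backoff_delay_us(collisions, rng=random):
    """Random wait, in microseconds, after `collisions` consecutive collisions."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("too many collisions, giving up on this frame")
    exponent = min(collisions, MAX_EXPONENT)
    # Pick a whole number of slots uniformly from 0 .. 2**exponent - 1.
    slots = rng.randrange(2 ** exponent)
    return slots * SLOT_TIME_US


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 15):
        print(n, backoff_delay_us(n))
```

After one collision a station waits 0 or 1 slots; after three, anywhere from 0 to 7. Each failure doubles the range, so the more crowded the wire, the more the senders spread themselves out.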
Think about that for a second. The guys who built these protocols realized that congestion would happen, and built models for dealing with unpredictable situations by backing off a random time, and for detecting congestion and avoiding it. And this was back in the day when there were only a few nodes on the Internet. Yet they function reasonably well even today.
So why didn’t Skype work properly? Without getting into too many details, the folks at Skype explained:
“Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly.”
There are two important lessons to be learned here:
- First, it’s critical to look at traffic volumes. Many of the people who buy our UPM equipment used to rely on synthetic testing to monitor their sites. Often, they couldn’t answer simple questions like, “How many users do you have on your site today?” Their marketing department might know, through web analytics tags, how many sessions were active, but there was no way to stitch together traffic levels and performance.
- And second, the Skype incident is a great example of how complex systems can fail in unexpected ways, and how everything on the Internet is intertwingled. Microsoft’s practice of updating and automatically rebooting hundreds of millions of computers independent of owner control creates tremendous traffic spikes, and the same is true of other web-connected services such as antivirus updates and desktop plug-ins. But the impact of these spikes isn’t tracked or understood.
Understanding the relationship between load and performance is critical for anyone running a production web application. Applications will break, and without the right information at your disposal, you won’t be able to detect problems or fix them effectively.
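As a rough illustration of putting load and performance side by side, the sketch below groups request records into one-minute buckets and reports the request count and mean latency for each. The input format (pairs of timestamp in seconds and latency in milliseconds) and the function name `load_vs_latency` are assumptions for the example, not the output of any particular monitoring product.

```python
from collections import defaultdict


def load_vs_latency(requests, bucket_seconds=60):
    """Group (timestamp_seconds, latency_ms) pairs into time buckets.

    Returns a list of (bucket_start, request_count, mean_latency_ms),
    sorted by bucket start time.
    """
    buckets = defaultdict(list)
    for timestamp, latency_ms in requests:
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(latency_ms)
    return [
        (start, len(latencies), sum(latencies) / len(latencies))
        for start, latencies in sorted(buckets.items())
    ]


if __name__ == "__main__":
    # Hypothetical sample: a quiet first minute, then a busier, slower one.
    sample = [(0, 100), (30, 120), (61, 200), (62, 220), (65, 240), (119, 260)]
    for row in load_vs_latency(sample):
        print(row)
    # prints (0, 2, 110.0) then (60, 4, 230.0)
```

Even a table this crude answers the question synthetic tests can't: when traffic doubled, what happened to response time?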
With billions of nodes on the Internet and millions of changes a day to production systems, Sod’s Law (a variant of Murphy’s Law) is definitely true: “Anything that can go wrong, will.” But it’s also possible to invoke Hanlon’s razor, a corollary to Murphy, which says, “Never attribute to malice that which is adequately explained by stupidity.”




