Internal Documents Reveal How Amazon Struggled to Fix the Prime Day Pandemonium
CNBC obtained internal documents that reveal how Amazon struggled to procure enough server capacity on Prime Day. As a stopgap, the company switched to a pared-down front page and then cut off all of its international traffic. Market observers believe Amazon’s sales suffered because the problem occurred at the start of Prime Day, the company’s biggest sales day of the year.
After reviewing the documents, experts suggest that Amazon’s auto-scaling feature may have failed that day. Their conclusion rests on the fact that Amazon had to add servers manually on Prime Day.
Updates from the e-commerce behemoth at the time read, “Currently out of capacity for scaling” and “Looking at scavenging hardware.”
Sable, the software that powers Amazon’s retail and digital arms, malfunctioned on Prime Day. This internal system is responsible for providing storage and computation to Amazon’s main businesses. The debacle resulted in:
- Malfunctioning of multiple services that depend on Sable’s stability
- Video Playback
- Amazon Prime & Prime Now
- Product scans and product packaging
The documents detail how Amazon faced an uphill task in resolving the problem. This is surprising given how much experience Amazon has in running a website of this magnitude, and because Amazon is one of the biggest cloud service providers in the world.
Matthew Caesar said there are two likely reasons for the Prime Day collapse:
- Amazon experienced more traffic than it expected
- Its software ecosystem developed a complex bug
Prime Day 2018 ended with:
- Heavy web traffic for all 36 hours of the sale
- 100 million products sold
- Happy sellers
Amazon claims that losses due to the downtime were minimal. The company made a statement two hours after the site crashed, saying only, “We are working to fix this issue, quickly.”
CNBC also obtained an internal email from Amazon’s global retail CEO, Jeff Wilke. The email clearly conveys Wilke’s disappointment. He states that he never wants another event like Prime Day 2018 and that the company will do whatever it takes to avoid a glitch like this in the future. Amazon has not yet commented on the email.
The dawn of the debacle
Amazon’s headquarters in Seattle, Washington, started to see glitches on its website by noon, Pacific time. Engineers began firefighting and made changes to the company’s IT systems.
Here’s the timeline of how events unfolded:
Matthew Caesar noted that the key change appeared to be a pared-down version of the Amazon front page. Amazon made the switch to reduce website traffic and the load on its servers.
Amazon blocks international website traffic to reduce pressure on Sable.
25% of traffic is permitted to reach the default front page.
12:40 PM to 1 PM
Amazon’s engineers managed to recover Sable for only two minutes. Guidance from the Sable side was firm about blocking further traffic to the website. The website’s condition continued to deteriorate until 1:05 PM.
Website performance suddenly improves. Order rates see huge spikes. Internal sources confirm that, amid the chaos at Amazon’s offices, 300 executives joined an emergency meeting.
Henning Schulzrinne said, “Amazon is struggling to restore order. Shutting off is better because the more people try and reload a page, the worse the problem gets.”
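The traffic limits described in the timeline, such as letting 25% of traffic through to the default front page, resemble what is often called percentage-based load shedding. The sketch below is only an illustration of that general idea, not Amazon’s implementation: the function names `admit_request` and `choose_page`, the page labels, and the use of a hash of a client identifier are all assumptions made for the example.

```python
import hashlib


def admit_request(client_id: str, admit_percent: int) -> bool:
    """Return True if this client falls inside the admitted share of traffic.

    Hypothetical helper: the client is hashed into one of 100 buckets, and
    only buckets below admit_percent are let through.
    """
    if not 0 <= admit_percent <= 100:
        raise ValueError("admit_percent must be between 0 and 100")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < admit_percent


def choose_page(client_id: str, admit_percent: int = 25) -> str:
    """Serve the full page to admitted clients, a lightweight page to the rest."""
    if admit_request(client_id, admit_percent):
        return "default_front_page"   # assumed label for the normal page
    return "fallback_front_page"      # assumed label for the pared-down page
```

Hashing the client identifier, rather than rolling a random number per request, means a rejected visitor gets the same answer on every reload, so repeated reloads do not turn into extra load on the back end.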
Trying to Prevent a Sable Suicide
Caesar was sure that the main reason for Amazon’s embarrassment was a malfunction of the auto-scaling feature. Amazon was simply not ready for the rush, and instead of shutting off the website it chose to block traffic. He supports his view by pointing out that Amazon adding server power manually is a clear indicator that auto-scaling had failed.
Sable is a critical component of Amazon’s retail ecosystem. Last year, on Prime Day, Amazon processed an astonishing 63.5 million leveraging Sable’s capabilities. Sable is used by 400 of Amazon’s teams, so safeguarding the Sable infrastructure is critical to the company.
Carl Kesselman applauded Amazon, stating that any other website would definitely have crashed. He also said it is unimaginable for any other enterprise to operate at Amazon’s scale. He questioned whether the cause was a one-off bug or a deeper problem in Amazon’s information systems.
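To make the auto-scaling argument concrete, here is a minimal sketch of how a scaling planner can run out of room. It is not based on Amazon’s systems: `plan_scaling`, `ScalingDecision`, and all parameters are hypothetical, and the model (servers needed equals load divided by per-server capacity, capped by a fixed hardware pool) is a deliberate simplification.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    target_servers: int  # servers the planner will actually run
    shortfall: int       # servers needed beyond what the pool can supply


def plan_scaling(requests_per_sec: float,
                 capacity_per_server: float,
                 pool_limit: int) -> ScalingDecision:
    """Size the fleet for the current load, capped by the available pool.

    All inputs are illustrative assumptions, not real Amazon figures.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    if pool_limit < 0:
        raise ValueError("pool_limit must not be negative")
    needed = math.ceil(requests_per_sec / capacity_per_server)
    target = min(needed, pool_limit)
    return ScalingDecision(target_servers=target,
                           shortfall=max(0, needed - pool_limit))
```

A nonzero `shortfall` is the point at which automatic scaling can do nothing more and people have to find hardware by hand, which is the situation Caesar infers from the manual server additions.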
This was the first time that Amazon Prime Day was conducted under Neil Lindsay, Amazon’s VP of worldwide marketing and Prime.