Coinbase failed twice in the past fortnight as users flocked to the platform during peak trading periods on April 29th, 2020, and May 9th, 2020. The U.S.-based exchange released its post-mortem on what caused the double failure of the Coinbase exchange and Coinbase Pro systems, largely attributing it to sudden spikes in traffic on the website.
Coinbase Releases Post-Mortem on System Failures
On April 29th, 2020, at around 10:30 AM Pacific Time (used hereinafter), the Coinbase mobile applications and website became unavailable to users globally as traffic levels on the site caused the API that runs the system to fail. The system was down for over an hour before coming back online. According to the Coinbase statement, the failure was triggered by an increase in the rate of connections to its primary database, causing “an elevated error rate across all API requests” passing through the database.
The Coinbase development team tried to restore the system following the failure, but improper connection handling caused the system to crash once again, leading to the hour-long outage.
Barely two weeks after the outage, both Coinbase and Coinbase Pro reported failures at around 17:18 on May 9th, 2020, as traffic on the site surged on Bitcoin’s collapse to the $8,100 level. The surge increased latency, which in turn saw API error rates spike as users continued to try, unsuccessfully, to log in. The statement reads,
“The elevated error rate was amplified by our load balancer killing otherwise-healthy application instances that failed health checks.”
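The failure mode in the quote can be sketched in a few lines. This is a hypothetical illustration, not Coinbase's actual health-check code: if a health check treats slow responses the same as dead instances, a load balancer will remove otherwise-healthy servers exactly when traffic spikes make everything slow.

```python
# Hypothetical health-check logic illustrating the failure mode described
# above. Thresholds and function names are illustrative assumptions.

LATENCY_FAIL_MS = 500    # naive check: anything slower is "unhealthy"
HARD_TIMEOUT_MS = 5000   # adjusted check: only a true timeout is unhealthy

def naive_check(response_ms: float) -> bool:
    """Marks an instance unhealthy as soon as it slows down under load."""
    return response_ms < LATENCY_FAIL_MS

def adjusted_check(response_ms: float, responded: bool) -> bool:
    """Keeps slow-but-alive instances in rotation; fails only on timeout."""
    return responded and response_ms < HARD_TIMEOUT_MS

# Under heavy load an instance answers in 1200 ms: slow, but alive.
print(naive_check(1200))           # naive balancer would kill the instance
print(adjusted_check(1200, True))  # adjusted check keeps it in rotation
```

Killing slow instances during a surge shrinks capacity further, which makes the remaining instances even slower, amplifying the error rate.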
The exchange is working on a toolkit to quickly discover and remove external services that may be increasing the latency of the website or applications.
Coinbase Working to Prevent Future System Failures
Coinbase aims to implement new measures to prevent system failures like those witnessed in the past. The exchange, which has been in service since 2012, will adjust its health check logic to prevent the removal of perfectly healthy applications from the load balancer, as happened in the May 9th incident. On handling massive traffic in times of market volatility, the exchange statement reads,
“We’re changing our database deployment topology to reduce our overall connection count, limit connection spikes, and separate the routing and daemon processes of the database to limit competition for host resources.”
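The idea of limiting connection spikes can be illustrated with a bounded connection pool. This is a minimal sketch under stated assumptions (a dummy connection factory; the class name and API are invented, not Coinbase's): when the pool is exhausted, new requests wait for a free connection instead of opening additional connections against the primary database.

```python
import queue

class BoundedPool:
    """Hypothetical bounded pool: caps total connections to the database."""

    def __init__(self, create_conn, max_conns: int):
        self._conns = queue.Queue(maxsize=max_conns)
        for _ in range(max_conns):
            self._conns.put(create_conn())

    def acquire(self, timeout: float = 5.0):
        # Blocks (rather than opening a new connection) when exhausted,
        # so a traffic spike queues work instead of spiking connections.
        return self._conns.get(timeout=timeout)

    def release(self, conn):
        self._conns.put(conn)

# Usage with a dummy connection factory:
pool = BoundedPool(create_conn=lambda: object(), max_conns=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()  # reuses c1 rather than creating a third connection
print(c3 is c1)
```

In production this role is typically played by a connection pooler sitting between the application and the database, which is consistent with the topology change the statement describes.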
The exchange is also rolling out safeguards against HTTP failures, so that the impact of any erratic service is contained to a small section of the system rather than spreading to the whole platform.
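One common pattern for this kind of containment is a circuit breaker. The sketch below is an assumption about the general technique, not Coinbase's implementation: after repeated failures against one dependency, the breaker "opens" and rejects calls fast, so the erratic service cannot tie up the rest of the system.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker (illustrative names and thresholds)."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the erratic service.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result

breaker = CircuitBreaker(max_failures=2)

def flaky():
    raise ConnectionError("upstream down")

for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

# Breaker is now open: the next call fails fast, never hitting upstream.
try:
    breaker.call(flaky)
except RuntimeError as err:
    print(err)
```

The design choice is to trade a few rejected requests to the broken dependency for the health of everything else, which matches the stated goal of limiting any failure's blast radius.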