This is a short placeholder post to provide an update for all our users about the events of this morning. We will provide an exhaustive level of detail in a subsequent post once the investigation is complete.
Before getting into the explanation of today’s outage, we want to sincerely apologize for the consequences that these events had on many of you. What happened today was not acceptable, and we realize we need to work hard to restore your trust in us.
Beginning at 8:30 AM CT, order routing to CME from TT was interrupted for all users as a result of problems within the TT infrastructure. This continued intermittently, with varying impact on different user groups, over the following two hours.
TT leverages an application for “service discovery” which maintains a list of all running servers so that, for example, a pre-trade risk server can find the order routing gateway it needs to route a particular customer’s order. This same service is leveraged by the order routing gateways to facilitate failover. In other words, this service acts as the “cluster manager” allowing each gateway to know its peers.
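To make the role of service discovery concrete, here is a minimal sketch of the pattern described above. This is purely illustrative; the class and service names are hypothetical and do not reflect TT's actual implementation.

```python
# Hypothetical sketch of a service-discovery registry (names are
# illustrative, not TT's actual code).

class ServiceRegistry:
    """Maintains a list of running servers, keyed by service type."""

    def __init__(self):
        self._servers = {}  # service name -> list of server addresses

    def register(self, service, address):
        # A server announces itself when it starts up.
        self._servers.setdefault(service, []).append(address)

    def lookup(self, service):
        # Return the first registered server for the requested service.
        servers = self._servers.get(service, [])
        if not servers:
            raise LookupError(f"no servers registered for {service!r}")
        return servers[0]

registry = ServiceRegistry()
registry.register("cme-order-gateway", "gw-1.example:9000")
# A pre-trade risk server asks the registry where to route a CME order:
print(registry.lookup("cme-order-gateway"))  # -> gw-1.example:9000
```

In this model, the pre-trade risk server never hardcodes a gateway address; it asks the registry at routing time, which is also what lets the cluster redirect traffic when a gateway fails over.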
The root cause of today’s issue was that the service discovery and cluster management services crashed and, even after recovering, struggled to maintain connectivity with our gateways that connect to CME. The order routing gateways are programmed to shut themselves down when they lose connectivity with the cluster manager, so that a different gateway in the cluster can take over the connections served by the unhealthy/disconnected gateway. Unfortunately, because all gateways lost connectivity to the cluster manager at around 8:30, all exchange sessions between TT and CME were interrupted. From 8:30 until 10:30, the gateways repeatedly lost connectivity to the cluster management service, restarting this cycle each time. The issue was remediated by deploying a new set of cluster management services on new physical servers in the Aurora data center.
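The failover rule described above can be sketched in a few lines. Again, this is a simplified, hypothetical model (names and structure are ours, not TT's); it exists only to show why a cluster-manager outage that all gateways observe at once takes down every exchange session rather than triggering a clean failover.

```python
# Illustrative sketch (not TT's code) of the failover rule: a gateway that
# loses contact with the cluster manager shuts itself down so a healthy
# peer can adopt its exchange sessions.

class Gateway:
    def __init__(self, name):
        self.name = name
        self.running = True

    def on_heartbeat_result(self, cluster_manager_reachable):
        # Called after each heartbeat attempt to the cluster manager.
        if not cluster_manager_reachable:
            self.shutdown()

    def shutdown(self):
        # In the intended design, a peer gateway in the cluster
        # takes over this gateway's exchange sessions.
        self.running = False

cluster = [Gateway("cme-gw-1"), Gateway("cme-gw-2"), Gateway("cme-gw-3")]

# Normal case: one gateway loses its heartbeat, peers keep running.
cluster[0].on_heartbeat_result(cluster_manager_reachable=False)
assert [gw.running for gw in cluster] == [False, True, True]

# Outage case: the cluster manager itself is down, so every gateway
# observes the same disconnect and every gateway shuts down at once.
for gw in cluster:
    gw.on_heartbeat_result(cluster_manager_reachable=False)
assert not any(gw.running for gw in cluster)  # no peer left to fail over to
```

The second scenario is the failure mode from this morning: a rule that is safe when one gateway becomes unhealthy becomes a cluster-wide shutdown when the shared cluster manager is the component that fails.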
We will provide complete transparency into the details of the event over the next 24-48 hours, once the technical teams complete their investigation, including our plans to ensure this never occurs again.