At TT, we take outages personally, and any time there is a major issue on any of our platforms, it’s “all hands on deck.” Still, we would much rather avoid outages and unplanned downtime altogether. So, I would like to take this opportunity to let you know what we’ve done over the last few years to avoid system outages and how the TT platform ups the ante with respect to delivering superior system availability. And while much of this blog post relates to the newer TT platform, rest assured, we are committed to delivering the highest possible availability on all platforms: TT, X_TRADER® ASP and TTNET™.
Operating and maintaining a global trading platform is not a trivial endeavor and certainly not your typical IT operation. These systems are by their very nature complex—normalizing and bridging a multitude of different market and customer systems, each speaking a different language and each with different needs—and high availability is a must. Throw multi-region regulatory compliance, security requirements, performance and counter-party upgrades in the mix and it’s not hard to see how the systems get exponentially more difficult to maintain and operate.
The good news is that even with all that complexity, most failure scenarios can be managed in advance by building failover and redundancy capabilities into the system. Typical approaches involve redundancy of core services and components along with a set of playbooks or standard operating procedures to walk users through the resolution. For the most part, anticipated component failures are quick to recover from and have little impact on the user. Unfortunately, extended outages can still occur on rare occasions. This usually happens when changes are introduced into the system that have not been fully vetted and tested, or that are simply implemented incorrectly. In these cases, even the best-laid plans for resiliency will fail you. Even though changes are the root cause of more than 90% of unplanned downtime, we have to continue to deliver the new features, markets and performance upgrades that our users need to stay competitive.
This represents the single greatest point of conflict in our operations: the need to continually innovate while minimizing the risk of doing so.
We understand that at the end of the day, it doesn’t matter if you have the fastest, most feature-rich and powerful trading platform if it’s not available when you need to trade. So given that we have to make changes, how do we do so while eliminating outages and downtime as a result of changes? For starters, with much better planning, testing and communication.
An upgrade to our change processes
While TT has a robust change-management practice in place, there is room for improvement that we expect will result in a reduction of outages. We are making sure all change records have detailed rollback plans and post-change checkout plans. Both the rollback and checkout plans are vetted in advance of the change’s implementation by the operations teams responsible for supporting the service. Cross-region support and operations teams are taking a larger role in the change review process. This keeps everyone around the globe up to date on changes to the platform, which increases checkout coverage and reduces downtime when issues do occur.
In addition, the regional teams can provide critical feedback on local events that might impact volatility. On top of those changes, we will be providing better forecasting of changes to our customers, so they can be prepared if something goes wrong. But even with a perfect set of processes in place, availability and the mean time to recover from an outage are ultimately limited by the architecture of the platform.
Platform architecture can improve system availability
At their core, all of our platforms are built on resilient infrastructure and offer diverse access to exchanges and services in data centers geographically separated by hundreds of miles. This provides our customers with choices for tailoring business continuity and disaster recovery solutions to their needs on our platform. In our opinion, that’s table stakes for any resilient professional platform in today’s trading world.
One of the goals we set for ourselves when we built the TT platform was to deliver professional trading services in a manner that made everything easier for our partners, and that includes making the platform highly available and component failures a non-event. While redundant hardware deployments and highly meshed networks are still at the core of providing a resilient solution, true high availability has to be built into the applications that sit atop the underlying infrastructure, and that is what we’ve done on the TT platform.
All applications on TT that comprise a service are “clustered,” which means a couple of things. First, all transaction processing functions are running on two or more discrete hosts. If one host fails, the other hosts in the cluster are ready to pick up the workload. Take, for example, our order gateways, which have designated accounts and exchange connections for which they are responsible. If one of these hosts fails, the responsibility for these connections will immediately (and automatically) be assumed by another running host, likely without the user noticing. This even works across data centers for exchanges where we host a point of presence in alternate locations.
Let’s contrast this with the operation of our X_TRADER solution, which does not support clustered applications. If an X_TRADER order gateway fails, users must be manually reconfigured to trade on an alternate flavor gateway for the same exchange. This failover configuration requires users to close out all working orders under the existing session and reestablish their orders in the market under the new session. It’s a cumbersome and time-consuming process for the users and system administrators.
On the TT platform, the state of all services is centrally stored and globally replicated, so the state of any transaction can be recovered by any running process in the cluster. Typical recovery occurs for most apps in about 30 seconds.
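To make the idea concrete, here is a minimal sketch of clustered failover backed by shared state. It is an illustration only, not TT's implementation: the class names, the in-memory store and the least-loaded reassignment rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStateStore:
    """Hypothetical stand-in for centrally stored, replicated service state."""
    orders: dict = field(default_factory=dict)  # connection -> working orders

    def record(self, connection: str, order: str) -> None:
        self.orders.setdefault(connection, []).append(order)

    def recover(self, connection: str) -> list:
        # Any host can read the state, not only the one that wrote it.
        return list(self.orders.get(connection, []))


@dataclass
class Cluster:
    """Hypothetical cluster of gateway hosts that own exchange connections."""
    store: SharedStateStore
    owners: dict = field(default_factory=dict)  # connection -> owning host
    hosts: set = field(default_factory=set)

    def assign(self, connection: str, host: str) -> None:
        self.hosts.add(host)
        self.owners[connection] = host

    def _load(self, host: str) -> int:
        return sum(1 for owner in self.owners.values() if owner == host)

    def fail(self, failed_host: str) -> dict:
        """Move the failed host's connections to the surviving hosts.

        Returns connection -> (new host, orders recovered from the store).
        """
        self.hosts.discard(failed_host)
        if not self.hosts:
            raise RuntimeError("no surviving hosts in the cluster")
        recovered = {}
        for connection, owner in sorted(self.owners.items()):
            if owner != failed_host:
                continue
            # Assumed policy: least-loaded survivor, ties broken by name.
            new_host = min(sorted(self.hosts), key=self._load)
            self.owners[connection] = new_host
            recovered[connection] = (new_host, self.store.recover(connection))
        return recovered
```

Because the working orders live in the shared store rather than on the failed host, the new owner can pick them up without the user having to re-enter anything, which is the property the cluster design is after.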
Second, we are mindful of how applications are deployed on the infrastructure to address single points of failure in the system. Application clusters are intelligently distributed at deployment time across the server farm to ensure no single component failure brings down a service. Moreover, discovery of the infrastructure and resilient deployment is fully automated and therefore repeatable and guaranteed. This approach is much more efficient than over-engineering the infrastructure and keeping your fingers crossed that it never fails. The X_TRADER-based platforms, which are resilient in their own right, require much more operational overhead to manage a resilient deployment—a burden we are happy to bear for our customers, but one from which we all would like to eventually move away.
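As a rough illustration of what distributing a cluster to avoid single points of failure can look like, the sketch below spreads instances round-robin across failure domains such as racks. The function names and the host-to-rack mapping are hypothetical, and real placement logic would weigh many more factors.

```python
from collections import defaultdict


def place_cluster(instance_count: int, host_domains: dict[str, str]) -> list[str]:
    """Pick a host for each instance, rotating across failure domains.

    host_domains maps a host name to its failure domain (for example a rack).
    """
    by_domain = defaultdict(list)
    for host, domain in sorted(host_domains.items()):
        by_domain[domain].append(host)
    domains = sorted(by_domain)
    if not domains:
        raise ValueError("no hosts available")
    if instance_count >= 2 and len(domains) < 2:
        raise ValueError("need at least two failure domains")
    placement = []
    for i in range(instance_count):
        # Rotate through domains first, then through hosts inside a domain.
        hosts = by_domain[domains[i % len(domains)]]
        placement.append(hosts[(i // len(domains)) % len(hosts)])
    return placement


def survives_single_domain_failure(placement: list[str],
                                   host_domains: dict[str, str]) -> bool:
    """True if losing any one failure domain still leaves an instance running."""
    return len({host_domains[host] for host in placement}) >= 2
```

Running the same function against a discovered inventory gives the same answer every time, which is what makes an automated deployment repeatable in a way that hand placement is not.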
We’ve even extended the high-availability architecture to our dedicated hosting services, TT Prime and TT Reserved, which give customers the option to run TT components on private infrastructure in our colocated data centers. These dedicated servers are built specifically for each customer’s application to optimize performance of the TT platform.
The architecture of the TT platform not only makes it easier for users to access new markets, adopt new features and realize significant performance gains, but it makes it exponentially easier to manage and operate—which means we can roll out fixes, upgrades and new features with less unplanned downtime. Because TT is fully hosted, we can continuously improve our deployment automation, monitoring and testing, making it possible to push updates in a non-disruptive manner—we’ve built a factory-style pipeline from development to production. And because deployment is automated, if there happens to be an issue with a feature rollout, we can quickly roll back the software in a manner that’s virtually seamless to the end user.
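The rollback idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our actual pipeline: `install` and `healthy` are hypothetical callbacks standing in for the deployment tooling and the post-change checkout.

```python
from typing import Callable, Sequence


def deploy_with_rollback(
    hosts: Sequence[str],
    current_version: str,
    new_version: str,
    install: Callable[[str, str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Roll new_version out host by host and undo everything on a failed check.

    Returns True if every host passed its check, False if the rollout was
    rolled back to current_version.
    """
    updated = []
    for host in hosts:
        install(host, new_version)
        updated.append(host)
        if not healthy(host):
            # Restore the previous version on every host touched so far,
            # most recent first.
            for done in reversed(updated):
                install(done, current_version)
            return False
    return True
```

Updating one host at a time keeps the rest of the cluster serving users while each step is checked, so a bad release is caught and reversed before it reaches every host.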
Finally, instrumentation of our services is built into the TT platform. Availability and performance metrics are captured in real time, and we are making those metrics available to everyone via our status page at status.trade.tt. Over time, we will continue to add more service metrics, such as order-routing and hedge latencies, global backbone latencies and client connection bandwidth utilization.
While the TT platform represents the most resilient piece of technology we’ve ever built, we nevertheless know that in the face of outages, there is no substitute for human decision-making and, more importantly, support.
TT Support is there when you need it
Regardless of which platform you’re using, we stand behind all of our services with real-time support that operates in three regions: Americas, EMEA and Asia/Pacific. These individuals are not only versed in the back-end operation of the trading platforms, but many have traded at some point in their careers and understand the client-facing applications intimately.
The first-level team is backed up by a global team of reliability engineers who manage ongoing platform availability, monitoring, capacity and performance. This team has strong software and networking expertise, which gives them the ability to go deep into the application, all the way down to the packets that traverse the network, and to identify the cause of a failure quickly and efficiently.
And if the global operations team can’t resolve the issue, they can pull in our application development staff, which operates out of Chicago, London, Pune and Singapore. In the end, this means we are equipped to provide real-time engineering support around the clock.