Chaos Monkey is ready to test your online gaming platform to the limit. The question is "Can you handle the chaos?
Words by David Sachs, CEO of Tomobox and an expert on artificial intelligence and its applications in responsible gaming and AML in iGaming. He has worked with leading operators, platform owners and regulators to implement safeguards that protect players and promote safe betting
Try to imagine a bunch of monkeys running around your data center, pulling cables, trashing routers and wreaking havoc on your applications and infrastructure. Ever more crucial in these days of heated competition between online gaming operators, is player experience. Continuity of operations is “Uber-Alles” and avoiding churn, due to service disruption, is the organisational mantra.
That is exactly what the Netflix engineering team crafted, when they thought out IT infrastructure resiliency both hardware and software. Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers and services in your production network to test how remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases. It’s also part of a new engineering regime dubbed SRE – Site Reliability Engineering.
A Site Reliability Engineer (SRE) will spend up to 50% of her time doing "ops" related work such as incident resolution, on-call, and manual intervention. Since the software system that an SRE oversees is expected to be highly automatic and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling or automation making sure the “chaos monkey” is under control.
Why is it fundamental to the player experience?
Continuity of service in the era of the cloud is mandatory and difficult to achieve. There is an infinitesimal number of events that can go wrong. And wrong they will. Multiple vendors service any online operator’s gaming platform, from CRM, to mobile games and payment servers to name just a few. Any disruption of service affects thousands of players that might churn practically “leaving money on the table” for the operator to lose.
SRE and incident management are all the rage today. As Amazon CTO, Werner Vogels, clearly describes a typical major event causing outage: "You see the symptoms, but you do not necessarily see the root cause of it ... you immediately fire off a team whose task is to actually communicate with the customers ... making sure that everyone is aware of exactly what is happening."
Meanwhile, he continues, "internal teams of course immediately start going off and trying to find what's the root cause of this is, and whether we can repair or restore it, or what other kinds of actions we can start taking.". The orchestration of an event is at the heart of incident managers and SREs when disruption occurs.
The future of incident orchestration
Managing a major incident has transformed from an obscure art into a measurable science. Managing an incident is also about informing customers such as affiliates or partners throughout the event. In the day of immediate satisfaction, customers are not looking for you to tell them: ‘wait hold on’, They demand to be in the loop. Meanwhile incident managers are struggling to resolve the issues and make sure transparency rules their action and every stakeholder is informed. Sometimes that means hundreds of people per incident.
Vogels states it very clearly: "I think we can blame ourselves, in terms of not having turned this into sort of a procedure or something that was automated, where we could've had total good control over what the number could be."
This is a key point for Vogels: As you grow and develop, introducing too many points that require human intervention result in points of possible failure. Where possible, automate.
Automation of escalation procedures that kick in as incidents occur, help you master the event and lower time to resolution. Various vendors are adding those capabilities to their monitoring infrastructure. Legacy operational workflows like ServiceNow and infrastructure monitoring companies such as PagerDuty are adding additional layers of incident management to their service suite, through various integrations with third party software. This still leaves the SRE engineer scrambling to deal with multiple ticketing and messaging platforms throughout the incident .
Leading upstarts like Exigence are writing the book anew and redesigning the automation of incident approach around a single pane of glass. A single focal point, to empower SRE teams to efficiently focus on resolution and post-mortem reporting rather than spending endless efforts in self-development and integrations is the future of incident management.
The future of managing endless software and hardware failure points, is based on proactive site reliability integration between NOC and engineering. It also requires a high level of event automation and preparedness in the face of rising incidents.
Block Issue 5 is out:
The Block is a bi-annual publication which illuminates the cutting-edge sectors of AI, blockchain, crypto and emerging tech, with a print run of 5000 delivered to leading brands across the global industry. View our latest issue of the Block here.