Building resilient systems in the Financial Services industry

Introduction

Customers are becoming ever more demanding and expect their financial services to be available 24/7. Unfortunately, the current monolithic architectures in the financial services industry are not very resilient to failures. Banks and insurers have invested heavily in redundant infrastructure to increase the availability of their application landscape, but these investments only cover a limited set of failures (mostly hardware failures). Any uncovered failure (e.g. network issues or application failures due to software bugs) and maintenance activities (such as software releases, operating system upgrades or database maintenance) still lead to service unavailability.
Customers, nowadays used to 24/7 available online services like Facebook and Google, no longer understand why banks and insurers cannot provide the same levels of availability. This gives the financial sector an extra argument (next to the need for increased agility) to evolve its monolithic architectures towards microservice-based architectures, which are designed for failure. Such an architecture breaks the functionality into highly decoupled, small, simple and distributed chunks, which can fail without bringing down the entire system.
A microservice-based architecture is not a means to avoid failure. On the contrary: since it consists of many small components with a lot of communication between them, it has an even higher probability of isolated failures. Microservices should therefore be explicitly designed for failure. This results in large-scale resilience at the cost of some small-scale unreliability.
Such a resilient design requires safety mechanisms at various levels.

Mechanisms to automatically handle and react to failure

Microservices should be designed to be:
  • Resilient, i.e. guard themselves against failures of the microservices they depend on, thus avoiding the ripple effect (i.e. preventing one failing microservice from bringing down the entire system).
  • Self-healing, i.e. quickly identify failure and recover as soon as possible (by automatically restoring service).
Different techniques exist to construct such resilient and self-healing services:
  • Load balancer: a load balancer is typically used to spread the load across multiple instances of a service, but it can also make services more resilient. The balancer can perform regular health checks and remove failed service instances from the pool.
  • Circuit breaker pattern: this pattern is used to detect failures and encapsulate them (open the circuit after X unsuccessful communication attempts and close it again when the service is restored), thus preventing failures from recurring constantly and impacting dependent services (see the combined circuit-breaker/timeout sketch after this list).
  • Timeouts: when a service is slow, a timeout mechanism should be in place. This ensures that services do not wait too long for a response, preventing pending threads from piling up (and the maximum number of concurrent threads from being reached).
  • Bulkheads / Throttling: this technique limits the number of concurrent calls to a component, by only accepting a limited number of calls per time unit. This caps the number of resources (threads) waiting for a reply from the component (a minimal sketch follows this list).
  • Elastic scalability: dynamically and rapidly spinning services up and down (often via Docker containers) makes it possible to react quickly and proactively to failure. E.g. a system can react to increased response times (which could lead to failures) by spinning up extra nodes in a distributed system. Several cluster management tools, like Apache Mesos, Google's Kubernetes and Docker Swarm, exist to manage this dynamically.
  • Graceful degradation: the typical behavior of a service, when a dependent service is down, is to stop processing the request and reply with an internal server error. This accelerates the ripple effect and results in a bad user experience. Instead, services should implement graceful degradation, i.e. provide a plan B as soon as they detect failure of a dependent service (the cached-response variant is sketched after this list). Typical examples of graceful degradation are:
    • Use a cached response (which might not be up-to-date anymore), instead of real-time data retrieval.
    • Provide a canned response, i.e. a templated or scripted standard response
    • Call a different service, i.e. a backup service
    • Perform a simplified calculation in the service itself (instead of calling the dependent service)
    • Switch the entire application to read-only mode
  • Build idempotent services: an idempotent service gives the same result when called multiple times with the same input parameters. Such services are more resilient, as a service call can be retried (e.g. in case of a time-out) without the risk of bringing the system into an inconsistent state. Idempotency can be achieved by:
    • Avoiding "deltas" and instead sending "current-value" type messages. This is not always possible; e.g. sending a payment or a securities order will always be a "delta" operation.
    • Filtering out duplicates. This can be achieved by adding a unique identifier to each message and tracking it: if an identifier was already processed successfully, the request is rejected (a sketch follows this list). The disadvantage of this approach is that the service must store state, which makes it more complex and itself subject to failure. This is especially the case when there are multiple instances of the service, as the list of successfully processed requests must then be shared across the different instances, which could be located on servers in different regions.
  • Retry / Recycling logic: a service should implement retry/recycling logic, i.e. when a call to a dependent service fails, the request should automatically be retried/recycled later. After a number of unsuccessful attempts, the request should be moved to an error queue/cache (dead letter queue/cache), as illustrated in the last sketch below.
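To make the circuit breaker and timeout techniques more concrete, below is a minimal, illustrative Java sketch (a production system would rather use a proven library such as Netflix's Hystrix or Resilience4j). The circuit opens after a configurable number of consecutive failures, fails fast while open and closes again after a cool-down period; each call is additionally guarded by a timeout, so that threads cannot pile up behind a slow service. All class and method names are hypothetical.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Supplier;

    // Minimal, illustrative circuit breaker combined with a call timeout.
    public class CircuitBreaker {
        private final int failureThreshold;  // open after this many consecutive failures
        private final Duration coolDown;     // how long the circuit stays open
        private final Duration callTimeout;  // maximum time to wait for a response
        private final ExecutorService executor = Executors.newCachedThreadPool();
        private int consecutiveFailures = 0;
        private Instant openedAt = null;

        public CircuitBreaker(int failureThreshold, Duration coolDown, Duration callTimeout) {
            this.failureThreshold = failureThreshold;
            this.coolDown = coolDown;
            this.callTimeout = callTimeout;
        }

        public synchronized <T> T call(Supplier<T> remoteCall) throws Exception {
            // While open, fail fast until the cool-down period has elapsed.
            if (openedAt != null && Instant.now().isBefore(openedAt.plus(coolDown))) {
                throw new IllegalStateException("Circuit open: failing fast");
            }
            Future<T> future = executor.submit(remoteCall::get);
            try {
                // Timeout guard: do not wait longer than callTimeout for a reply.
                T result = future.get(callTimeout.toMillis(), TimeUnit.MILLISECONDS);
                consecutiveFailures = 0;  // success closes the circuit again
                openedAt = null;
                return result;
            } catch (Exception e) {
                future.cancel(true);
                if (++consecutiveFailures >= failureThreshold) {
                    openedAt = Instant.now();  // too many failures: open the circuit
                }
                throw e;
            }
        }
    }

A caller would wrap every outbound call, e.g. breaker.call(() -> accountClient.getBalance(iban)), ideally combined with a graceful-degradation fallback for the case where the circuit is open.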
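The bulkhead/throttling idea can similarly be sketched with a counting semaphore. Note that this minimal sketch limits the number of concurrent calls rather than the number of calls per time unit; a time-based rate limiter would be a straightforward extension. Names are again hypothetical.

    import java.util.concurrent.Semaphore;
    import java.util.function.Supplier;

    // Illustrative bulkhead: at most maxConcurrent callers may be waiting
    // on the protected component at any moment; excess callers are rejected
    // immediately instead of blocking a thread.
    public class Bulkhead {
        private final Semaphore permits;

        public Bulkhead(int maxConcurrent) {
            this.permits = new Semaphore(maxConcurrent);
        }

        public <T> T execute(Supplier<T> protectedCall) {
            if (!permits.tryAcquire()) {
                throw new IllegalStateException("Bulkhead full: call rejected");
            }
            try {
                return protectedCall.get();
            } finally {
                permits.release();
            }
        }
    }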
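The cached-response variant of graceful degradation could look as follows: a minimal sketch, assuming a hypothetical RateClient interface towards the dependent service.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative "plan B": serve the last known (possibly stale) value
    // when the dependent service fails, instead of replying with a 500 error.
    public class ExchangeRateService {
        private final Map<String, Double> lastKnownRates = new ConcurrentHashMap<>();
        private final RateClient rateClient;  // hypothetical client for the dependent service

        public ExchangeRateService(RateClient rateClient) {
            this.rateClient = rateClient;
        }

        public double getRate(String currencyPair) {
            try {
                double rate = rateClient.fetchRate(currencyPair);  // real-time retrieval
                lastKnownRates.put(currencyPair, rate);            // refresh the cache
                return rate;
            } catch (Exception e) {
                // Dependent service is down: degrade gracefully to the cached value.
                Double cached = lastKnownRates.get(currencyPair);
                if (cached != null) {
                    return cached;
                }
                throw new IllegalStateException("No rate available for " + currencyPair, e);
            }
        }

        // Hypothetical interface representing the dependent microservice.
        public interface RateClient {
            double fetchRate(String currencyPair) throws Exception;
        }
    }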
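The duplicate-filtering technique for idempotent services is sketched below for a single service instance with an in-memory set; as noted above, multiple instances would have to share this state (e.g. via a distributed cache), which this sketch deliberately omits.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative idempotent receiver: each message carries a unique
    // identifier, and messages seen before are rejected, so that retries
    // cannot bring the system into an inconsistent state.
    public class IdempotentReceiver {
        private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

        public boolean process(String messageId, Runnable businessLogic) {
            // add() returns false if the identifier was already present,
            // i.e. the message is a duplicate and must be filtered out.
            if (!processedIds.add(messageId)) {
                return false;  // duplicate: reject, it was already processed
            }
            try {
                businessLogic.run();
                return true;
            } catch (RuntimeException e) {
                // Processing failed: forget the id, so a later retry is accepted.
                processedIds.remove(messageId);
                throw e;
            }
        }
    }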
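Finally, the retry/recycling logic with a dead letter queue can be sketched as follows, with an in-memory queue standing in for real messaging infrastructure.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    // Illustrative retry logic: a failed request is retried a fixed number
    // of times and then moved to a dead letter queue for manual follow-up.
    public class RetryingConsumer<T> {
        private final int maxAttempts;
        private final BlockingQueue<T> deadLetterQueue = new LinkedBlockingQueue<>();

        public RetryingConsumer(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        public void handle(T request, Consumer<T> processor) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    processor.accept(request);
                    return;  // success: no retry needed
                } catch (RuntimeException e) {
                    // A real system would back off (e.g. exponentially) between attempts.
                }
            }
            deadLetterQueue.offer(request);  // give up: park for manual intervention
        }

        public BlockingQueue<T> deadLetters() {
            return deadLetterQueue;
        }
    }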

Mechanisms to identify and debug failures

Microservices should also be designed to detect and even predict abnormalities. This makes it possible to react proactively to potential failures. If a failure does occur, a maximum of information should be collected, so that the failure can easily be analysed and debugged.
  • Logging: all service activities, both successful and unsuccessful, should be logged. As a microservice architecture is composed of many small components, the system should provide a central logging service, so that the log entries of the different microservices can be aggregated into one central view.
  • Tracing: as a request is handled by multiple microservices, it is often difficult to trace back the path of a (failed) request. This is solved by associating a correlation token (also called trace or tracking ID) with each request. This token is passed from microservice to microservice and included in all log entries. By filtering the central log on the correlation identifier, it is possible to find back all actions performed in the context of the request (a minimal sketch follows this list).
  • Clear error messages: when a service generates an error, it is important to return a clear error message. Such an error message should consist of: a unique error code (identifying the type of error), a clear message telling the end user what happened and what to do next, a link to documentation providing additional information on the error, and a unique error identifier. This unique error identifier allows searching directly in the central log (a possible payload structure is sketched after this list).
  • Monitoring: breaking a system up into smaller services makes monitoring significantly more complex, especially since the correct working of each individual service is not sufficient to ensure that the integration points between them are also working correctly.
    Real-time monitoring of the application at various levels is therefore essential:
    • Technical monitoring, like monitoring resource usage (memory, disk space, CPU…), packets going over the wire, and the JVM (thread usage, memory growth, garbage collector activity…)
    • Applicative monitoring, like the number of service requests per second, service response times…
    • Business monitoring, like the number of payments received per minute
    • Semantic monitoring, which combines test execution and real-time monitoring to continuously evaluate the applications. Those tests mimic user actions via fake events and check if the system behaves as expected.
  • Alerting: to avoid having to monitor a system continuously for anomalies, it is also important to categorize exceptions, so that alerts can be generated when important anomalies occur. This ensures that operators are proactively alerted when they need to intervene manually.
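As an illustration of the correlation token described above, the minimal sketch below uses SLF4J's Mapped Diagnostic Context (MDC), so that every log entry automatically carries the token; propagating the token to downstream services (e.g. as an HTTP header) is assumed but not shown.

    import java.util.UUID;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    // Illustrative correlation-token handling: the token is taken over from
    // the incoming request (or generated at the edge) and placed in the MDC,
    // so that the logging layout can include it in every log entry.
    public class CorrelatedRequestHandler {
        private static final Logger log = LoggerFactory.getLogger(CorrelatedRequestHandler.class);

        public void handle(String incomingCorrelationId, Runnable businessLogic) {
            String correlationId = incomingCorrelationId != null
                    ? incomingCorrelationId          // propagate the existing token
                    : UUID.randomUUID().toString();  // or start a new trace
            MDC.put("correlationId", correlationId);
            try {
                log.info("Request started");  // a layout with %X{correlationId} adds the token
                businessLogic.run();
                log.info("Request finished");
            } finally {
                MDC.remove("correlationId");  // avoid leaking the token to other requests
            }
        }
    }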
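The error-message structure described above could be captured in a small payload like the illustrative Java record below (the field names are assumptions, not a standard):

    import java.util.UUID;

    // Illustrative error payload containing the four elements described above.
    public record ErrorResponse(
            String errorCode,         // unique code identifying the type of error, e.g. "PAY-4001"
            String message,           // tells the end user what happened and what to do next
            String documentationUrl,  // link to additional information on this type of error
            String errorId) {         // unique id of this occurrence, searchable in the central log

        public static ErrorResponse of(String errorCode, String message, String documentationUrl) {
            return new ErrorResponse(errorCode, message, documentationUrl,
                    UUID.randomUUID().toString());
        }
    }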

Mechanisms to manually intervene on failures

An important part of designing systems for failure is providing tools that allow operators to manually intervene on failures.
Such tools should provide the following functionalities:
  • Move a service request from the error queue back to the normal processing queue (e.g. after a bug has been fixed); see the sketch after this list.
  • Cancel a service request from the error queue (i.e. when it no longer needs to be processed).
  • Adapt the content of a service request (one by one or in bulk). This functionality should be fully audited, so that all adaptations can be tracked.
  • Manually spin up or spin down instances of services. This can be used to deploy an updated version of a service (after a fix), increase or decrease service capacity, disable a specific service (by shutting down all its instances) or change the topology of the distributed service architecture.
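As an illustration of the first two operations, the minimal sketch below replays or cancels requests on an in-memory error queue; a real tool would work against the actual messaging infrastructure and write a full audit trail.

    import java.util.concurrent.BlockingQueue;

    // Illustrative operator actions on an error (dead letter) queue.
    public class ErrorQueueAdmin<T> {
        private final BlockingQueue<T> errorQueue;
        private final BlockingQueue<T> processingQueue;

        public ErrorQueueAdmin(BlockingQueue<T> errorQueue, BlockingQueue<T> processingQueue) {
            this.errorQueue = errorQueue;
            this.processingQueue = processingQueue;
        }

        // Move a request back to normal processing (e.g. after a bug fix).
        public boolean replay(T request) {
            return errorQueue.remove(request) && processingQueue.offer(request);
        }

        // Cancel a request that no longer needs to be processed.
        public boolean cancel(T request) {
            return errorQueue.remove(request);
        }
    }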

Mechanisms to test for failure

The best way to validate the resilience (failure recovery logic) of a microservice-based architecture is to regularly introduce diverse types of failures (fault injection) into the system, i.e. to break things on purpose. This should be done both in the test environments and in the production environment, as a production environment often has specific settings that do not exist in other environments.
Following Amazon's example, many companies now organize Gameday exercises. During such an exercise, companies test the response of systems, software and people to a particular disastrous event, directly in production and during a standard business day.
Many banks and insurers also execute DRP (Disaster Recovery Plan) tests once or twice a year, but unlike at the major technology firms, these tests typically happen during weekends (often even with downtimes announced to customers) and are much more limited in scope (typically restricted to testing the DRP of one or a few applications).
Other companies go even one step further and introduce production failures at random. Whether failures are injected at random or planned, these companies all use a set of tools to handle the creation of the failures, the roll-back and clean-up of the induced failures and the monitoring of the failure recovery. The best-known examples of such tools are the Simian Army tools from Netflix and the Gremlin failure-testing framework (providing Failure as a Service).
Such tools make it possible to induce several types of failures, such as:
  • Shut down virtual machines or kill instances and services (e.g. Chaos Monkey). This can be extended to simulating an outage of an entire availability zone (e.g. Chaos Gorilla).
  • Induce artificial delays in the communication layer to simulate service degradation and check whether the dependent services respond appropriately (e.g. Latency Monkey). By making these delays very large, companies can even simulate a service being down, without physically bringing its instances down (a minimal sketch is shown at the end of this section).
  • Simulate an unexpected service answer (i.e. answer not in line with the expected and documented answer).
  • Simulate noisy neighbours in a cloud environment. The noisy-neighbour problem arises because, in a cloud environment, physical servers are split into multiple shared virtual machines. Since disk I/O is typically very difficult to partition, a neighbour performing very large amounts of disk I/O can degrade the performance of the other virtual machines on the same physical server.
This list can be extended with many other types of failures, like JVM memory leaks, network issues, DNS failures, DoS attacks…
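As an illustration of such latency injection, the minimal sketch below wraps a call with a random artificial delay; in a real exercise this logic would sit in the communication layer (e.g. a proxy or client interceptor) and only be enabled during the test. All names are hypothetical.

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.function.Supplier;

    // Illustrative latency injection: add a random delay before delegating
    // to the real call, to test how dependent services cope with slowness.
    public class LatencyInjector {
        private final long minDelayMs;
        private final long maxDelayMs;
        private volatile boolean enabled = true;  // switched on only during the exercise

        public LatencyInjector(long minDelayMs, long maxDelayMs) {
            this.minDelayMs = minDelayMs;
            this.maxDelayMs = maxDelayMs;
        }

        public <T> T around(Supplier<T> realCall) {
            if (enabled) {
                long delay = ThreadLocalRandom.current().nextLong(minDelayMs, maxDelayMs + 1);
                try {
                    Thread.sleep(delay);  // a very large delay effectively simulates "down"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            return realCall.get();
        }

        public void setEnabled(boolean enabled) {
            this.enabled = enabled;
        }
    }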

Generalise design for failure

As explained above, designing a resilient microservice-based architecture requires a lot of upfront design and introduces a lot of additional complexity. Unfortunately, in today's competitive financial landscape, such a design is no longer an option but a necessity.
The above mechanisms nonetheless focus only on non-functional failures. Just as important is dealing with functional failures, caused by application bugs and errors or flaws in the requirements.
A system should therefore also be designed to:
  • Identify and analyse bugs or requirement flaws as quickly as possible
  • Isolate the effect of bugs as much as possible
  • Create and deploy fixes as quickly as possible.
A few mechanisms supporting such a design are:
  • Fast, easy and automated deployment (e.g. through Docker containers)
  • Continuous business activity monitoring (in order to see drops in customer satisfaction or conversion rates as quickly as possible)
  • Concurrent deployment of different versions of a service (allowing canary testing and A/B testing)
  • Semantic monitoring
  • …​

Setting up a resilient microservices architecture

Given the complexity of setting up such a resilient microservices architecture, one should avoid reinventing the wheel by building custom implementations of the failure-management mechanisms described above. Instead, banks and insurers should adopt a microservices framework and platform, to respectively build and run such an architecture. Such a product provides the mechanisms described above pre-packaged and glues the different microservices together in a fault-tolerant way.
When banks and insurers implement such a resilient architecture correctly, they will be able to create customer-centric financial services that can compete with the services offered by the big technology giants.
