We have become virtually "zero tolerant" of digital failures. Customers expect seamless, 24/7 always-on services, regardless of maintenance windows (like backups or software upgrades), peak loads, or even disasters.
For financial institutions, the stakes are even higher: downtime doesn’t just disrupt transactions. It erodes trust, breaches compliance, and damages reputation. Achieving near-zero downtime has become a non-negotiable requirement, driving IT teams to implement increasingly sophisticated strategies to ensure availability, protect data integrity, and reduce recovery times.
At the same time, regulatory frameworks like DORA (Digital Operational Resilience Act) in the EU, the Operational Resilience Framework (CP29/19) in the UK, and the Hong Kong Supervisory Policy Manual OR-2 now mandate digital resilience in the financial sector. Compliance is not just about avoiding penalties; it is about securing trust and ensuring operational continuity.
To meet these demands, a wide range of IT techniques is deployed to enable high availability. However, this significantly increases system complexity, particularly in non-functional testing.
At a high level, these approaches can be grouped into four key categories:
1. Prevent Failures by Design
Design, build, test, and deploy software in a way that minimizes bugs and failures. This includes:
Adherence to coding guidelines, automated code quality checks, and peer reviews
Comprehensive automated testing (illustrated in the sketch after this list)
Automated deployment pipelines using the same packages across environments, ideally containerized (e.g., Docker)
Using programming languages and frameworks that enforce strict standards (e.g., strong typing, default initialization, zero-division handling)
Choosing mature libraries and open-source components validated by large communities
Designing for modularity to encapsulate issues and enable rapid redeployment of individual modules when needed
…
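To make the testing and strict-typing points a bit more tangible, here is a minimal sketch in Python: a small, explicitly typed function together with an automated test that a CI pipeline could run on every commit. The function, its validation rules, and the values are purely illustrative.

```python
# A hypothetical, strictly typed domain function plus an automated test.
# Type hints can be checked statically (e.g. with mypy) before deployment.
from decimal import Decimal


def apply_fee(amount: Decimal, fee_rate: Decimal) -> Decimal:
    """Return the amount after deducting a proportional fee."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if not Decimal("0") <= fee_rate <= Decimal("1"):
        raise ValueError("fee_rate must be between 0 and 1")
    return amount - (amount * fee_rate)


def test_apply_fee() -> None:
    # Run automatically in CI, e.g. with pytest.
    assert apply_fee(Decimal("100.00"), Decimal("0.01")) == Decimal("99.00")
    try:
        apply_fee(Decimal("-1"), Decimal("0.01"))
    except ValueError:
        pass
    else:
        raise AssertionError("negative amounts must be rejected")
```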
2. Build for Resilience
Systems must be able to limit the impact of bugs or failures through isolation (to prevent cascading effects) and self-healing (to recover automatically). Tactics include:
Load balancers
Circuit breaker patterns (see the retry and circuit breaker sketch after this list)
Timeouts, throttling, and bulkheads (resource isolation)
Elastic scalability
Graceful degradation (e.g. serving cached or canned responses, switching to read-only mode…)
Idempotent service design
Retry/recycling logic
Delta-based data handling (to fix past errors and replay only valid operations)
Automatic failover and health checks
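To make two of these tactics concrete, below is a minimal sketch combining retry logic with exponential backoff and a simple circuit breaker around a flaky downstream call. The thresholds, timeouts, and wrapped call are illustrative assumptions rather than a prescribed implementation.

```python
# Minimal sketch: retry with exponential backoff, guarded by a circuit breaker.
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Tracks consecutive failures and fails fast while the circuit is open."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        # Circuit is closed, or has been open long enough for a trial ("half-open") call.
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def call_with_retry(call: Callable[[], object], breaker: CircuitBreaker,
                    max_attempts: int = 3) -> object:
    """Retry a flaky call with exponential backoff, respecting the breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow_request():
            raise RuntimeError("circuit open: failing fast")
        try:
            result = call()                  # e.g. a downstream payment service
            breaker.record_success()
            return result
        except TimeoutError:
            breaker.record_failure()
            time.sleep(2 ** attempt)         # backoff: 1s, 2s, 4s, ...
    raise RuntimeError("downstream service unavailable after retries")
```

Note that retries only stay safe when the wrapped operation is idempotent, which is why idempotent service design appears in the same list.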
These systems should be tested in production via chaos engineering, the practice of intentionally injecting failures to reveal weaknesses before they cause real harm.
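As a flavour of what such an experiment can look like, the toy sketch below injects artificial latency and errors into a wrapped call with a small probability. The probabilities and the wrapped call are purely illustrative; real chaos experiments are run with dedicated tooling, a controlled blast radius, and close monitoring.

```python
# Toy fault-injection wrapper in the spirit of chaos engineering.
# Probabilities and the wrapped call are illustrative assumptions.
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_chaos(call: Callable[[], T], latency_prob: float = 0.1,
               error_prob: float = 0.05) -> T:
    if random.random() < latency_prob:
        time.sleep(2.0)                                     # inject artificial latency
    if random.random() < error_prob:
        raise TimeoutError("injected failure (chaos experiment)")
    return call()
```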
3. Redundancy Across All Layers
Redundancy means deploying backup platforms across different infrastructures (e.g. cloud providers, operating systems, data centers) to reduce the risk of simultaneous failures. Key principles include:
Redundancy in compute (stateless components) and storage (stateful components)
Multi-node clusters for both scalability and fault tolerance
When significant synchronization is required between different nodes (e.g., communication between stateless and stateful components), low latency between those nodes is essential. Ideally, these nodes reside within the same data center or in geographically close data centers; in that case, they can form a stretched cluster, enabling synchronous operations. When latency is too high (typically due to geographical distance), the secondary location is instead referred to as a Disaster Recovery (DR) site, which is not synchronized in real time but operates in asynchronous mode.
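The contrast can be illustrated with a toy sketch: a synchronous write only returns once the replica holds the data (RPO of zero, but latency-sensitive), while an asynchronous write returns immediately and lets the replica catch up in the background (lower latency, but a non-zero RPO). The in-memory stores below are of course just stand-ins for real storage systems.

```python
# Toy contrast between synchronous and asynchronous replication.
import queue
import threading

primary = {}                       # stand-in for the primary data store
replica = {}                       # stand-in for the secondary / DR data store
replication_log = queue.Queue()


def write_synchronously(key: str, value: str) -> None:
    """Caller only returns once the replica holds the data -> RPO of zero."""
    primary[key] = value
    replica[key] = value           # waits for the replica "ack" before returning


def write_asynchronously(key: str, value: str) -> None:
    """Caller returns immediately; the replica catches up later -> RPO > 0."""
    primary[key] = value
    replication_log.put((key, value))


def replication_worker() -> None:
    while True:
        key, value = replication_log.get()
        replica[key] = value       # applied with some lag
        replication_log.task_done()


threading.Thread(target=replication_worker, daemon=True).start()
```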
Various redundancy strategies can be applied depending on architectural and business needs:
Redundancy within a single data center or across Availability Zones (Multi-AZ), where an AZ is a physically distinct data center (or group of centers) within a region. For higher availability and fault tolerance, organizations can opt for multi-region or even multi-cloud setups.
Redundancy levels may vary per component (stateless or stateful) and can range from a single replica to multiple copies, depending on criticality. Standby tiers can include hot, warm, cold, or glacier (archival) storage.
Redundancy configurations include:
Active/Passive: Only the primary infrastructure is active under normal conditions. The passive environment can vary in readiness, from near-instant failover to requiring manual activation or even full hardware spin-up. As only one side is active, unidirectional synchronization is typically sufficient (see the failover sketch after this list).
Active/Active: Both infrastructures are fully operational and synchronized bidirectionally. This setup is more complex but offers better fault tolerance, higher availability, and more efficient use of resources.
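A common building block behind both configurations is an automated health check that promotes the standby side when the primary stops responding. The sketch below illustrates this for an active/passive pair; the endpoints, thresholds, and intervals are illustrative assumptions.

```python
# Minimal active/passive failover loop driven by HTTP health checks.
# Endpoints, thresholds and intervals are hypothetical.
import time
import urllib.request

PRIMARY = "https://primary.example.internal/health"    # hypothetical endpoint
PASSIVE = "https://passive.example.internal/health"    # hypothetical endpoint
FAILURE_THRESHOLD = 3


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False


def monitor_and_failover() -> None:
    active, standby = PRIMARY, PASSIVE
    consecutive_failures = 0
    while True:
        if is_healthy(active):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD and is_healthy(standby):
                active, standby = standby, active       # promote the passive side
                consecutive_failures = 0
        time.sleep(10)                                  # health-check interval
```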
4. Enable Rapid Incident Response
When automation fails, manual intervention must be quick, safe, and effective. This requires:
Continuous monitoring of business and technical events, with anomaly detection and alerting. This includes transitioning from basic monitoring to full observability, enabling teams to understand why a failure occurred through logs, metrics, and traces and leveraging AI/ML for predictive analytics and early anomaly detection
Clear error messages and accessible logs for rapid root-cause analysis
Well-documented Business Continuity Plans (BCPs), including fallbacks like manual processing.
Strong incident response teams and blameless postmortems to improve resilience over time.
Tooling to support manual fixes, such as load shedding by filtering non-critical requests (see the sketch after this list), rollbacks and state restoration, isolated fix deployments, service/module disabling, (bulk) manual data corrections, process restarts…
All manual actions should be auditable and reversible wherever possible
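As an example of such tooling, load shedding can be as simple as rejecting non-critical requests while the system is under pressure, as in the sketch below. The priority header and the load signal are hypothetical assumptions; in practice the decision would be driven by real service metrics.

```python
# Minimal load-shedding check: under high load, only critical requests pass.
# The priority header name and the load signal are hypothetical.
import os


def current_load() -> float:
    # 1-minute load average normalised by CPU count (Unix-only stand-in
    # for a real saturation metric such as queue depth or p99 latency).
    return os.getloadavg()[0] / (os.cpu_count() or 1)


def should_accept(headers: dict, shed_threshold: float = 0.8) -> bool:
    if current_load() < shed_threshold:
        return True
    # Under pressure: shed everything except explicitly critical traffic.
    return headers.get("X-Request-Priority", "normal") == "critical"
```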
For a deeper dive, see two of my previous blogs:
Building resilient systems in the Financial Services industry (https://bankloch.blogspot.com/2020/02/building-resilient-systems-in-financial.html)
Building Resilience: Safeguarding Financial Services in the Digital Age (https://bankloch.blogspot.com/2024/08/building-resilience-safeguarding.html)
Ultimately, resilient systems aim to optimize three critical dimensions:
Availability – "How often are we down?"
Measured as a percentage of uptime, availability is often expressed in "nines" (e.g., 99.999% = five nines).
Achieving high availability requires eliminating single points of failure, deploying multi-AZ or multi-region architectures, and conducting regular failure scenario testing (e.g. chaos engineering).
The cost of increasing availability grows exponentially with each additional “nine.”
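To make those "nines" tangible, the short calculation below converts an availability percentage into the maximum allowed downtime per year: 99.9% allows roughly 8.8 hours, 99.99% roughly 53 minutes, and 99.999% (five nines) only about 5 minutes.

```python
# Convert an availability percentage into maximum allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes


def max_downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {max_downtime_minutes_per_year(nines):.1f} min/year")
# 99.9%   -> 525.6 min/year (~8.8 hours)
# 99.99%  ->  52.6 min/year
# 99.999% ->   5.3 min/year
```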
RTO (Recovery Time Objective) – "How long are we down?"
Defines the maximum acceptable downtime after an incident.
For example, an RTO of 15 minutes means the system must be restored within 15 minutes of failure.
Shorter RTOs demand faster recovery processes, automation, and high operational maturity.
RPO (Recovery Point Objective) – "How much data can we afford to lose?"
Determines the maximum tolerable amount of data loss, measured in time.
For instance, an RPO of 5 minutes means at most 5 minutes of data can be lost during a failure.
Achieving an RPO of zero typically requires synchronous replication, which introduces higher infrastructure demands and latency overhead.
As a rule of thumb: a low RPO is expensive, a low RTO is complex, and achieving both is exceptionally difficult and costly.
In today’s hyperconnected and high-stakes digital environment, resilience is no longer a luxury; it is a necessity. Whether through robust system design, intelligent redundancy, or agile incident response, financial institutions must continuously evolve their strategies to protect availability, data integrity, and customer trust. But resilience comes with a price. The key is to strike the right balance between technical ambition and pragmatic investment, driven by business value, regulatory pressure, and a relentless focus on operational excellence.
