We have become virtually "zero tolerant" of digital failures. Customers expect seamless, 24/7 always-on services, regardless of maintenance windows (like backups or software upgrades), peak loads, or even disasters.
For financial institutions, the stakes are even higher: downtime doesn’t just disrupt transactions. It erodes trust, breaches compliance, and damages reputation. Achieving near-zero downtime has become a non-negotiable requirement, driving IT teams to implement increasingly sophisticated strategies to ensure availability, protect data integrity, and reduce recovery times.
At the same time, regulatory frameworks like DORA (Digital Operational Resilience Act) in the EU, the Operational Resilience Framework (CP29/19) in the UK, and the Hong Kong Supervisory Policy Manual OR-2 now mandate digital resilience in the financial sector. Compliance is not just about avoiding penalties; it is about securing trust and ensuring operational continuity.
To meet these demands, a wide range of IT techniques is deployed to enable high availability. However, this significantly increases system complexity, particularly in non-functional testing.
At a high level, these approaches can be grouped into four key categories:
1. Prevent Failures by Design
Design, build, test, and deploy software in a way that minimizes bugs and failures. This includes:
Adherence to coding guidelines, automated code quality checks, and peer reviews
Comprehensive automated testing (illustrated in the sketch after this list)
Automated deployment pipelines using the same packages across environments, ideally containerized (e.g., Docker)
Using programming languages and frameworks that enforce strict standards (e.g., strong typing, default initialization, zero-division handling)
Choosing mature libraries and open-source components validated by large communities
Designing for modularity to encapsulate issues and enable rapid redeployment of individual modules when needed
…
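To make the testing and strict-typing points a bit more tangible, here is a minimal sketch in Python: a small, explicitly typed function together with an automated test that a CI pipeline could run on every commit. The function, its validation rules, and the values are purely illustrative.

```python
# A hypothetical, strictly typed domain function plus an automated test.
# Type hints can be checked statically (e.g. with mypy) before deployment.
from decimal import Decimal


def apply_fee(amount: Decimal, fee_rate: Decimal) -> Decimal:
    """Return the amount after deducting a proportional fee."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if not Decimal("0") <= fee_rate <= Decimal("1"):
        raise ValueError("fee_rate must be between 0 and 1")
    return amount - (amount * fee_rate)


def test_apply_fee() -> None:
    # Run automatically in CI, e.g. with pytest.
    assert apply_fee(Decimal("100.00"), Decimal("0.01")) == Decimal("99.00")
    try:
        apply_fee(Decimal("-1"), Decimal("0.01"))
    except ValueError:
        pass
    else:
        raise AssertionError("negative amounts must be rejected")
```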
2. Build for Resilience
Systems must be able to limit the impact of bugs or failures through isolation (to prevent cascading effects) and self-healing (to recover automatically). Tactics include:
Load balancers
Circuit breaker patterns (see the retry and circuit breaker sketch after this list)
Timeouts, throttling, and bulkheads (resource isolation)
Elastic scalability
Graceful degradation (e.g. serving cached or canned responses, switching to read-only mode…)
Idempotent service design
Retry/recycling logic
Delta-based data handling (to fix past errors and replay only valid operations)
Automatic failover and health checks
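To make two of these tactics concrete, below is a minimal sketch combining retry logic with exponential backoff and a simple circuit breaker around a flaky downstream call. The thresholds, timeouts, and wrapped call are illustrative assumptions rather than a prescribed implementation.

```python
# Minimal sketch: retry with exponential backoff, guarded by a circuit breaker.
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Tracks consecutive failures and fails fast while the circuit is open."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        # Circuit is closed, or has been open long enough for a trial ("half-open") call.
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def call_with_retry(call: Callable[[], object], breaker: CircuitBreaker,
                    max_attempts: int = 3) -> object:
    """Retry a flaky call with exponential backoff, respecting the breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow_request():
            raise RuntimeError("circuit open: failing fast")
        try:
            result = call()                  # e.g. a downstream payment service
            breaker.record_success()
            return result
        except TimeoutError:
            breaker.record_failure()
            time.sleep(2 ** attempt)         # backoff: 1s, 2s, 4s, ...
    raise RuntimeError("downstream service unavailable after retries")
```

Note that retries only stay safe when the wrapped operation is idempotent, which is why idempotent service design appears in the same list.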
These systems should be tested in production via chaos engineering, the practice of intentionally injecting failures to reveal weaknesses before they cause real harm.
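As a flavour of what such an experiment can look like, the toy sketch below injects artificial latency and errors into a wrapped call with a small probability. The probabilities and the wrapped call are purely illustrative; real chaos experiments are run with dedicated tooling, a controlled blast radius, and close monitoring.

```python
# Toy fault-injection wrapper in the spirit of chaos engineering.
# Probabilities and the wrapped call are illustrative assumptions.
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_chaos(call: Callable[[], T], latency_prob: float = 0.1,
               error_prob: float = 0.05) -> T:
    if random.random() < latency_prob:
        time.sleep(2.0)                                     # inject artificial latency
    if random.random() < error_prob:
        raise TimeoutError("injected failure (chaos experiment)")
    return call()
```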
3. Redundancy Across All Layers
Redundancy means deploying backup platforms across different infrastructures (e.g. cloud providers, operating systems, data centers) to reduce the risk of simultaneous failures. Key principles include:
Redundancy in compute (stateless components) and storage (stateful components)
Multi-node clusters for both scalability and fault tolerance
When significant synchronization is required between different nodes (e.g., communication between stateless and stateful components), low latency between those nodes is essential. Ideally, these nodes reside within the same data center or in geographically close data centers; in that case, they can form a stretched cluster, enabling synchronous operations. When latency is too high (typically due to geographical distance), the secondary location is instead referred to as a Disaster Recovery (DR) site, which is not synchronized in real time but operates in asynchronous mode.
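The contrast can be illustrated with a toy sketch: a synchronous write only returns once the replica holds the data (RPO of zero, but latency-sensitive), while an asynchronous write returns immediately and lets the replica catch up in the background (lower latency, but a non-zero RPO). The in-memory stores below are of course just stand-ins for real storage systems.

```python
# Toy contrast between synchronous and asynchronous replication.
import queue
import threading

primary = {}                       # stand-in for the primary data store
replica = {}                       # stand-in for the secondary / DR data store
replication_log = queue.Queue()


def write_synchronously(key: str, value: str) -> None:
    """Caller only returns once the replica holds the data -> RPO of zero."""
    primary[key] = value
    replica[key] = value           # waits for the replica "ack" before returning


def write_asynchronously(key: str, value: str) -> None:
    """Caller returns immediately; the replica catches up later -> RPO > 0."""
    primary[key] = value
    replication_log.put((key, value))


def replication_worker() -> None:
    while True:
        key, value = replication_log.get()
        replica[key] = value       # applied with some lag
        replication_log.task_done()


threading.Thread(target=replication_worker, daemon=True).start()
```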
Various redundancy strategies can be applied depending on architectural and business needs:
Redundancy within a single data center or across Availability Zones (Multi-AZ), where an AZ is a physically distinct data center (or group of centers) within a region. For higher availability and fault tolerance, organizations can opt for multi-region or even multi-cloud setups.
Redundancy levels may vary per component (stateless or stateful) and can range from a single replica to multiple copies, depending on criticality. Standby tiers can include hot, warm, cold, or glacier (archival) storage.
Redundancy configurations include:
Active/Passive: Only the primary infrastructure is active under normal conditions. The passive environment can vary in readiness, from near-instant failover to requiring manual activation or even full hardware spin-up. As only one side is active, unidirectional synchronization is typically sufficient (see the failover sketch after this list).
Active/Active: Both infrastructures are fully operational and synchronized bidirectionally. This setup is more complex but offers better fault tolerance, higher availability, and more efficient use of resources.
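A common building block behind both configurations is an automated health check that promotes the standby side when the primary stops responding. The sketch below illustrates this for an active/passive pair; the endpoints, thresholds, and intervals are illustrative assumptions.

```python
# Minimal active/passive failover loop driven by HTTP health checks.
# Endpoints, thresholds and intervals are hypothetical.
import time
import urllib.request

PRIMARY = "https://primary.example.internal/health"    # hypothetical endpoint
PASSIVE = "https://passive.example.internal/health"    # hypothetical endpoint
FAILURE_THRESHOLD = 3


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False


def monitor_and_failover() -> None:
    active, standby = PRIMARY, PASSIVE
    consecutive_failures = 0
    while True:
        if is_healthy(active):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD and is_healthy(standby):
                active, standby = standby, active       # promote the passive side
                consecutive_failures = 0
        time.sleep(10)                                  # health-check interval
```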
4. Enable Rapid Incident Response
When automation fails, manual intervention must be quick, safe, and effective. This requires:
Continuous monitoring of business and technical events, with anomaly detection and alerting. This includes transitioning from basic monitoring to full observability, enabling teams to understand why a failure occurred through logs, metrics, and traces and leveraging AI/ML for predictive analytics and early anomaly detection
Clear error messages and accessible logs for rapid root-cause analysis
Well-documented Business Continuity Plans (BCPs), including fallbacks like manual processing.
Strong incident response teams and blameless postmortems to improve resilience over time.
Tooling to support manual fixes, such as load shedding by filtering non-critical requests (see the sketch after this list), rollbacks and state restoration, isolated fix deployments, service/module disabling, (bulk) manual data corrections, process restarts…
All manual actions should be auditable and reversible wherever possible
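As an example of such tooling, load shedding can be as simple as rejecting non-critical requests while the system is under pressure, as in the sketch below. The priority header and the load signal are hypothetical assumptions; in practice the decision would be driven by real service metrics.

```python
# Minimal load-shedding check: under high load, only critical requests pass.
# The priority header name and the load signal are hypothetical.
import os


def current_load() -> float:
    # 1-minute load average normalised by CPU count (Unix-only stand-in
    # for a real saturation metric such as queue depth or p99 latency).
    return os.getloadavg()[0] / (os.cpu_count() or 1)


def should_accept(headers: dict, shed_threshold: float = 0.8) -> bool:
    if current_load() < shed_threshold:
        return True
    # Under pressure: shed everything except explicitly critical traffic.
    return headers.get("X-Request-Priority", "normal") == "critical"
```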
For a deeper dive, see two of my previous blogs:
Building resilient systems in the Financial Services industry (https://bankloch.blogspot.com/2020/02/building-resilient-systems-in-financial.html)
Building Resilience: Safeguarding Financial Services in the Digital Age (https://bankloch.blogspot.com/2024/08/building-resilience-safeguarding.html)
Ultimately, resilient systems aim to optimize three critical dimensions:
Availability – "How often are we down?"
Measured as a percentage of uptime, availability is often expressed in "nines" (e.g., 99.999% = five nines).
Achieving high availability requires eliminating single points of failure, deploying multi-AZ or multi-region architectures, and conducting regular failure scenario testing (e.g. chaos engineering).
The cost of increasing availability grows exponentially with each additional “nine.”
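To make those "nines" tangible, the short calculation below converts an availability percentage into the maximum allowed downtime per year: 99.9% allows roughly 8.8 hours, 99.99% roughly 53 minutes, and 99.999% (five nines) only about 5 minutes.

```python
# Convert an availability percentage into maximum allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes


def max_downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {max_downtime_minutes_per_year(nines):.1f} min/year")
# 99.9%   -> 525.6 min/year (~8.8 hours)
# 99.99%  ->  52.6 min/year
# 99.999% ->   5.3 min/year
```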
RTO (Recovery Time Objective) – "How long are we down?"
Defines the maximum acceptable downtime after an incident.
For example, an RTO of 15 minutes means the system must be restored within 15 minutes of failure.
Shorter RTOs demand faster recovery processes, automation, and high operational maturity.
RPO (Recovery Point Objective) – "How much data can we afford to lose?"
Determines the maximum tolerable amount of data loss, measured in time.
For instance, an RPO of 5 minutes means at most 5 minutes of data can be lost during a failure.
Achieving an RPO of zero typically requires synchronous replication, which introduces higher infrastructure demands and latency overhead.
As a rule of thumb: a low RPO is expensive, a low RTO is complex, and achieving both is exceptionally difficult and costly.
In today’s hyperconnected and high-stakes digital environment, resilience is no longer a luxury; it is a necessity. Whether through robust system design, intelligent redundancy, or agile incident response, financial institutions must continuously evolve their strategies to protect availability, data integrity, and customer trust. But resilience comes with a price. The key is to strike the right balance between technical ambition and pragmatic investment, driven by business value, regulatory pressure, and a relentless focus on operational excellence.
