Building Resilience: Safeguarding Financial Services in the Digital Age

In July 2024, the world was shocked by a major IT incident caused by an update from the cybersecurity firm CrowdStrike. The faulty update led to a global IT outage, affecting 8.5 million Windows devices and causing the cancellation of more than 5,000 commercial airline flights. The incident also hit hospitals, media, and banks, resulting in an estimated $1.5 billion in losses. Dubbed the "largest IT outage in history", this event highlighted the vulnerabilities in our interconnected digital world.

While this particular incident grabbed global headlines due to its scale, massive IT outages are becoming increasingly common. As companies rely more on cloud services and many depend on the same IT software, the risk of widespread disruptions grows. Here are some notable examples of major incidents in recent years:

Atlassian Jira Outage (April 2022): Atlassian software (Jira, Confluence, Service Desk) is used by many major corporations to manage their IT departments. A faulty deletion script removed the data of several hundred customers. Due to the complexity of the issue, it took around two weeks to fully restore service for all impacted accounts.
Amazon Web Services (AWS) Outages: AWS, the market leader in cloud services, has experienced significant outages almost yearly, each affecting countless businesses because of its vast user base. Some examples of recent AWS outages can be found at https://www.datacenterknowledge.com/outages/a-history-of-aws-cloud-and-data-center-outages.
Facebook (Meta) Outage (October 2021): An error during routine maintenance caused a global shutdown of Facebook’s services for 6-7 hours. The incident was triggered by a single mistaken command issued during a capacity check.
Fastly Outage (June 8, 2021): A minor setting change by a customer activated a dormant bug, causing major websites like The New York Times and CNN to go dark for nearly an hour.
Google Outage (December 14, 2020): Google’s authentication system ran out of storage, leading to a 45-minute outage affecting Gmail, YouTube, and Google Drive.

These incidents demonstrate how interconnected modern organizations are and how heavily they depend on cloud services and internet infrastructure. They also show the profound impact that small human errors at these providers can have. Because of the scale at which these providers operate, the economic losses from such outages are enormous.

Businesses should be aware of how much they depend on IT and how severe the impact of a single human error can be. The financial services industry is both critical for the worldwide economy and heavily dependent on IT, so for this industry the risk is especially high and must be mitigated. This is where resilience comes into play. Building a resilient organization and resilient systems is a must-have for every financial institution, and this resilience must be implemented at all levels of the organization: business, operational, and technical.

Resilience is achieved by identifying all points of failure and taking measures to minimize the impact of each failure. The difficulty is that the list of points of failure within a large organization like a bank is enormous. It is therefore essential to focus first on the points of failure with the highest risk value, calculated by multiplying the probability of a failure by the cost incurred when that failure occurs.
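The prioritization described above can be sketched in a few lines of Python. This is only a minimal illustration: the `FailurePoint` class, the example failure points, and all probabilities and costs are hypothetical and not taken from any real risk assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePoint:
    name: str
    annual_probability: float  # estimated chance of failure per year (0..1)
    cost_per_failure: float    # estimated cost of a single failure

    @property
    def risk_value(self) -> float:
        # Risk value as described in the text: probability multiplied by cost.
        return self.annual_probability * self.cost_per_failure


def rank_by_risk(points):
    """Return the failure points sorted from highest to lowest risk value."""
    return sorted(points, key=lambda p: p.risk_value, reverse=True)


# Hypothetical figures, purely for illustration.
failure_points = [
    FailurePoint("Branch printer fleet", 0.60, 2_000),
    FailurePoint("Core banking database", 0.02, 5_000_000),
    FailurePoint("Payment gateway vendor", 0.10, 800_000),
]

for point in rank_by_risk(failure_points):
    print(f"{point.name}: {point.risk_value:,.0f}")
```

Note how a rare but very costly failure (the database) can outrank a frequent but cheap one (the printers), which is exactly why probability alone is not a good basis for prioritization.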

In IT, failures can be categorized into three large categories depending on their origin:

Hardware failures: Physical malfunctions of hardware components.
Unintentional Human Errors: Mistakes made without bad intent, like most of the outages described above.
Malicious Errors: Failures caused with bad intent. In recent years, multiple major outages were caused by cyber-attacks (usually ransomware attacks). Some examples include:
- ExPetr / NotPetya (2017): Resulted in a total impact of $10 billion, hitting major companies like global shipping company Maersk and pharmaceutical giant Merck, as well as major governments.
- WannaCry (2017): Resulted in a total impact of $4 billion, hitting major organizations like Spain’s mobile company Telefónica and the UK’s National Health Service.
- REvil/Sodinokibi (2019-2021): Impacted major companies like Kaseya, Travelex, and JBS Foods.
- …

The last two categories can further be split up into errors introduced by employees, partners/vendors, customers, and/or externals (unknowns). This shows there are many axes for failure, against which a financial services company needs to protect itself.

A company should therefore use a variety of strategies which, combined, result in optimal resilience:

Resilient System Design:
- Failure Tolerant Systems: Systems should tolerate hardware failures and network issues, support recoverability, and ensure that operations like restoring backups and relaunching processes can be performed without negative impacts (e.g. via recycling and relaunch mechanisms and the use of idempotent services).
- Self-Healing Systems: Design systems to quickly identify and recover from failures automatically, using mechanisms like load balancers, throttling, circuit breakers, timeouts, and elastic scalability. Such mechanisms help to avoid the ripple effect (i.e. they prevent a failure in one component from bringing down other components and systems) and ensure a quick return to normal.
- For further details on building resilient systems, see the blog "Building resilient systems in the Financial Services industry" (https://bankloch.blogspot.com/2020/02/building-resilient-systems-in-financial.html), which I wrote four years ago.
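To make one of the self-healing mechanisms mentioned above more concrete, here is a minimal sketch of a circuit breaker in Python. It is an illustration under simplifying assumptions (single-threaded, in-memory state), not a production implementation; the class name, the parameters `max_failures` and `reset_timeout`, and the injectable `clock` are all choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # failures tolerated before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a failing dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a remote dependency, for example `breaker.call(fetch_balance, account_id)` with a hypothetical `fetch_balance` function, and treat `CircuitOpenError` as a signal to fall back to a cached value or a degraded response. This is how the ripple effect is contained: callers stop waiting on a component that is known to be down.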
Redundancy: Implement backup platforms running on different infrastructures (e.g. different operating systems, cloud providers, or data centres) to minimize the risk of simultaneous failures. Adopting a multi-cloud strategy for critical applications can help achieve this but may incur additional costs.
Robust Business Continuity Plans: Develop and regularly update procedures for major business continuity issues. These should include steps to minimize negative impacts when IT systems are down, such as switching to manual operations if necessary.
Continuous Monitoring: Continuously monitor all business events (business activity monitoring) and technical metrics, detect anomalies proactively, and keep identifying bottlenecks. This allows the state of the architecture to be evaluated continuously and potential issues to be identified and analyzed before they escalate. With immediate transparency on the impact of an issue (the business value at stake and the customers affected), resources can be allocated to the issues with the most business impact and impacted customers can be informed proactively.
Continuous Testing and Gradual Deployments: Continuously test your software via automated tests. This includes functional tests, but also non-functional tests like performance and security tests, as well as failure tests, where specific failures are simulated to see if systems can properly handle those disruptions. Special failure testing frameworks like Netflix’s Simian Army or Gremlin (Failure as a Service) can be used for this, which some of the leading technology companies even execute in production as the ultimate proof of their resilience. But even with those tests, unexpected issues are likely to still occur. Therefore, gradual deployment rollouts, like canary testing, A/B testing, blue/green or red/black deployments are essential to limit the potential impact of such an issue.
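The idea behind a gradual rollout can be sketched as a simple gate that only widens exposure while the new version stays healthy. This is a minimal illustration, not how any particular deployment tool works: the function `canary_rollout`, the traffic steps, the error threshold, and the `get_error_rate` hook are all hypothetical.

```python
def canary_rollout(get_error_rate, steps=(1, 5, 25, 50, 100), max_error_rate=0.01):
    """Increase the share of traffic on the new version step by step.

    get_error_rate(percent) is a hypothetical hook: it is assumed to route
    `percent` of the traffic to the new version, wait for an observation
    window, and return the observed error rate (0..1) of the new version.

    Returns (completed, last_healthy_percent).
    """
    healthy_percent = 0
    for percent in steps:
        error_rate = get_error_rate(percent)
        if error_rate > max_error_rate:
            # Stop here: at most `percent` of the traffic was ever exposed.
            return False, healthy_percent
        healthy_percent = percent
    return True, healthy_percent
```

If the hook reports a healthy error rate at 1% and 5% of the traffic but a spike at 25%, the function returns `(False, 5)`: the rollout stops and the caller can route traffic back to the last healthy share, so most customers never see the faulty version.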

Organizations must invest in a holistic resilience strategy to guard against failures and minimize disruptions. Not only does this ensure a continuous, reliable service to customers, but regulators are also increasingly imposing resilience standards for critical financial applications. In Europe, the Digital Operational Resilience Act (cf. my blog "The Dawn of DORA: Building a Resilient Financial Infrastructure" - https://bankloch.blogspot.com/2024/05/the-dawn-of-dora-building-resilient.html) and the UK’s Operational Resilience Policy mandate stringent measures to ensure the stability of financial services.

In the end, failures are inevitable; it is simply a matter of time before they happen. Or to quote Werner Vogels, CTO of Amazon.com: "Everything fails all the time". The true differentiator is implementing the measures needed to minimize their impact (design for failure).

For more insights, visit my blog at https://bankloch.blogspot.com
