
Building Resilience: Safeguarding Financial Services in the Digital Age


In July 2024, the world was shocked by a major IT incident triggered by the cybersecurity firm CrowdStrike. A faulty software update led to a global IT outage, affecting 8.5 million Windows devices and causing the cancellation of more than 5,000 commercial airline flights. The incident also impacted hospitals, media, and banks, resulting in an estimated $1.5 billion in losses. Dubbed the "largest IT outage in history", this event highlighted the vulnerabilities in our interconnected digital world.

While this particular incident grabbed global headlines due to its scale, massive IT outages are becoming increasingly common. As companies rely more on cloud services and many depend on the same IT software, the risk of widespread disruptions grows. Here are some notable examples of major incidents in recent years:

  • Atlassian Jira Outage (April 2022): Atlassian software (like Jira, Confluence and Jira Service Management) is used by many major corporations to manage their IT departments. A faulty deletion script removed data for several hundred customers. Due to the complexity of the issue, it took up to two weeks to fully restore service for all impacted accounts.

  • Amazon Web Services (AWS) Outages: AWS, the market leader in cloud services, has experienced significant outages almost yearly, affecting countless businesses due to its vast user base. Some examples of recent AWS outages can be found on "https://www.datacenterknowledge.com/outages/a-history-of-aws-cloud-and-data-center-outages".

  • Facebook (META) Outage (October 2021): A routine maintenance error caused a global shutdown of Facebook’s services for 6-7 hours. The incident was triggered by a simple command mistake during a capacity check.

  • Fastly Outage (June 8, 2021): A minor setting change by a customer activated a dormant bug, causing major websites like The New York Times and CNN to go dark for nearly an hour.

  • Google Outage (December 14, 2020): Google’s authentication system ran out of storage, leading to a 45-minute outage affecting Gmail, YouTube, and Google Drive.

These incidents demonstrate how interconnected modern organizations are and how heavily they depend on cloud services and internet infrastructure. They also show the profound impact that small human errors at these providers can have. Because of the scale of these providers, the economic losses from such outages are enormous.

Businesses should be aware of how heavily they depend on IT and how a single human error can have severe impacts. This risk is especially high in the financial services industry, which is not only critical for the worldwide economy but also heavily dependent on IT, and it should be mitigated. This is where resilience comes into play. Building a resilient organization and resilient systems is a must for every financial institution. Resilience must be implemented at all levels of the organization: business, operational, and technical.

This is done by identifying all points of failure and taking measures to minimize the impact of each failure. The difficulty, of course, is that the list of points of failure within a large organization like a bank is enormous. It is therefore essential to focus first on the points of failure with the highest risk value, calculated by multiplying the probability of a failure by the cost incurred when that failure occurs.
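The prioritization described above can be sketched in a few lines of Python. This is only an illustration: the `FailurePoint` class, the `rank_by_risk` helper, and all figures are hypothetical, and real probabilities and costs would have to come from the organization's own risk assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePoint:
    name: str
    probability: float  # assumed chance of failure per year, between 0 and 1
    cost: float         # assumed cost (in euros) if the failure occurs

    @property
    def risk_value(self) -> float:
        # Risk value as defined in the text: probability multiplied by cost.
        return self.probability * self.cost


def rank_by_risk(points):
    """Return the failure points sorted from highest to lowest risk value."""
    return sorted(points, key=lambda p: p.risk_value, reverse=True)


# Hypothetical figures, for illustration only.
failure_points = [
    FailurePoint("branch printer fleet", probability=0.50, cost=20_000),
    FailurePoint("core banking database", probability=0.02, cost=50_000_000),
    FailurePoint("payment gateway vendor", probability=0.10, cost=5_000_000),
]

for point in rank_by_risk(failure_points):
    print(f"{point.name}: {point.risk_value:,.0f}")
```

With these made-up numbers, the rarely failing core banking database still ranks first, because its cost of failure dwarfs that of the frequently failing printers.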

In IT, failures can be grouped into three broad categories depending on their origin:

  • Hardware failures: Physical malfunctions of hardware components.

  • Unintentional Human Errors: Mistakes made without bad intent, which caused most of the outages described above.

  • Malicious Errors: Failures caused deliberately, with bad intent. In recent years, multiple major outages were caused by cyber-attacks (usually ransomware attacks). Some examples include:

    • ExPetr / NotPetya (2017): Caused a total impact of $10 billion, hitting major companies like global shipping company Maersk and pharmaceutical giant Merck, as well as major governments.

    • WannaCry (2017): Caused a total impact of $4 billion, hitting major organizations like Spanish telecom company Telefónica and the UK’s National Health Service.

    • REvil/Sodinokibi (2019-2021): Impacted major companies like Kaseya, Travelex, and JBS Foods.

    • …​

The last two categories can be split further according to who introduces the error: employees, partners/vendors, customers, and/or external (unknown) parties. This shows there are many dimensions of failure against which a financial services company needs to protect itself.

A company should therefore use a variety of strategies which, combined, result in optimal resilience:

  • Resilient System Design:

    • Failure-Tolerant Systems: Systems should tolerate hardware failures and network issues and support recoverability, so that operations like restoring backups and relaunching processes can be carried out without negative impacts (e.g. via recycling and relaunch mechanisms and the use of idempotent services).

    • Self-Healing Systems: Design systems to quickly identify and recover from failures automatically, using mechanisms like load balancers, throttling, circuit breakers, timeouts, and elastic scalability. Such mechanisms help to avoid the ripple effect (i.e. they prevent a failure in one component from bringing down other components and systems) and ensure a quick return to normal operation.

    • For further details on building resilient systems, see my 2020 blog "Building resilient systems in the Financial Services industry" (https://bankloch.blogspot.com/2020/02/building-resilient-systems-in-financial.html).

  • Redundancy: Implement backup platforms running on different infrastructures (e.g. different operating systems, cloud providers, or data centres) to minimize the risk of simultaneous failures. Adopting a multi-cloud strategy for critical applications can help achieve this but may incur additional costs.

  • Robust Business Continuity Plans: Develop and regularly update procedures for major business continuity issues. These should include steps to minimize negative impacts when IT systems are down, such as switching to manual operations if necessary.

  • Continuous Monitoring: This includes continuous monitoring of all business events (business activity monitoring), technical monitoring, proactive detection of anomalies and continuous identification of bottlenecks. This allows a continuous evaluation of the state of the architecture and a proactive identification and analysis of potential issues. With immediate transparency on the impact of an issue (the business value at stake and which customers are affected), resources can be allocated to the issues with the most business impact and impacted customers can be proactively informed.

  • Continuous Testing and Gradual Deployments: Continuously test your software via automated tests. This includes functional tests, but also non-functional tests like performance and security tests, as well as failure tests, where specific failures are simulated to see if systems can properly handle those disruptions. Dedicated failure testing frameworks like Netflix’s Simian Army or Gremlin (Failure as a Service) can be used for this. Some of the leading technology companies even run such tests in production as the ultimate proof of their resilience. But even with those tests, unexpected issues are still likely to occur. Therefore, gradual deployment rollouts, like canary testing, A/B testing, blue/green or red/black deployments, are essential to limit the potential impact of such an issue.
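Two of the mechanisms mentioned in the list above, circuit breakers and idempotent services, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production implementation: the `CircuitBreaker` and `PaymentService` classes, their parameters, and the thresholds are all hypothetical, and a real system would typically rely on a proven library and persistent storage.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a failing dependency.
                raise CircuitOpenError("dependency unavailable, call rejected")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


class PaymentService:
    """Hypothetical service whose `execute` is idempotent per request ID."""

    def __init__(self):
        self.processed = {}  # request_id -> result of the first execution

    def execute(self, request_id, amount):
        # A retried or replayed request returns the stored result
        # instead of executing the payment a second time.
        if request_id not in self.processed:
            self.processed[request_id] = {"request_id": request_id, "amount": amount}
        return self.processed[request_id]


breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
service = PaymentService()
first = breaker.call(service.execute, "req-1", 100)
retry = breaker.call(service.execute, "req-1", 100)  # safe to repeat
print(first is retry, len(service.processed))
```

The breaker limits the ripple effect by rejecting calls to a dependency that keeps failing, while the idempotent `execute` makes it safe to relaunch or retry a request after a recovery.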

Organizations must invest in a comprehensive resilience strategy to guard against failures and minimize disruptions. Not only does this ensure a continuous, reliable service to customers, but regulators are also increasingly imposing resilience standards for critical financial applications. In Europe, the Digital Operational Resilience Act (cf. my blog "The Dawn of DORA: Building a Resilient Financial Infrastructure" - https://bankloch.blogspot.com/2024/05/the-dawn-of-dora-building-resilient.html) and the UK’s Operational Resilience Policy mandate stringent measures to ensure the stability of financial services.

In the end, failures are inevitable; it is simply a matter of time before they happen. Or to quote Werner Vogels, CTO of Amazon.com: "Everything fails all the time". The true differentiator is implementing the necessary measures to minimize their impact (design for failure).

For more insights, visit my blog at https://bankloch.blogspot.com 
