
Building Resilience: Safeguarding Financial Services in the Digital Age


In July 2024, the world was shocked by a major IT incident caused by a software update from the cybersecurity firm CrowdStrike. The faulty update triggered a global IT outage, affecting 8.5 million Windows devices and causing the cancellation of more than 5,000 commercial airline flights. The incident also hit hospitals, media, and banks, resulting in an estimated $1.5 billion in losses. Dubbed the "largest IT outage in history", this event highlighted the vulnerabilities in our interconnected digital world.

While this particular incident grabbed global headlines due to its scale, massive IT outages are becoming increasingly common. As companies rely more on cloud services and many depend on the same IT software, the risk of widespread disruptions grows. Here are some notable examples of major incidents in recent years:

  • Atlassian JIRA Outage (April 2022): Atlassian’s products (like JIRA, Confluence and ServiceDesk) are used by many major corporations to manage their IT departments. A faulty deletion script removed the data of several hundred customers. Due to the complexity of the issue, it took up to two weeks to fully restore service for all impacted accounts.

  • Amazon Web Services (AWS) Outages: AWS, the market leader in cloud services, has experienced significant outages almost yearly. Because of its vast user base, each of them affects countless businesses. Some examples of recent AWS outages can be found at https://www.datacenterknowledge.com/outages/a-history-of-aws-cloud-and-data-center-outages.

  • Facebook (META) Outage (October 2021): A mistaken command, issued during routine maintenance to check network capacity, caused a global shutdown of Facebook’s services for 6-7 hours.

  • Fastly Outage (June 8, 2021): A minor setting change by a customer activated a dormant bug, causing major websites like The New York Times and CNN to go dark for nearly an hour.

  • Google Outage (December 14, 2020): Google’s authentication system ran out of storage, leading to a 45-minute outage affecting Gmail, YouTube, and Google Drive.

These incidents demonstrate how interconnected modern organizations are and how dependent they have become on cloud services and internet infrastructure. They also show the profound impact that small human errors at these providers can have. Because of the scale of these providers, the economic losses from such outages are enormous.

Businesses should be aware of how heavily they depend on IT and how severe the impact of a single human error can be. This risk is particularly high in the financial services industry, which is both critical to the global economy and heavily dependent on IT, and it should be mitigated. This is where resilience comes into play. Building a resilient organization and resilient systems is a must for every financial institution, and it must be done at all levels: business, operational, and technical.

This is done by identifying all points of failure and taking measures to ensure the impact of each failure is minimized. The complexity lies, of course, in the fact that this list of points of failure within a large organization like a bank is enormous. It is therefore essential to focus first on the points of failure that have the highest risk value, calculated by multiplying the probability of failure by the cost when such a failure occurs.
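The prioritization described above can be sketched in a few lines of Python. This is only an illustration: the `FailurePoint` class, the `rank_by_risk` helper, and all probabilities and costs below are hypothetical and not taken from any real risk register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePoint:
    """A hypothetical point of failure with rough, illustrative estimates."""
    name: str
    probability: float  # estimated probability of failure per year (0..1)
    cost: float         # estimated cost if the failure occurs

    @property
    def risk_value(self) -> float:
        # Risk value = probability of failure x cost of that failure.
        return self.probability * self.cost


def rank_by_risk(points):
    """Return the failure points sorted from highest to lowest risk value."""
    return sorted(points, key=lambda p: p.risk_value, reverse=True)


# Made-up example figures, for illustration only.
failure_points = [
    FailurePoint("payment gateway outage", 0.05, 2_000_000),
    FailurePoint("core banking database corruption", 0.01, 20_000_000),
    FailurePoint("branch printer failure", 0.60, 1_000),
]

ranked = rank_by_risk(failure_points)
for point in ranked:
    print(f"{point.name}: {point.risk_value:,.0f}")
```

Note how the rare but very costly failure ends up at the top of the list, while the frequent but cheap one ends up at the bottom. Ranking on probability or cost alone would hide this.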

In IT, failures can be grouped into three broad categories depending on their origin:

  • Hardware failures: Physical malfunctions of hardware components.

  • Unintentional Human Errors: Mistakes made without bad intent, as in most of the outages described above.

  • Malicious Errors: Failures invoked with bad intent. In recent years, multiple major outages were caused by cyber-attacks (usually ransomware attacks). Some examples include:

    • ExPetr / NotPetya (2017): Caused an estimated $10 billion in total damage and hit major companies like global shipping company Maersk and pharmaceutical giant Merck, as well as major governments.

    • WannaCry (2017): Caused an estimated $4 billion in total damage and hit major organizations like Spain’s mobile company Telefónica and the UK’s National Health Service.

    • REvil/Sodinokibi (2019-2021): Impacted major companies like Kaseya, Travelex, and JBS Foods.

    • …​

The last two categories can be split further by who introduced the error: employees, partners/vendors, customers, and/or external (unknown) parties. This shows that there are many axes of failure against which a financial services company needs to protect itself.

A company should therefore use a variety of strategies which, combined, result in optimal resilience:

  • Resilient System Design:

    • Failure Tolerant Systems: Systems should tolerate hardware failures and network issues, support recoverability, and allow operations like restoring backups and relaunching processes to run without negative impacts (e.g. via recycling and relaunch mechanisms and the use of idempotent services).

    • Self-Healing Systems: Design systems to quickly identify and recover from failures automatically, using mechanisms like load balancers, throttling, circuit breakers, timeouts, and elastic scalability. Such mechanisms help to avoid the ripple effect (i.e. they prevent a failure in one component from bringing down other components and systems) and ensure a quick return to normal operation.

    • For further details on building resilient systems, you can refer to the blog "Building resilient systems in the Financial Services industry" (https://bankloch.blogspot.com/2020/02/building-resilient-systems-in-financial.html) I wrote four years ago.

  • Redundancy: Implement backup platforms running on different infrastructures (e.g. different operating systems, cloud providers, or data centres) to minimize the risk of simultaneous failures. Adopting a multi-cloud strategy for critical applications can help achieve this but may incur additional costs.

  • Robust Business Continuity Plans: Develop and regularly update procedures for major business continuity issues. These should include steps to minimize negative impacts when IT systems are down, such as switching to manual operations if necessary.

  • Continuous Monitoring: Monitor all business events (business activity monitoring) as well as the technical state of the systems, proactively detect anomalies, and continuously identify bottlenecks. This allows the state of the architecture to be evaluated continuously and potential issues to be identified and analysed proactively. With immediate transparency on the impact of an issue (the business value at stake and the customers affected), resources can be allocated to the issues with the most business impact and impacted customers can be informed proactively.

  • Continuous Testing and Gradual Deployments: Continuously test your software via automated tests. This includes functional tests, but also non-functional tests like performance and security tests, as well as failure tests, in which specific failures are simulated to verify that systems handle those disruptions properly. Dedicated failure testing tools like Netflix’s Simian Army or Gremlin (Failure as a Service) can be used for this. Some leading technology companies even run such tests in production as the ultimate proof of their resilience. Even with those tests, unexpected issues are still likely to occur. Gradual rollout techniques like canary testing, A/B testing, and blue/green or red/black deployments are therefore essential to limit the potential impact of such issues.
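To make one of the self-healing mechanisms listed above more concrete, here is a minimal sketch of a circuit breaker in Python. It is not a production implementation: the class names `CircuitBreaker` and `CircuitOpenError`, the parameters `max_failures` and `reset_timeout`, and the injectable `clock` are all assumptions made for this illustration. The idea is that after a number of consecutive failures, calls to a failing dependency are rejected immediately for a while, so the failure does not ripple through to the callers.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical names and defaults)."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so it can be tested
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of calling the broken dependency.
                raise CircuitOpenError("circuit open, call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a remote dependency, for example `breaker.call(fetch_balance, account_id)`, where `fetch_balance` stands for any hypothetical function that may fail. While the circuit is open, the caller receives a `CircuitOpenError` straight away and can fall back to a cached value or a degraded mode instead of waiting for a timeout.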

Organizations must invest in a comprehensive resilience strategy to guard against failures and minimize disruptions. Not only does this ensure a continuous, reliable service to customers, but regulators are also increasingly imposing resilience standards for critical financial applications. In Europe, the Digital Operational Resilience Act (see my blog "The Dawn of DORA: Building a Resilient Financial Infrastructure" - https://bankloch.blogspot.com/2024/05/the-dawn-of-dora-building-resilient.html) and the UK’s Operational Resilience Policy mandate stringent measures to ensure the stability of financial services.

In the end, failures are inevitable; it is simply a matter of time before they happen. Or to quote Werner Vogels, CTO of Amazon.com: "Everything fails all the time". The true differentiator is implementing the necessary measures to minimize their impact (design for failure).

For more insights, visit my blog at https://bankloch.blogspot.com 
