The tech companies of Silicon Valley (with Facebook as one of the most notable proponents) embrace attitudes such as "Move Fast and Break Things", "Sell/Ship First, Fix Later" and "Growth at any cost". This philosophy prioritizes a fast time-to-market (and being first to market) over quality. The idea is that if you aren’t breaking things, you’re delivering value too slowly. In a world of continuous evolution (especially in the tech sector), this seemed the right way forward.
This is in strong contrast with the large, established financial services companies, which often have only a few releases a year, each with multiple test phases and test cycles that often take months to complete.
As always in these types of evolutions, the 2 extremes (productivity/time-to-market versus predictability/quality) tend to converge to a middle ground. On the one hand, the incumbent banks are adopting Agile and DevOps methodologies and practices, allowing for faster and more gradual delivery, with the possibility to roll back. On the other hand, the major tech companies with millions of users have also understood that even a small bug can have an enormous impact given the size of their user base. Not surprisingly, Facebook changed its motto from "Move Fast and Break Things" to "Move Fast with Stable Infra" in 2014.
In the end, there is no golden rule for how software should be delivered. The degree of testing required and the associated speed of delivery depend on a number of factors:
- Impact of a bug (i.e. the cost of failure), determined by the criticality of the application, the number of impacted users and the nature of the damage (e.g. impact on human health or human lives, financial impact, legal/security implications…).
- Speed at which feedback can be obtained from users. Typically, a front-end with good feedback mechanisms through monitoring allows faster feedback than a complex back-end process.
- Ease of rolling back the change. E.g. an installation in which a database structure is modified is much more complex to roll back than a new version of a front-end screen.
- Possibility to roll back the resulting impact of the software. E.g. for an online view-only screen or report, the results of erroneous software do not need to be rolled back, while for software which sends out mails or SMSs, or which makes calculations and bulk updates in a database, rolling back the results can be very complex and difficult (even if the software itself can be easily rolled back).
- Time needed to debug the code, i.e. how well the code is structured and documented, how much knowledge of the code base there is in the development team, how easily a specific situation can be reproduced in a test environment…
- Cost of testing, i.e. how easily tests can be automated, which skills and coordination are needed, how easily test data can be set up…
Companies should therefore not have one single policy and one release calendar applicable to all changes, but should instead categorize changes according to the above criteria. Depending on the category, a faster or slower release schedule with less or more testing is appropriate.
Typically, you could work with 3 categories:
- Low impact changes, for which the software and the resulting impact can be easily rolled back, could be deployed to production immediately without any manual verification (i.e. only some automated tests). This type of change could also be deployed directly to the full user base.
- Medium impact changes, for which the software and the resulting impact can be easily rolled back, can also be deployed with little to no manual testing, but require a gradual deployment/roll-out. As it is difficult to organize a gradual roll-out for every small change, it is best to work with a lower release frequency, e.g. bi-weekly.
- High impact changes (i.e. mission-critical changes) or changes for which the software or the resulting impact cannot be easily rolled back should get extensive testing cycles and a gradual roll-out (move slowly and steadily). This means that releasing once every two months is probably the maximum obtainable frequency.
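To make the 3 categories above a bit more tangible, here is a minimal Python sketch of how a change could be mapped to a category and a release policy. All names (`Change`, `Category`, `ReleasePolicy`, `categorize`) are hypothetical and invented for this illustration, and the rule is deliberately simplistic: it only looks at the estimated impact and at the two roll-back criteria, while a real categorization would weigh all the factors listed earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Change:
    description: str
    impact: Category              # estimated cost of failure (assumed to be assessed upfront)
    software_rollback_easy: bool  # can the deployment itself be rolled back easily?
    results_rollback_easy: bool   # can the effects (mails, bulk updates...) be undone easily?


@dataclass(frozen=True)
class ReleasePolicy:
    cadence: str
    manual_testing: str
    gradual_rollout: bool


# Policies paraphrased from the three categories described in the text.
POLICIES = {
    Category.LOW: ReleasePolicy("immediately", "none (automated tests only)", False),
    Category.MEDIUM: ReleasePolicy("bi-weekly", "little to none", True),
    Category.HIGH: ReleasePolicy("every two months at most", "extensive test cycles", True),
}


def categorize(change: Change) -> Category:
    """Anything mission-critical or hard to roll back goes on the slow track."""
    easy_rollback = change.software_rollback_easy and change.results_rollback_easy
    if change.impact is Category.HIGH or not easy_rollback:
        return Category.HIGH
    return change.impact


if __name__ == "__main__":
    changes = [
        Change("new layout of a view-only screen", Category.LOW, True, True),
        Change("new simulation widget in the app", Category.MEDIUM, True, True),
        Change("bulk update of account balances", Category.MEDIUM, True, False),
    ]
    for change in changes:
        category = categorize(change)
        print(change.description, "->", category.value, POLICIES[category])
```

Note that in this sketch the bulk update ends up in the high impact category even though its estimated impact is only medium, because its results cannot be easily rolled back.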
Of course, with these different release frequencies, it is difficult to execute the testing for a change in the last category, as the surrounding software can evolve during the test cycles. It is therefore critical to isolate software components as much as possible through encapsulation and to automate testing as much as possible, so that (regression) testing cycles can be repeated very quickly.
In big financial services companies, it is difficult to introduce the first 2 categories, as they contradict the DNA of a typical bank or insurance company, which aims to manage (avoid or mitigate) risks in the best possible way (the industry is highly sensitive, and the reputation of a financial services company as a safe haven is essential).
However, if banks and insurance companies want to become innovative companies again, they need to try out new things, measure the impact and learn from it (probe, sense and respond). This means that releasing should be a continuous effort (at least to the test environment, and for the first and second categories of changes also to the production environment), which can be achieved via:
- Advanced DevOps practices, like CI/CD pipelines, continuous monitoring, Infrastructure as Code, automated testing, fast and automated roll-backs…
- Deployment strategies, like feature flags, canary testing, A/B testing…
- Well-trained and motivated software engineers, who deliver high-quality code, have a good understanding of the business impact and are able to analyse and fix potential bugs very quickly.
- IT development teams having (almost) direct contact with end-users, so that feedback cycles are instantaneous and IT departments have a good connection with the business world (ownership and commitment). This is in contrast to the current way of working, where IT developers are often 4-5 layers away from the end-user (i.e. often separated by key users, business/product owners, business analysts, functional analysts, technical analysts…).
- Good application architecture, where a maximum encapsulation between modules is realized (e.g. via a microservices-based architecture) and where fault isolation is built into the system (cf. my blog on Building resilient systems - https://bankloch.blogspot.com/2020/02/building-resilient-systems-in-financial.html)
- Good logging and monitoring, allowing automatic identification of issues and easy debugging via drill-down capabilities from business metrics to application metrics, all the way to log entries (cf. my blog on monitoring - https://bankloch.blogspot.com/2020/03/microservices-monitoring-disaggregate.html)
- An agile governance structure, i.e. no complex quality governance with reviews, sign-off points, quality gates…, but instead a governance where end-to-end ownership is promoted.
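As an illustration of the deployment strategies mentioned in the list above (feature flags combined with a gradual, canary-style roll-out), below is a minimal Python sketch. It is only one possible approach and not a reference to any specific feature-flag product: the function names, the flag name and the user ids are all hypothetical. The idea is to hash the flag name and the user id into a stable bucket between 0 and 99 and to enable the feature only for buckets below the current roll-out percentage.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket between 0 and 99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percentage: int) -> bool:
    """Enable the feature for roughly `rollout_percentage` percent of the users.

    The bucket of a user never changes, so a user who received the feature at
    10% keeps it when the roll-out is widened to 50%. Setting the percentage
    back to 0 acts as an immediate roll-back (kill switch).
    """
    return rollout_bucket(flag_name, user_id) < rollout_percentage


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percentage in (0, 5, 25, 100):
        enabled = sum(is_enabled("new-payment-screen", u, percentage) for u in users)
        print(f"{percentage:>3}% roll-out -> enabled for {enabled} of {len(users)} users")
```

Because the decision is taken at runtime, widening or rolling back the exposure does not require a new deployment, which is what makes this kind of strategy attractive for the first and second category of changes.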
The best way to achieve this change is to accept that even the best-tested software still has bugs (i.e. it is impossible to create 100% defect-free software and impossible to test everything).
As a result of this realization, banks and insurance companies should focus less on trying to eliminate all bugs before going to production and more on identifying bugs in production as quickly as possible (getting fast feedback from end-users and ensuring that this feedback reaches the IT development team as quickly as possible) and reacting to them as fast as possible (quick bug-fixing delivery cycles). Often a bug can even be a positive experience for an end-user, when the bug is fixed quickly and there is good communication around it. At that moment the end-user reporting the bug will feel appreciated and a more in-depth relationship can be built. Today, however, most banks and insurers sometimes leave production bugs unresolved for months (due to other priorities and the above-mentioned release cycles), which leads to enormous frustration among end-users.
Furthermore, one should realize that the long release cycles come at a major cost. Due to the enormous cost of late delivery (i.e. missing the gate for a release and impacting all dependent projects), IT teams start to optimize their work for deadlines, which reduces agility and cooperation, ultimately lowering overall productivity.
But even if the above transformation is realized, the extensive testing phases and cycles are still required for the last category of changes. With testing typically consuming around 30% of the project budget, one should definitely look at testing as a possible candidate for cost reduction.
Testing is important and requires a specific skill set, but my personal experience is that testing teams often tend to lose themselves in methodology that doesn’t always deliver much result. Below are some reflections on possible cost reductions with regard to testing:
- Unit testing: in the Waterfall model these types of tests were critical, as the time between code writing and functional testing (i.e. feedback on quality) was very long, meaning that the developer no longer had the code in mind, which resulted in high costs to fix a bug (increasing exponentially with every testing cycle).
Today, with the shorter test cycles and the automated tests executed in CI/CD pipelines, certain functional tests might be executed only minutes after the code commit. One should therefore wonder if it is not better to skip unit tests and replace them with more extensive functional end-to-end testing. Such end-to-end tests are more complete, easier to understand, require less maintenance and are easier to manage. Having seen too many projects where updating the unit tests following a small change took 2-3 times longer than the change itself, I consider this definitely a candidate for cost reduction.
Developers often argue that writing unit tests also helps to better understand and design the code and makes it easier to maintain the code, as the unit tests can act as a structured way of documenting the code. While I fully agree with these arguments (I don’t think anybody is claiming that unit tests are a bad thing), they don’t make the business case (i.e. is the time spent on creating and maintaining unit tests ultimately regained later on).
- Dedicated testing teams: while testing definitely requires a specific skill set, which is different from design or development, it is (in my belief) a bad idea to work with separate testing teams (competence centres) which execute tests for different projects, as they often have insufficient knowledge of the business, the project and the applications. This lack of in-depth knowledge usually leads to too strong a focus on UX testing, resulting in dozens of small, cosmetic defects, while critical business bugs are overlooked. It is therefore better to work with a dedicated tester within the scrum team, who also assists in the design process, but whose primary focus is quality assurance within the team.
- Test preparations: a good test preparation (and coordination between all involved parties) is essential to deliver good software. Nonetheless, I don’t think writing detailed test cases and test scripts is the right way to achieve this. These deliverables are typically very time-consuming to create and maintain and include a lot of repetitive information, which a tester who is well acquainted with the business and the application doesn’t need in order to execute the tests. Instead, it is much more efficient to create a test matrix, which contains the different combinations of flows/decision points. Such a matrix takes considerably less time to create and maintain and immediately shows the different areas to focus on during testing.
- Test automation: test automation is not cheap (to create and to maintain) and can be quite complex (and therefore error-prone) to set up. As a general guideline, you could say that setting up an automated test takes 5 to 10 times more effort than manually executing the test. Automation is therefore only relevant when you know that the test will be executed at least 10 times. However, with CI/CD pipelines executing continuous testing, this number is often reached in a few days. Test automation is therefore intrinsically linked to Agile software delivery.
The 2 most common areas for test automation are the automated calling of APIs and robots simulating user interactions on screens. If resources capable of test automation are scarce, one should focus first on automating API calls, as these are much easier to automate and less prone to change (i.e. tests are more regression-proof and APIs evolve less quickly than the front-end screens).
- User acceptance testing: try to involve end-users in the testing as soon as possible (and not only in the final acceptance testing phase), as these users can give very valuable feedback in a very short time. Of course, the time of these users should be used as efficiently as possible and expectations should be managed carefully when they are involved early in the testing process.
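The break-even guideline for test automation mentioned above (setting up an automated test takes 5 to 10 times the effort of one manual execution) can be expressed as a small calculation. The Python sketch below is only a back-of-the-envelope model under simplifying assumptions that are mine, not part of the guideline: the automated run itself costs nothing unless stated otherwise, and the maintenance effort of the automated test is ignored. The function names and the figure of 4 pipeline runs per day are hypothetical.

```python
import math


def break_even_runs(manual_cost: float, setup_factor: float,
                    automated_run_cost: float = 0.0) -> int:
    """Number of executions after which automation is no longer more expensive.

    manual_cost        -- effort of one manual execution (e.g. in hours)
    setup_factor       -- automation setup effort as a multiple of manual_cost (5 to 10)
    automated_run_cost -- effort of one automated execution (assumed negligible by default)
    """
    saving_per_run = manual_cost - automated_run_cost
    if saving_per_run <= 0:
        raise ValueError("automation never pays off if a run costs as much as a manual test")
    setup_cost = setup_factor * manual_cost
    return math.ceil(setup_cost / saving_per_run)


def days_to_break_even(runs_needed: int, runs_per_day: int) -> int:
    """Number of days a pipeline needs to reach the break-even number of runs."""
    return math.ceil(runs_needed / runs_per_day)


if __name__ == "__main__":
    for factor in (5, 10):
        runs = break_even_runs(manual_cost=1.0, setup_factor=factor)
        days = days_to_break_even(runs, runs_per_day=4)
        print(f"setup factor {factor}: break-even after {runs} runs, "
              f"reached in {days} day(s) at 4 pipeline runs per day")
```

With a setup factor of 10, the break-even point lies at 10 executions, which a pipeline running the test several times a day reaches within a few days. As soon as maintenance effort is taken into account, the break-even point moves further out.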
Independent of the chosen approach and methodology, the human aspect should never be underestimated. Software engineers who have a feeling of ownership, commitment and a sense of urgency will deliver better software quality at a faster pace, which ultimately leads to more motivation and further reinforces the feeling of vibrancy around the company. This (r)evolution will however only happen when companies allow themselves to build imperfect systems (embrace and reward initiative and the failure that comes with it).
Check out all my blogs on https://bankloch.blogspot.com/