
Microservices monitoring: disaggregate to optimize, aggregate to operate


Many organizations are moving from a monolithic architecture to a microservice-based architecture. Although this transformation brings many benefits with regard to organisation, flexibility (due to isolation) and development speed, it also comes with quite a few challenges. Most of those challenges are linked to the increased orchestration complexity introduced by microservices, which makes monitoring and observability more difficult. The fact that microservices are developed in different technologies and deployed on an infrastructure of multiple servers (often dynamically allocated by the container management system to achieve elastic scalability and high availability) doesn’t help either, and upcoming evolutions like Functions as a Service will make it even harder.
Combined with the fact that software development has become more agile and more "try-and-adjust" oriented, instead of following the first-time-right philosophy of a few years ago, this makes monitoring every aspect of your system more crucial than ever. In the spirit of the maxim commonly attributed to Peter Drucker, "If you can’t measure it, you can’t improve it," measuring and monitoring become the cornerstone of modern product management.
Unfortunately, in many organisations monitoring is still very immature. Managers are often astounded that a simple end-user complaint like "My order was not executed" takes hours to analyse and often results in nothing more than a guess at the explanation. This frustrates everyone involved and prevents the organisation from becoming fault-tolerant, i.e. an organisation that continuously improves itself: when you cannot find the root cause of an issue, it is impossible to resolve it.
It therefore comes as no surprise that people are looking for tools which:
·        Reduce (and resolve) the monitoring and observability challenges
·        Offer visibility into, and understanding of, the holistic application architecture
·        Provide actionable outputs in case of issues

Luckily, technologies that provide this end-to-end view of each business process are continuously improving, but there is still a long way to go before arriving at a solution that allows organizations to easily analyse and reproduce any issue. The introduction of technologies like distributed tracing and monitoring has been a huge step forward, but many problems remain, such as:
·        If the end-user experiencing the issue can provide the unique trace ID, it is easy to analyse the trace, but if not, more complex searches on user ID or timestamp are required, which are far from straightforward:
·       Searching on user ID used to be common, but today the concept of a user is becoming less and less clear, as applications are multi-channel, used by many partners through APIs, and connected directly to IoT devices.
·       Searching on timestamp is also complex due to the large number of service calls, which means that even a time span of a few minutes can already include thousands of requests to search through.
·        Once the cause of an issue has been found, the issue should be reproduced in a test environment in order to fix it. With a monolithic application, a scrambled production back-up (i.e. with all personal and sensitive data removed) is often restored on the test environment, after which the issue can easily be investigated, reproduced, and retested once the fix is implemented. With a microservice-based architecture consisting of many loosely coupled components, each having its own database (schema), this is a completely different story.
·        The multitude of different tools, often open-source (Kubernetes, Istio, Prometheus, Kibana, Grafana, Elasticsearch, Zipkin…) but also vendor-specific (Capterra lists 132 vendors in the space of Application Performance Management alone), each providing one piece of the monitoring puzzle (one tool for monitoring, another for visualization, and yet another for metrics storage), makes it very hard to get an end-to-end view. For example, service meshes like Istio can provide a common solution, making it possible to monitor/trace all incoming/outgoing traffic, but in order to get details from inside the microservice, developers still need to instrument their code directly.
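
To make the search problem from the first bullet above more tangible, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Span` record, the `TraceIndex` class and the sample data are hypothetical and do not correspond to the API of Zipkin or any other tracing tool. It only shows why a lookup by trace ID is trivial, while a search by time window and user ID quickly becomes fuzzy.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One service call within a trace (hypothetical, simplified record)."""
    trace_id: str
    service: str
    user_id: Optional[str]   # often unknown for partner API or IoT traffic
    started_at: datetime
    error: bool = False


class TraceIndex:
    """In-memory stand-in for the span store of a tracing backend."""

    def __init__(self, spans):
        self._by_trace = defaultdict(list)
        for span in spans:
            self._by_trace[span.trace_id].append(span)

    def by_trace_id(self, trace_id):
        """Easy case: the trace ID is known, so the lookup is direct."""
        spans = self._by_trace.get(trace_id, [])
        return sorted(spans, key=lambda s: s.started_at)

    def candidates(self, start, end, user_id=None, errors_only=False):
        """Hard case: scan a time window and return every trace ID that has
        at least one span satisfying all the given filters."""
        hits = set()
        for trace_id, spans in self._by_trace.items():
            for s in spans:
                if not (start <= s.started_at <= end):
                    continue
                if user_id is not None and s.user_id != user_id:
                    continue
                if errors_only and not s.error:
                    continue
                hits.add(trace_id)
                break
        return sorted(hits)


# Illustrative sample data (invented for this sketch).
T0 = datetime(2024, 1, 1, 12, 0, 0)
INDEX = TraceIndex([
    Span("t1", "gateway", "alice", T0),
    Span("t1", "orders", None, T0 + timedelta(seconds=1), error=True),
    Span("t2", "gateway", "bob", T0 + timedelta(seconds=30)),
    Span("t3", "gateway", None, T0 + timedelta(seconds=60), error=True),
])
```

In this toy data set, `INDEX.by_trace_id("t1")` immediately returns the two spans of the failing request. Searching the same two-minute window for errors of user "alice" returns nothing, because the user ID is only known on the gateway span while the error occurred in the downstream `orders` span. This is exactly the kind of gap that makes searches on user ID or timestamp far from straightforward.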

In short, we can conclude that what microservices disaggregate to optimize development speed needs to be aggregated again to operate efficiently.
Ultimately, wouldn’t you dream of a single monitoring platform which incorporates all aspects of monitoring (logging, alerting, metrics/SLAs, dashboards…) in a single consolidated user cockpit? Such a tool could proactively inform you about anomalies, allow the user to zoom in and out on any event occurring in the company, and resolve the operational complexities introduced by microservices.

Such a platform would provide basic features like:
·        Distributed tracing, i.e. the ability to follow a request throughout its lifecycle
·        Aggregation of log files, with the ability to quickly search inside log files
·        Dashboarding:
·       Providing different views (like performance, availability, data flows), making it possible to answer questions like which queries have the slowest response times, which URLs are seeing the most errors, which services are using the most CPU, and where the availability bottlenecks are
·       Making it easy to drill down from the logical application level into physical containers (i.e., from business monitoring to application monitoring to technical/infrastructure monitoring). This can be especially challenging as containers have shorter and shorter lifetimes inside the container orchestration system.
·       On-the-fly aggregation, making it possible to aggregate information from all the containers associated with a function or a service. For example, being able to determine the total disk usage associated with a service and see the impact of downtime of a container on the overall business service.
·        KPIs/Metrics: the platform should calculate KPIs/metrics at different levels (from business level down to infrastructure level), with a focus on business-level metrics but with the ability to drill down to lower-level metrics.
·        Alerting: the ability to generate alerts in case of specific errors or KPIs exceeding their boundaries, with the possibility to:
·       Drill down on an alert from user/business level to application level, to container level, and finally to infrastructure level.
·       Link runbooks to each alert
·       Automatically derive the business impact of a lower-level alert, i.e., translate a lower-level alert into its higher-level consequences, so that you can immediately explain the business impact of a lower-level issue (e.g., the negative impact of a disk space issue on business transactions).
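
The on-the-fly aggregation and the derivation of business impact listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ContainerSample` record, the service names and the `SERVICE_TO_PROCESSES` mapping are all hypothetical, and a real platform would derive such a topology from traces or from the orchestrator rather than from a hard-coded dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerSample:
    """Latest reading for one container (hypothetical, simplified)."""
    container_id: str
    service: str
    disk_gb: float
    healthy: bool = True


# Assumed topology: which business processes depend on which service.
SERVICE_TO_PROCESSES = {
    "order-service": ["Place order", "Cancel order"],
    "payment-service": ["Place order"],
    "catalog-service": ["Browse products"],
}


def disk_usage_by_service(samples):
    """Aggregate container-level disk usage up to the service level."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.service] += s.disk_gb
    return dict(totals)


def business_impact(samples):
    """For every unhealthy container, report its service, how many healthy
    replicas of that service remain, and the business processes at risk."""
    healthy_replicas = defaultdict(int)
    for s in samples:
        healthy_replicas[s.service] += int(s.healthy)
    impact = {}
    for s in samples:
        if not s.healthy:
            impact[s.container_id] = {
                "service": s.service,
                "healthy_replicas": healthy_replicas[s.service],
                "processes": SERVICE_TO_PROCESSES.get(s.service, []),
            }
    return impact


# Illustrative sample data (invented for this sketch).
SAMPLES = [
    ContainerSample("c1", "order-service", 12.5),
    ContainerSample("c2", "order-service", 7.5, healthy=False),
    ContainerSample("c3", "payment-service", 4.0),
]
```

With this sample data, `disk_usage_by_service(SAMPLES)` sums the two `order-service` containers into a single figure, and `business_impact(SAMPLES)` tells you that the failing container `c2` puts "Place order" and "Cancel order" at risk while one healthy replica is still serving them. Because containers are short-lived, the aggregation key is deliberately the service and not the container ID.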

Many platforms, such as Sensu (www.sensu.io) and Datadog, already provide most of these features and offer predefined integrations with specialized tools such as metrics analytics tools (e.g. Elasticsearch, Splunk…), incident tools (e.g. PagerDuty, ServiceNow, VictorOps…) or DevOps tools (e.g. Git, Ansible, Jenkins…).

But this would only be the beginning. More in-depth integrations and features would create a true operational user cockpit:
·        Data security: in order to debug and reproduce an issue, ideally you would like to retrieve all inputs and outputs of each involved microservice. Unfortunately, these inputs and outputs often contain confidential customer data, which requires the same level of security as the operational database. Solutions should therefore be found to store them in a secure (encrypted) database, while still being able to join them at run-time with the logging information when desired. The monitoring tool should take care of this joining in the most secure way possible, i.e. enforcing security roles, auditing all data accesses, and scrambling personal data.
·        Proactive monitoring: most monitoring is still reactive - an issue occurs on the system, someone is alerted, and further investigation is done. This should evolve into proactive monitoring, which identifies (potentially helped by an AI model) negative trends before they cause an incident. This can be very challenging, as generating too many false positives leads to a lot of lost time. Furthermore, as systems become more and more dynamic (continuous deployments, dynamic infrastructure, etc.) and usage can be peaky, it becomes more and more difficult to tell abnormal trends apart from normal variation.
·        An engine to automatically trigger actions (easily configurable via a rule engine) based on certain alerts. E.g., automatically provisioning extra servers/disks in case of CPU/disk space issues, automatically shutting down services that consume too many resources (cf. the Circuit Breaker pattern) to avoid one failing service bringing down the whole system, and automatic rate limiting (throttling) in case of degrading performance due to overload.
·        Service Level Agreement management and monitoring: as indicated above, the monitoring platform should measure all KPIs and metrics, but ideally it should also be possible to configure in the tool all SLAs which the organisation should honor (at different levels, such as business, application, and infrastructure). This way, SLA breaches can be alerted on immediately and SLA reports can be generated easily.
·        The monitoring tool should also be able to identify fraud and security breaches (hacking attacks). Such activities will typically result in abnormal patterns in certain flows, which can be automatically identified by the monitoring platform. For example, abnormal CPU activity could be a sign of cryptojacking, and large data transfers to an unknown IP address could be an indication of data theft.
·        Integration with website and app analytics tools (like Google Analytics, Adobe Analytics and Google Firebase). These tools capture key information about the usage of the website, such as the number of visitors, where they’re coming from, and which pages they are visiting. If this info can be automatically combined with the data available in the monitoring platform, much more detailed insights can be obtained. E.g., website analytics tools might indicate that users drop out of a sales funnel at step 3, while the integration with the monitoring tool could help analyse the cause of (part of) the drop-out, such as performance or availability issues.
·        Integration with chaos engineering (e.g. Netflix Simian Army or Gremlin): more and more companies are introducing chaos engineering, or resilience testing, directly into the production environment. These deliberately created incidents, to which the system should be able to respond automatically, should be identified and filtered out by the monitoring system, to avoid investigating irrelevant issues. However, if the system did not respond correctly to an introduced fault, this should still be reported.
·        Integration with release management tools and CI/CD pipelines (e.g. GitLab, Jenkins, CircleCI, TeamCity, Bamboo, GoCD…): a deep integration of the monitoring platform with these tools can be incredibly useful to:
·       Identify (and filter out) any issues linked to a deployment with known availability impacts.
·       Drill down on a monitoring issue to see all related deployments and even the related source code commits.
·       Automatically roll back a deployment in case of regressions identified in the KPIs/metrics.
·       Automatically roll forward in a gradual (canary) deployment, i.e., increase the target audience of the new version.
·       Easily compare all metrics in an A/B deployment.
·       View historical graphs of usage, availability, performance, etc., and link them with relevant version changes. This enables users to pinpoint gradual degradations to a past code change.
·        Integration with a feature flag tool (e.g. Rollout, LaunchDarkly, Optimizely…), which allows users to activate/deactivate features based on different segmentations. Just like the integration with the release management tool and CI/CD pipeline, this integration would automatically roll back and roll forward the activation of new features based on monitoring metrics.
·        Integration with a defect management system (e.g. JIRA, ALM…), to automatically create defects (with logging and monitoring information attached automatically) and close them, based on the observations in the monitoring tool.
·        Integration with chat tools and chatbots, allowing easy communication of alerts, but also easy retrieval of monitoring information and investigation of incidents by chatting with the chatbot.
·        Integration with crash reporting tools, which collect all information at the end-user side (i.e., all device, browser, and application information) in case of a crash. This information should also be stored in the monitoring platform, so that it can be linked to a trace and to all other features described above (like the defect management system and chat tools/chatbots).
·        Easy exposure of business metrics via APIs or widgets (with possibilities for basic look-and-feel customizations), displayed via monitoring dashboards that are fully integrated in an end-user application.
·        Automatically generate workflow diagrams and documentation, based on the actual execution of processes. This allows users to get up-to-date documentation of all processes, based on actual flows, rather than a process analyst having to manually update the documentation of how a system is supposed to work (which might be different from the actual behaviour).
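
As an illustration of how several of the items above could fit together (the action engine, the filtering of chaos experiments and the automatic rollback of a deployment), here is a minimal rule-engine sketch in Python. It is not how any particular product works: the `Alert` fields, the rule names and the action strings are all hypothetical, and the engine only decides which action to take without executing anything.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record produced by the monitoring platform."""
    metric: str                        # e.g. "cpu", "disk", "latency_p95", "error_rate"
    service: str
    value: float
    threshold: float
    chaos_experiment: bool = False     # raised during a planned fault injection
    recovered: bool = False            # the system healed itself
    deployment_id: Optional[str] = None  # set when linked to a recent deployment


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Alert], bool]
    action: Callable[[Alert], str]


# Rules are evaluated in order; the first match wins.
RULES = [
    Rule("ignore-handled-chaos",
         lambda a: a.chaos_experiment and a.recovered,
         lambda a: "suppress"),
    Rule("rollback-regression",
         lambda a: a.deployment_id is not None and a.metric == "error_rate",
         lambda a: f"rollback {a.deployment_id}"),
    Rule("scale-out-on-cpu",
         lambda a: a.metric == "cpu",
         lambda a: f"scale-out {a.service}"),
    Rule("extend-disk",
         lambda a: a.metric == "disk",
         lambda a: f"add-disk {a.service}"),
    Rule("throttle-on-latency",
         lambda a: a.metric == "latency_p95",
         lambda a: f"rate-limit {a.service}"),
]


def decide(alert, rules=RULES):
    """Return the action for an alert, or hand over to a human."""
    if alert.value <= alert.threshold:
        return "none"                  # still within bounds, nothing to do
    for rule in rules:
        if rule.matches(alert):
            return rule.action(alert)
    return f"page-on-call {alert.service}"
```

For example, `decide(Alert("error_rate", "orders", 0.12, 0.02, deployment_id="rel-42"))` yields `"rollback rel-42"`, while a CPU alert raised during a chaos experiment from which the system recovered is suppressed. If the system did not recover from the injected fault, the alert falls through to the normal rules and is still acted upon. Keeping the rules as ordered data rather than code makes it easier to expose them for configuration, which is the point of the rule engine mentioned above.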

Most monitoring players (e.g. Sensu and Datadog), but also players in the DevOps space (e.g. GitLab), have already understood these needs, so heavy investments are being made in this space. To my knowledge, however, there is no player in the market yet which offers all these functionalities. Some players provide multiple pre-built integrations with different specialized tools, but as there is often an overlap in functionality and the integration is limited to data (not the front-end layer), this does not yet give the ideal user experience of a fully integrated cockpit.

I, for one, am looking forward to a bright future in which being on call will no longer be such a burden. The monitoring platform would not only automatically resolve most standard issues, but in case a manual intervention is required, you would have all the information at your fingertips. Having to wake up at 4 o’clock on a Saturday night will never be a dream, but let’s no longer make it a nightmare.
