
Microservices monitoring: disaggregate to optimize, aggregate to operate


Many organizations are moving from a monolithic architecture to a microservice-based architecture. Although this transformation brings many benefits with regard to organisation, flexibility (due to isolation) and development speed, it also comes with quite a few challenges. Most of those challenges are linked to the increased orchestration complexity introduced by microservices, which makes monitoring and observability more difficult. The fact that microservices are developed in different technologies and deployed on an infrastructure of multiple servers (often dynamically allocated by the container management system to achieve elastic scalability and high availability) doesn’t help either, and upcoming evolutions like Functions as a Service will make it even worse.
Add to this that software development has become more agile and more "try-and-adjust" oriented, instead of following the first-time-right philosophy of a few years ago, and monitoring every aspect of your system becomes more crucial than ever. In the spirit of Peter Drucker’s teaching "If you can’t measure it, you can’t improve it," measuring and monitoring become the cornerstone of modern product management.
Unfortunately, in many organisations monitoring is still very immature. Managers are often astounded that a simple end-user complaint like "My order was not executed" takes hours to analyse and often yields only a guess at the explanation. This leads to a lot of frustration for all parties involved and prevents the organisation from becoming fault-tolerant, i.e. one that continuously improves itself (when you cannot find the root cause of an issue, it is impossible to resolve it).
Therefore, it comes as no surprise that people are looking for tools which:
·        Reduce (and resolve) the monitoring and observability challenges
·        Offer the visibility and understanding of the holistic application architecture
·        Provide actionable outputs in case of issues

Luckily, technologies that provide this end-to-end view on each business process are continuously improving, but there is still a long way to go before arriving at a solution that allows organizations to easily analyse and reproduce any issue. The introduction of distributed tracing and monitoring has been a huge step forward, but many problems remain, such as:
·        If the end-user experiencing the issue can provide the unique trace ID, it is easy to analyse the trace, but if not, more complex searches on user ID or timestamp are required, which are far from straightforward:
·       Searching on user ID used to be common, but today the concept of a user is becoming less and less clear, as applications are multi-channel, used by many partners through APIs, and connected directly to IoT devices.
·       Searching on timestamp is also complex due to the large number of service calls, which means that even a time span of a few minutes can already include thousands of requests to search through.
·        When the cause of an issue has been found, the issue should be reproduced in a test environment in order to fix it. With a monolithic application, often a scrambled (removing all personal and sensitive data) production back-up is restored on the test environment after which the issue can easily be investigated, reproduced, and retested (once the fix is implemented). With a microservice-based architecture with many loose components, each having its own database (schema), this is a completely different story.
·        The multitude of different tools, often open-source (Kubernetes, Istio, Prometheus, Kibana, Grafana, Elasticsearch, Zipkin…) but also vendor-specific (Capterra lists 132 vendors in the space of Application Performance Measurement alone), each providing one piece of the monitoring puzzle (one tool for monitoring, another for visualization, and yet another for metrics storage), makes it very hard to get an end-to-end view. For example, service meshes like Istio can provide a common solution that monitors and traces all incoming and outgoing traffic, but to get details from inside a microservice, developers still need to instrument their code directly.
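
The difference between searching on a trace ID and searching on a timestamp can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration (the in-memory `LogStore`, the `LogRecord` fields, the three service names and the simulated load of one request per second); it does not reflect how any particular tracing tool stores its data.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogRecord:
    trace_id: str
    service: str
    timestamp: float  # seconds on an arbitrary clock
    message: str


class LogStore:
    """Toy in-memory log store indexed by trace ID (hypothetical)."""

    def __init__(self):
        self.records = []
        self.by_trace = defaultdict(list)

    def add(self, record):
        self.records.append(record)
        self.by_trace[record.trace_id].append(record)

    def find_by_trace(self, trace_id):
        # Direct lookup: returns only the records of one request.
        return sorted(self.by_trace.get(trace_id, []), key=lambda r: r.timestamp)

    def find_by_time(self, start, end):
        # Fallback: every record in the window, from all requests mixed together.
        return [r for r in self.records if start <= r.timestamp <= end]


def handle_order(store, now, trace_id=None):
    """Simulate one request crossing three services, passing the trace ID along."""
    trace_id = trace_id or uuid.uuid4().hex
    for offset, service in enumerate(["gateway", "orders", "payments"]):
        store.add(LogRecord(trace_id, service, now + offset * 0.01, f"{service} called"))
    return trace_id


store = LogStore()
# Simulated load: one request per second, 500 requests in total.
trace_ids = [handle_order(store, now=1000.0 + i) for i in range(500)]

one_trace = store.find_by_trace(trace_ids[42])
window = store.find_by_time(1000.0, 1119.5)  # roughly a two-minute window
print(len(one_trace), "records for the trace;", len(window), "records in the time window")
```

Even at this very modest simulated load, the time-window search returns the records of every request in the window, which then still have to be narrowed down by other criteria, while the trace ID leads straight to the one request.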

In short, the disaggregation that microservices introduce to optimize development speed has to be aggregated again in order to operate efficiently.
Ultimately, wouldn’t you dream of a single monitoring platform, which incorporates all aspects of monitoring (logging, alerting, metrics/SLAs, dashboards…) in a single consolidated user cockpit? Such a tool could proactively inform you about anomalies, allow the user to zoom in and out on any event occurring in the company, and resolve the operational complexities introduced by microservices.

Such a platform would provide basic features like:
·        Distributed tracing, i.e. the ability to follow a request throughout its lifecycle
·        Aggregation of log files, with the ability to quickly search inside log files
·        Dashboarding:
·       Providing different views (like performance, availability, data flows), which make it possible to answer questions like which queries have the slowest response times, which URLs are seeing the most errors, which services are using the most CPU, and where the availability bottlenecks are
·       Making it easy to drill down from the logical application level into physical containers (i.e., from business monitoring to application monitoring to technical/infrastructure monitoring). This can be especially challenging as containers have shorter and shorter lifetimes inside the container orchestration system.
·       On-the-fly aggregation, making it possible to aggregate information from all the containers associated with a function or a service. For example, being able to determine the total disk usage associated with a service and see the impact of downtime of a container on the overall business service.
·        KPIs/Metrics: the platform should calculate KPIs/metrics at different levels (from business down to infrastructure level), with a focus on business-level metrics but with the ability to drill down to lower-level metrics.
·        Alerting: the ability to generate alerts in case of specific errors or KPIs moving outside their thresholds, with the possibility to:
·       Drill down on an alert from user/business level to application level, to container level, and down to infrastructure level.
·       Link runbooks to each alert
·       Automatically derive the business impact of a lower-level alert, i.e., translate a lower-level alert to a higher level, so that you can immediately explain the business impact of a lower-level issue (e.g., what is the negative impact of a disk space issue on the business transactions).
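
The on-the-fly aggregation and the derivation of business impact described above could be sketched as follows. This is a minimal Python illustration under stated assumptions: the container IDs, service names, business processes and the two mapping tables are all hypothetical, and the severity rule simply assumes that the other containers of the same service are still healthy.

```python
from collections import defaultdict

# Hypothetical topology: which container runs which service,
# and which business processes depend on which service.
CONTAINER_TO_SERVICE = {
    "c-101": "payments",
    "c-102": "payments",
    "c-201": "catalog",
}
SERVICE_TO_PROCESSES = {
    "payments": ["checkout", "refund"],
    "catalog": ["product search"],
}


def disk_usage_per_service(samples):
    """Sum per-container disk usage (in GB) into a per-service total."""
    totals = defaultdict(float)
    for container_id, used_gb in samples.items():
        totals[CONTAINER_TO_SERVICE[container_id]] += used_gb
    return dict(totals)


def business_impact(container_id):
    """Translate a container-level alert into the business processes at risk."""
    service = CONTAINER_TO_SERVICE[container_id]
    siblings = [
        c for c, s in CONTAINER_TO_SERVICE.items()
        if s == service and c != container_id
    ]
    return {
        "service": service,
        "processes": SERVICE_TO_PROCESSES.get(service, []),
        # Assumption: sibling containers are healthy. With no sibling left,
        # the service is treated as down rather than merely degraded.
        "severity": "degraded" if siblings else "outage",
    }


usage = disk_usage_per_service({"c-101": 12.5, "c-102": 7.5, "c-201": 3.0})
impact = business_impact("c-201")
print(usage)
print(impact)
```

In a real platform these mappings would have to be discovered continuously from the container orchestration system, precisely because containers are so short-lived.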

Many platforms, such as Sensu (www.sensu.io) and Datadog, already provide most of these functionalities and offer predefined integrations with specialized tools such as metrics analytics tools (e.g. Elasticsearch, Splunk…), incident tools (e.g. PagerDuty, ServiceNow, VictorOps…) or DevOps tools (e.g. Git, Ansible, Jenkins…).

But this would only be a beginning. More in-depth integrations and features would create an operational user cockpit:
·        Data security: in order to debug and reproduce an issue, ideally you would like to retrieve all inputs and outputs of each involved microservice. Unfortunately, these inputs and outputs often contain confidential customer data, which requires the same level of security as the operational database. Solutions should therefore be found to store them in a secure (encrypted) database while still being able to join them at run-time with the logging information when desired. The monitoring tool should take care of this joining in the most secure way possible, i.e. enforcing security roles, auditing all data accesses, and scrambling personal data.
·        Proactive monitoring: most monitoring is still reactive monitoring - an issue occurs on the system, someone is alerted, and further investigation is done. This should evolve toward proactive monitoring, which identifies (potentially helped by an AI model) negative trends before they cause an incident. This can be very challenging, as generating too many false positives can lead to a lot of lost time. Furthermore, as systems become more and more dynamic (continuous deployments, dynamic infrastructure, etc.) and usage can also be peaky, it becomes more and more difficult to distinguish abnormal trends from normal variation.
·        Engine to automatically trigger actions (easily configurable via a rule engine), based on certain alerts. E.g., automatically provision extra servers/disks in case of CPU/disk space issues, automatically shut down services consuming too many resources (i.e. the Circuit Breaker pattern) to avoid one failing service bringing down the whole system, and automatic rate limiting (throttling) in case of degrading performance due to overload.
·        Service Level Agreement management and monitoring: as indicated above, the monitoring platform should measure all KPIs and metrics, but ideally it should also be possible to configure in the tool all SLAs that the organisation should honor (at different levels, such as business, application, and infrastructure). This way breaches of SLAs can trigger an immediate alert and SLA reporting can be easily generated.
·        The monitoring tool should also be able to identify fraud and security breaches (hacking attacks). Such activities will typically result in abnormal patterns in certain flows, which can be automatically identified by the monitoring platform. For example, abnormal CPU activity could be a sign of cryptojacking, and large data transfers to an unknown IP address could be an indication of data theft.
·        Integration with website and app analytics tools (like Google Analytics, Adobe Analytics and Google Firebase). These tools capture the key information of the usage of the website and collect info like number of visitors, where they’re coming from, and what pages of the website they are visiting. If this info can be automatically combined with the data available in the monitoring platform, much more detailed insights can be obtained. E.g., website analytics tools might indicate that users drop out in a sales funnel on step 3, while the integration with the monitoring tool could help analyse the cause of (part of) the drop-out, such as performance or availability issues.
·        Integration with chaos engineering (e.g. Netflix Simian Army or Gremlin): more and more companies are introducing chaos engineering, or resilience testing, directly into the production environment. These deliberately created incidents, to which the system should be able to respond automatically, should be identified and filtered out by the monitoring system, to avoid investigating irrelevant issues. However, if the system did not respond correctly to an introduced fault, this should still be reported.
·        Integration with release management tools and CI/CD pipelines (e.g. GitLab, Jenkins, CircleCI, TeamCity, Bamboo, GoCD…): a deep integration of the monitoring platform with these tools can be incredibly useful to:
·       Identify (and filter out) any issues linked to a deployment, with known availability impacts.
·       Drill down on a monitoring issue to see all related deployments and even the related source code commits.
·       Automatically roll back a deployment in case of regressions identified in the KPIs/metrics.
·       Automatically roll forward in a gradual (canary) deployment, i.e., increase the target audience of the new version.
·       Easy comparison of all metrics in an A/B deployment.
·       View historical graphs of usage, availability, performance, etc., and be able to link them with relevant version changes. This can enable users to pinpoint gradual degradations to a past code change.
·        Integration with a feature flag tool (e.g. Rollout, LaunchDarkly, Optimizely…), which allows users to activate/deactivate features based on different segmentations. Just like the integration with the release management tool and CI/CD pipeline, this integration could automatically roll back and roll forward the activation of new features based on monitoring metrics.
·        Integration with a defect management system (e.g. JIRA, ALM…), to automatically create defects (including the automatic attaching of logging and monitoring information) and close them based on the observations in the monitoring tool.
·        Integration with chatbox and chatbots, allowing easy communication of alerts, but also allowing easy retrieval of monitoring information and investigation of incidents by chatting with the chatbot.
·        Integration with crash reporting tools, which collect all information at the end-user side (i.e., all device, browser, and application information) in case of a crash. This information should also be stored in the monitoring platform, so that it can be linked to a trace and to all other features described above (like the defect management system and chatbox/chatbots).
·        Easy exposure of business metrics via APIs or widgets (with possibilities for basic look-and-feel customizations), displayed via monitoring dashboards that are fully integrated in an end-user application.
·        Automatically generate workflow diagrams and documentation, based on the actual execution of processes. This allows users to get up-to-date documentation of all processes, based on actual flows, rather than a process analyst having to manually update the documentation of how a system is supposed to work (which might be different from the actual behaviour).
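
The engine that automatically triggers actions based on alerts, mentioned in the list above, can be sketched as a small rule table. This is only an illustration under stated assumptions: the `Alert` fields, the metric names, the thresholds and the action descriptions are all hypothetical, and the actions merely return a description of what would be done instead of calling any real provisioning or orchestration API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    service: str
    metric: str   # hypothetical names: "cpu_percent", "disk_free_percent", "error_rate"
    value: float


@dataclass
class Rule:
    name: str
    matches: Callable[[Alert], bool]
    action: Callable[[Alert], str]


# Thresholds are illustrative assumptions, not recommendations.
RULES = [
    Rule("scale out on high CPU",
         lambda a: a.metric == "cpu_percent" and a.value > 90,
         lambda a: f"provision extra instance for {a.service}"),
    Rule("extend disk when nearly full",
         lambda a: a.metric == "disk_free_percent" and a.value < 10,
         lambda a: f"attach extra disk to {a.service}"),
    Rule("isolate failing service",
         lambda a: a.metric == "error_rate" and a.value > 0.5,
         lambda a: f"open circuit breaker for {a.service}"),
]


def evaluate(alert, rules=RULES):
    """Return a description of every action triggered by the alert."""
    return [rule.action(alert) for rule in rules if rule.matches(alert)]


print(evaluate(Alert("orders", "cpu_percent", 95.0)))
print(evaluate(Alert("orders", "cpu_percent", 50.0)))
```

Keeping the rules as data rather than hard-coded branches is what makes such an engine "easily configurable": adding a throttling rule, for example, means adding one more entry to the table.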

Most monitoring players, such as Sensu (www.sensu.io) and Datadog, but also players in the DevOps space (e.g. GitLab), have already understood these needs, so heavy investments are being made in this space. To my knowledge, however, there is no player in the market yet which offers all those functionalities. Some players provide multiple pre-built integrations with different specialized tools, but as there is often an overlap in functionality and only a data integration (not an integration of the front-end layer), this does not yet give the ideal user experience of a fully integrated cockpit.

I, for one, am looking forward to a bright future, where being on call will no longer be such a burden. The monitoring platform would not only automatically resolve most standard issues, but in case a manual intervention is required, you would have all the information at your fingertips. Having to wake up at 4 o’clock on a Saturday night will never be a dream, but let’s no longer make it a nightmare.
