Many organisations are moving from a monolithic architecture to a microservice-based architecture. Although this transformation brings many benefits with regards to organisation, flexibility (due to isolation) and development speed, it also comes with significant challenges. Most of those challenges are linked to the increased orchestration complexity introduced by microservices, which makes monitoring and observability more difficult. The fact that the microservices are developed in different technologies and deployed on an infrastructure of multiple servers (often dynamically allocated by the container management system to achieve elastic scalability and high availability) does not help either, and upcoming evolutions like Functions as a Service will only make it harder.
Combine this with the fact that software development has become more agile and more "try-and-adjust" oriented, rather than following the first-time-right philosophy of a few years ago, and monitoring every aspect of your system becomes more crucial than ever. In the spirit of Peter Drucker’s teaching "If you can’t measure it, you can’t improve it," measuring and monitoring have become the cornerstone of modern product management.
Unfortunately, in many organisations monitoring is still very immature. Managers are often astounded by the fact that a simple end-user question like "Why was my order not executed?" takes hours to analyse and often results in only a guess at the explanation. This leads to a lot of frustration for all parties involved and prevents the organisation from becoming a fault-tolerant one, i.e. one that continuously improves itself (when you cannot find the root cause of an issue, it is impossible to resolve it).
Therefore, it comes as no surprise that people are looking for tools which:
· Reduce (and resolve) the monitoring and observability challenges
· Offer visibility into, and understanding of, the holistic application architecture
· Provide actionable output in case of issues
Luckily, technologies that provide this end-to-end view of each business process are continuously improving, but there is still a long way to go before arriving at a solution that allows organisations to easily analyse and reproduce any issue. The introduction of a technology like distributed tracing has been a huge step forward, but many problems remain, such as:
· If the end-user experiencing the issue can provide the unique trace ID, it is easy to analyse the trace (see the sketch after this list), but if not, more complex searches on user ID or timestamp are required, which are far from straightforward:
    · Searching on user ID used to be common, but today the concept of a user is becoming less and less clear, as applications are multi-channel, used by many partners through APIs, and connected directly to IoT devices.
    · Searching on timestamp is also complex due to the large number of service calls, which means that even a time span of a few minutes can already include thousands of requests to search through.
· When the cause of an issue has been found, the issue should be reproduced in a test environment in order to fix it. With a monolithic application, a scrambled production back-up (with all personal and sensitive data removed) is often restored in the test environment, after which the issue can easily be investigated, reproduced and retested (once the fix is implemented). With a microservice-based architecture consisting of many loose components, each with its own database (schema), this is a completely different story.
· The multitude of different tools, often open source (Kubernetes, Istio, Prometheus, Kibana, Grafana, Elasticsearch, Zipkin…) but also vendor-specific (Capterra lists 132 vendors in the space of Application Performance Monitoring alone), each providing a part of the monitoring puzzle (one tool for monitoring, another for visualization, and yet another for metrics storage), makes it very hard to get an end-to-end view. For example, service meshes like Istio can provide a common solution for monitoring and tracing all incoming and outgoing traffic, but in order to get details from inside a microservice, developers still need to instrument their code directly.
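To make the trace-ID point above concrete, here is a minimal sketch of looking up a single trace by its ID against a Zipkin-compatible backend, contrasted with the much broader query you are forced into when only a rough timestamp is known. The paths follow Zipkin’s v2 HTTP API; the host name and service name are placeholders.

    import requests

    ZIPKIN = "http://zipkin.example.internal:9411"  # placeholder host

    def trace_by_id(trace_id: str) -> list:
        # Cheap and precise: one trace, identified by the ID the user reported.
        resp = requests.get(f"{ZIPKIN}/api/v2/trace/{trace_id}", timeout=5)
        resp.raise_for_status()
        return resp.json()

    def traces_around(service: str, end_ts_ms: int, window_ms: int = 5 * 60 * 1000) -> list:
        # Expensive and fuzzy: every trace of a service in a time window,
        # which on a busy system can easily mean thousands of candidates.
        resp = requests.get(
            f"{ZIPKIN}/api/v2/traces",
            params={"serviceName": service, "endTs": end_ts_ms,
                    "lookback": window_ms, "limit": 1000},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()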
In short, we can conclude that the disaggregation introduced by microservices to optimize development speed needs to be aggregated again in order to operate efficiently.
Ultimately, wouldn’t you dream of a single monitoring platform which incorporates all aspects of monitoring (logging, alerting, metrics/SLAs, dashboards…) in one consolidated user cockpit? Such a tool could proactively inform you about anomalies, allow you to zoom in and out on every event occurring in the company, and resolve the operational complexities introduced by microservices.
Such a platform would provide basic features like:
· Distributed tracing, i.e. the ability to follow a request throughout its lifecycle (a minimal instrumentation sketch follows this list)
· Aggregation of log files, with the ability to quickly search inside them
· Dashboarding:
    · Providing different views (such as performance, availability and data flows), allowing users to answer questions like which queries have the slowest response times, which URLs see the most errors, which services use the most CPU, and where the availability bottlenecks are.
    · Allowing users to easily drill down from the logical application level into the physical containers (i.e. from business monitoring to application monitoring to technical/infrastructure monitoring). This can be especially challenging as containers have shorter and shorter lifetimes inside the container orchestration system.
    · On-the-fly aggregation, making it possible to aggregate information from all the containers associated with a function or a service, for example to determine the total disk usage associated with a service or to see the impact of a container’s downtime on the overall business service.
· KPIs/metrics: the platform should calculate KPIs/metrics at different levels (from business down to infrastructure level), with a focus on business-level metrics but with the ability to drill down to lower-level metrics.
· Alerting: the ability to generate alerts in case of specific errors or KPIs moving out of bounds, with the possibility to:
    · Drill down on an alert from user/business level to application level, to container level, down to infrastructure level.
    · Link runbooks to each alert.
    · Automatically derive the business impact of a lower-level alert, i.e. abstract a lower-level alert to a higher level, so that you can immediately explain the business impact of a lower-level issue (e.g. the negative impact of a disk space issue on business transactions).
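As a minimal sketch of what the tracing and KPI features above imply for a developer, the snippet below instruments a single (illustrative) request handler with an OpenTelemetry span for the distributed trace and a Prometheus histogram for a latency KPI. It assumes the opentelemetry-sdk and prometheus-client packages; a real platform would configure exporters and collection pipelines around this.

    import time
    from opentelemetry import trace
    from prometheus_client import Histogram, start_http_server

    tracer = trace.get_tracer("order-service")
    ORDER_LATENCY = Histogram("order_processing_seconds",
                              "Time spent processing an order")

    def handle_order(order_id: str) -> None:
        # One span per request: the span (and its trace ID) is what the
        # distributed-tracing backend stitches into an end-to-end view.
        with tracer.start_as_current_span("handle_order") as span:
            span.set_attribute("order.id", order_id)
            start = time.time()
            # ... business logic would go here ...
            ORDER_LATENCY.observe(time.time() - start)

    if __name__ == "__main__":
        start_http_server(8000)   # exposes /metrics for Prometheus to scrape
        handle_order("demo-123")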
Many platforms, such as Sensu (www.sensu.io) and Datadog, already provide most of these functionalities and offer predefined integrations with specialised tools such as metrics and analytics tools (e.g. Elasticsearch, Splunk…), incident tools (e.g. PagerDuty, ServiceNow, VictorOps…) or DevOps tools (e.g. Git, Ansible, Jenkins…).
But this would only be the beginning. More in-depth integrations and features would create a true operational user cockpit:
· Data security: in order to debug and reproduce an issue, you would ideally like to retain all inputs and outputs of each microservice involved. Unfortunately, these inputs and outputs often contain confidential customer data, which requires the same level of security as the operational database. Solutions should therefore be found to store these payloads in a secure (encrypted) database while being able to join them at run time with the logging information when desired. The monitoring tool should take care of this joining in the most secure way possible, i.e. enforcing security roles, auditing all data accesses, and scrambling personal data. A minimal sketch of this pattern follows below.
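The sketch below illustrates one way such a secure payload store could work, assuming symmetric encryption (Fernet from the cryptography package) and an in-memory dictionary standing in for the encrypted database; key management, role checks and audit logging are deliberately left out.

    import json
    from cryptography.fernet import Fernet

    KEY = Fernet.generate_key()            # stand-in for proper key management
    fernet = Fernet(KEY)
    payload_store = {}                     # trace_id -> encrypted request/response

    def store_payload(trace_id: str, payload: dict) -> None:
        # Encrypt before persisting, so the monitoring store never holds
        # confidential customer data in the clear.
        payload_store[trace_id] = fernet.encrypt(json.dumps(payload).encode())

    def join_with_logs(trace_id: str, log_lines: list) -> dict:
        # Decrypt only at read time, for an authorised and audited request.
        raw = payload_store.get(trace_id)
        payload = json.loads(fernet.decrypt(raw)) if raw else None
        return {"trace_id": trace_id, "logs": log_lines, "payload": payload}

    store_payload("abc123", {"customer": "Jane Doe", "order": 42})
    print(join_with_logs("abc123", ["order-service: order received"]))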
· Proactive monitoring: most monitoring is still reactive - an issue occurs on the system, someone is alerted, and further investigation is done. This should evolve towards proactive monitoring, which identifies (potentially helped by an AI model) negative trends before they cause an incident. This can be very challenging, as generating too many false positives leads to a lot of lost time. Furthermore, as systems become more and more dynamic (continuous deployments, dynamic infrastructure, etc.) and usage can be peaky, it becomes increasingly difficult to predict abnormal trends. A simple illustration follows below.
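As a deliberately simplified illustration of such trend detection (nowhere near an AI model), the sketch below flags a metric sample whose z-score against a rolling window exceeds a threshold; the window size and threshold are arbitrary and would need careful tuning to avoid the false positives mentioned above.

    from collections import deque
    from statistics import mean, stdev

    WINDOW = 60          # number of recent samples to compare against
    THRESHOLD = 3.0      # z-score above which a sample is considered anomalous
    history = deque(maxlen=WINDOW)

    def is_anomalous(sample: float) -> bool:
        # Compare the new sample against the rolling mean/stddev of recent history.
        anomalous = False
        if len(history) >= 10 and stdev(history) > 0:
            z = abs(sample - mean(history)) / stdev(history)
            anomalous = z > THRESHOLD
        history.append(sample)
        return anomalous

    # Example: a sudden latency spike after a stable period.
    for value in [100, 102, 98, 101, 99, 103, 100, 97, 102, 99, 100, 450]:
        if is_anomalous(value):
            print("possible anomaly:", value)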
· An engine to automatically trigger actions (easily configurable via a rule engine) based on certain alerts, e.g. automatically provisioning extra servers/disks in case of CPU/disk space issues, automatically shutting down services that consume too many resources (cf. the Circuit Breaker pattern) to avoid one failing service tearing down the whole system, and automatic rate limiting (throttling) in case of degrading performance due to overload. A sketch of such a rule engine follows below.
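A minimal sketch of such a rule engine, assuming alerts arrive as plain dictionaries and remediation actions are simple callables; a real platform would persist the rules and call out to the orchestrator or cloud APIs instead of printing.

    def scale_out(alert: dict) -> None:
        print("provisioning extra capacity for", alert["service"])

    def throttle(alert: dict) -> None:
        print("enabling rate limiting on", alert["service"])

    # Each rule is a (condition, action) pair; names and thresholds are illustrative.
    RULES = [
        (lambda a: a["metric"] == "cpu_pct" and a["value"] > 90, scale_out),
        (lambda a: a["metric"] == "latency_p99_ms" and a["value"] > 2000, throttle),
    ]

    def handle_alert(alert: dict) -> None:
        # Fire every remediation whose condition matches the incoming alert.
        for condition, action in RULES:
            if condition(alert):
                action(alert)

    handle_alert({"service": "order-service", "metric": "cpu_pct", "value": 97})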
· Service Level Agreement management and monitoring: as indicated above, the monitoring platform should measure all KPIs and metrics, but ideally the tool should also allow the configuration of all SLAs which the organisation (at different levels, such as business, application and infrastructure) should honour. This way breaches of SLAs can be alerted on immediately and SLA reporting can be generated easily. A small sketch of such an SLA check follows below.
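A small sketch of how configured SLAs could be evaluated against measured metrics, assuming both are plain in-memory structures; in practice the targets would live in the platform’s configuration and the measurements would come from its metrics store.

    from dataclasses import dataclass

    @dataclass
    class Sla:
        name: str        # e.g. "order API availability"
        metric: str      # which measured metric the target applies to
        target: float    # minimum acceptable value
        level: str       # business / application / infrastructure

    SLAS = [
        Sla("order API availability", "availability_pct", 99.9, "application"),
        Sla("orders settled same day", "same_day_settlement_pct", 98.0, "business"),
    ]

    measured = {"availability_pct": 99.4, "same_day_settlement_pct": 99.1}

    for sla in SLAS:
        value = measured.get(sla.metric)
        if value is not None and value < sla.target:
            print(f"SLA breach ({sla.level}): {sla.name} at {value} < {sla.target}")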
· The monitoring tool should also be able to identify fraud and security breaches (hacking attacks). Such activities typically result in abnormal patterns in certain flows, which can be automatically identified by the monitoring platform. For example, abnormal CPU activity could be a sign of cryptojacking, while large data transfers to an unknown IP address could be an indication of data theft.
· Integration with website and app analytics tools (like Google Analytics, Adobe Analytics and Google Firebase). These tools capture the key information about how the website is used, collecting data such as the number of visitors, where they come from, and which pages they visit. If this information can be automatically combined with the data available in the monitoring platform, much more detailed insights can be obtained. For example, a website analytics tool might indicate that users drop out of a sales funnel at step 3, while the integration with the monitoring tool could help analyse the cause of (part of) the drop-out, such as performance or availability issues.
· Integration with chaos engineering tools (e.g. the Netflix Simian Army or Gremlin): more and more companies are introducing chaos engineering, or resilience testing, directly into the production environment. These deliberately created incidents, to which the system should be able to respond automatically, should be identified and filtered out by the monitoring system, to avoid investigating irrelevant issues. However, if the system did not respond correctly to an introduced fault, this should still be reported.
· Integration with release management tools and CI/CD pipelines (e.g. GitLab, Jenkins, CircleCI, TeamCity, Bamboo, GoCD…): a deep integration of the monitoring platform with these tools can be incredibly useful to:
    · Identify (and filter out) any issues linked to a deployment with known availability impacts.
    · Drill down on a monitoring issue to see all related deployments and even the related source code commits.
    · Automatically roll back a deployment in case of regressions identified in the KPIs/metrics (see the sketch after this item).
    · Automatically roll forward in a gradual (canary) deployment, i.e. increase the target audience of the new version.
    · Easily compare all metrics in an A/B deployment.
    · View historical graphs of usage, availability, performance, etc., and link them with the relevant version changes. This enables users to pinpoint gradual degradations to a past code change.
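A minimal sketch of the automatic-rollback idea, assuming the error rates of the current and previous versions can be read from the metrics store and that the deployment tool exposes a rollback call; both helper functions are hypothetical placeholders.

    MAX_REGRESSION = 2.0   # new version may be at most 2x worse than the old one

    def read_error_rate(service: str, version: str) -> float:
        # Placeholder values standing in for a query against the metrics store.
        return {"v1.4": 0.2, "v1.5": 1.1}.get(version, 0.0)

    def trigger_rollback(service: str, to_version: str) -> None:
        # Placeholder for a call to the CI/CD or deployment tool's API.
        print(f"rolling {service} back to {to_version}")

    def check_canary(service: str, new_version: str, old_version: str) -> None:
        # Roll back automatically if the new version's error rate regresses too far.
        new_rate = read_error_rate(service, new_version)
        old_rate = read_error_rate(service, old_version)
        if old_rate > 0 and new_rate / old_rate > MAX_REGRESSION:
            trigger_rollback(service, to_version=old_version)

    check_canary("order-service", new_version="v1.5", old_version="v1.4")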
· Integration with a feature flag tool (e.g. Rollout, LaunchDarkly, Optimizely…), which allows users to activate or deactivate features for different segments. Just like the integration with release management tools and CI/CD pipelines, this integration could automatically roll back or roll forward the activation of new features based on the monitoring metrics.
· Integration with a defect management system (e.g. JIRA, ALM…), to automatically create defects (with the relevant logging and monitoring information attached automatically) and close them based on the observations in the monitoring tool. A sketch of such automatic defect creation follows below.
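As a sketch of what automatic defect creation could look like, the snippet below posts an issue to Jira’s REST API with the trace ID and a log excerpt embedded in the description; the host, project key and credentials are placeholders.

    import requests

    JIRA = "https://jira.example.internal"      # placeholder host
    AUTH = ("monitoring-bot", "api-token")      # placeholder credentials

    def create_defect(summary: str, trace_id: str, log_excerpt: str) -> str:
        # Create a Jira issue carrying the monitoring context for the defect.
        issue = {
            "fields": {
                "project": {"key": "OPS"},      # placeholder project key
                "issuetype": {"name": "Bug"},
                "summary": summary,
                "description": f"Trace ID: {trace_id}\n\nLogs:\n{log_excerpt}",
            }
        }
        resp = requests.post(f"{JIRA}/rest/api/2/issue", json=issue,
                             auth=AUTH, timeout=10)
        resp.raise_for_status()
        return resp.json()["key"]               # e.g. "OPS-1234"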
· Integration with chat tools and chatbots, allowing not only easy communication of alerts, but also easy retrieval of monitoring information and investigation of incidents by chatting with the chatbot.
· Integration with crash reporting tools, which collect all information on the end-user side (i.e. device, browser and application information) in case of a crash. This information should also be stored in the monitoring platform, so that it can be linked to a trace and to all the other features described above (like the defect management system and the chat integration).
· Easy exposure of business metrics via APIs or widgets (with possibilities for basic look-and-feel customisation), so that monitoring dashboards can be fully integrated into an end-user application.
· Automatic generation of workflow diagrams and documentation, based on the actual execution of processes. This gives users up-to-date documentation of all processes, based on actual flows, rather than a process analyst having to manually update the documentation of how a system is supposed to work (which might differ from its actual behaviour).
Most monitoring players, such as Sensu (www.sensu.io) and Datadog, but also players in the DevOps space (e.g. GitLab), have already understood these needs, and heavy investments are being made in this space. To my knowledge, however, no player in the market yet offers all of these functionalities. Some players provide multiple pre-built integrations with different specialised tools, but as there is often an overlap in functionality and only an integration of data (not of the front-end layer), this does not yet give the ideal user experience of a fully integrated cockpit.
I, for one, am looking forward to a bright future in which being on call will no longer be such a burden. The monitoring platform will not only automatically resolve most standard issues, but when a manual intervention is required, you will have all the information at your fingertips. Having to wake up at 4 o’clock on a Saturday night will never be a dream, but let’s at least no longer make it a nightmare.