Many organisations are moving from a monolithic architecture to a microservice-based architecture. Although this transformation brings many benefits with regards to organisation, flexibility (due to isolation) and development speed, it also comes with significant challenges. Most of those challenges are linked to the increased orchestration complexity introduced by microservices, which makes monitoring and observability more difficult. The fact that the microservices are developed in different technologies and deployed on an infrastructure of multiple servers (often dynamically allocated by the container management system to achieve elastic scalability and high availability) does not help either, and upcoming evolutions like Functions as a Service will only make it harder.
Combine this with the fact that software development has become more agile and more "try-and-adjust" oriented, rather than following the first-time-right philosophy of a few years ago, and monitoring every aspect of your system becomes more crucial than ever. In the spirit of Peter Drucker’s teaching "If you can’t measure it, you can’t improve it," measuring and monitoring have become the cornerstone of modern product management.
Unfortunately, in many organisations monitoring is still very immature. Managers are often astounded by the fact that a simple end-user question like "Why was my order not executed?" takes hours to analyse and often results in only a guess at the explanation. This leads to a lot of frustration for all parties involved and prevents the organisation from becoming a fault-tolerant one, i.e. one that continuously improves itself (when you cannot find the root cause of an issue, it is impossible to resolve it).
Therefore, it comes as no surprise that people are looking for tools which:
· Reduce (and resolve) the monitoring and observability challenges
· Offer visibility into, and understanding of, the holistic application architecture
· Provide actionable output in case of issues
Luckily, technologies that provide this end-to-end view of each business process are continuously improving, but there is still a long way to go before arriving at a solution that allows organisations to easily analyse and reproduce any issue. The introduction of a technology like distributed tracing has been a huge step forward, but many problems remain, such as:
· If the end-user experiencing the issue can provide the unique trace ID, it is easy to analyse the trace (see the sketch after this list), but if not, more complex searches on user ID or timestamp are required, which are far from straightforward:
    · Searching on user ID used to be common, but today the concept of a user is becoming less and less clear, as applications are multi-channel, used by many partners through APIs, and connected directly to IoT devices.
    · Searching on timestamp is also complex due to the large number of service calls, which means that even a time span of a few minutes can already include thousands of requests to search through.
· When the cause of an issue has been found, the issue should be reproduced in a test environment in order to fix it. With a monolithic application, a scrambled production back-up (with all personal and sensitive data removed) is often restored in the test environment, after which the issue can easily be investigated, reproduced and retested (once the fix is implemented). With a microservice-based architecture consisting of many loose components, each with its own database (schema), this is a completely different story.
· The multitude of different tools, often open source (Kubernetes, Istio, Prometheus, Kibana, Grafana, Elasticsearch, Zipkin…) but also vendor-specific (Capterra lists 132 vendors in the space of Application Performance Monitoring alone), each providing a part of the monitoring puzzle (one tool for monitoring, another for visualization, and yet another for metrics storage), makes it very hard to get an end-to-end view. For example, service meshes like Istio can provide a common solution for monitoring and tracing all incoming and outgoing traffic, but in order to get details from inside a microservice, developers still need to instrument their code directly.
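To make the trace-ID point above concrete, here is a minimal sketch of looking up a single trace by its ID against a Zipkin-compatible backend, contrasted with the much broader query you are forced into when only a rough timestamp is known. The paths follow Zipkin’s v2 HTTP API; the host name and service name are placeholders.

    import requests

    ZIPKIN = "http://zipkin.example.internal:9411"  # placeholder host

    def trace_by_id(trace_id: str) -> list:
        # Cheap and precise: one trace, identified by the ID the user reported.
        resp = requests.get(f"{ZIPKIN}/api/v2/trace/{trace_id}", timeout=5)
        resp.raise_for_status()
        return resp.json()

    def traces_around(service: str, end_ts_ms: int, window_ms: int = 5 * 60 * 1000) -> list:
        # Expensive and fuzzy: every trace of a service in a time window,
        # which on a busy system can easily mean thousands of candidates.
        resp = requests.get(
            f"{ZIPKIN}/api/v2/traces",
            params={"serviceName": service, "endTs": end_ts_ms,
                    "lookback": window_ms, "limit": 1000},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()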
In short, we can conclude that the disaggregation introduced by microservices to optimize development speed needs to be aggregated again in order to operate efficiently.
Ultimately, wouldn’t you dream of a single monitoring platform which incorporates all aspects of monitoring (logging, alerting, metrics/SLAs, dashboards…) in one consolidated user cockpit? Such a tool could proactively inform you about anomalies, allow you to zoom in and out on every event occurring in the company, and resolve the operational complexities introduced by microservices.
Such a platform would provide basic features like:
· Distributed tracing, i.e. the ability to follow a request throughout its lifecycle (a minimal instrumentation sketch follows this list)
· Aggregation of log files, with the ability to quickly search inside them
· Dashboarding:
    · Providing different views (such as performance, availability and data flows), allowing users to answer questions like which queries have the slowest response times, which URLs see the most errors, which services use the most CPU, and where the availability bottlenecks are.
    · Allowing users to easily drill down from the logical application level into the physical containers (i.e. from business monitoring to application monitoring to technical/infrastructure monitoring). This can be especially challenging as containers have shorter and shorter lifetimes inside the container orchestration system.
    · On-the-fly aggregation, making it possible to aggregate information from all the containers associated with a function or a service, for example to determine the total disk usage associated with a service or to see the impact of a container’s downtime on the overall business service.
· KPIs/metrics: the platform should calculate KPIs/metrics at different levels (from business down to infrastructure level), with a focus on business-level metrics but with the ability to drill down to lower-level metrics.
· Alerting: the ability to generate alerts in case of specific errors or KPIs moving out of bounds, with the possibility to:
    · Drill down on an alert from user/business level to application level, to container level, down to infrastructure level.
    · Link runbooks to each alert.
    · Automatically derive the business impact of a lower-level alert, i.e. abstract a lower-level alert to a higher level, so that you can immediately explain the business impact of a lower-level issue (e.g. the negative impact of a disk space issue on business transactions).
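As a minimal sketch of what the tracing and KPI features above imply for a developer, the snippet below instruments a single (illustrative) request handler with an OpenTelemetry span for the distributed trace and a Prometheus histogram for a latency KPI. It assumes the opentelemetry-sdk and prometheus-client packages; a real platform would configure exporters and collection pipelines around this.

    import time
    from opentelemetry import trace
    from prometheus_client import Histogram, start_http_server

    tracer = trace.get_tracer("order-service")
    ORDER_LATENCY = Histogram("order_processing_seconds",
                              "Time spent processing an order")

    def handle_order(order_id: str) -> None:
        # One span per request: the span (and its trace ID) is what the
        # distributed-tracing backend stitches into an end-to-end view.
        with tracer.start_as_current_span("handle_order") as span:
            span.set_attribute("order.id", order_id)
            start = time.time()
            # ... business logic would go here ...
            ORDER_LATENCY.observe(time.time() - start)

    if __name__ == "__main__":
        start_http_server(8000)   # exposes /metrics for Prometheus to scrape
        handle_order("demo-123")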
Many platforms, such as Sensu (www.sensu.io) and Datadog, already provide most of these functionalities and offer predefined integrations with specialised tools such as metrics and analytics tools (e.g. Elasticsearch, Splunk…), incident tools (e.g. PagerDuty, ServiceNow, VictorOps…) or DevOps tools (e.g. Git, Ansible, Jenkins…).
But this would only be the beginning. More in-depth integrations and features would create a true operational user cockpit:
· Data security: in order to debug and reproduce an issue, you would ideally like to retain all inputs and outputs of each microservice involved. Unfortunately, these inputs and outputs often contain confidential customer data, which requires the same level of security as the operational database. Solutions should therefore be found to store these payloads in a secure (encrypted) database while being able to join them at run time with the logging information when desired. The monitoring tool should take care of this joining in the most secure way possible, i.e. enforcing security roles, auditing all data accesses, and scrambling personal data. A minimal sketch of this pattern follows below.
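The sketch below illustrates one way such a secure payload store could work, assuming symmetric encryption (Fernet from the cryptography package) and an in-memory dictionary standing in for the encrypted database; key management, role checks and audit logging are deliberately left out.

    import json
    from cryptography.fernet import Fernet

    KEY = Fernet.generate_key()            # stand-in for proper key management
    fernet = Fernet(KEY)
    payload_store = {}                     # trace_id -> encrypted request/response

    def store_payload(trace_id: str, payload: dict) -> None:
        # Encrypt before persisting, so the monitoring store never holds
        # confidential customer data in the clear.
        payload_store[trace_id] = fernet.encrypt(json.dumps(payload).encode())

    def join_with_logs(trace_id: str, log_lines: list) -> dict:
        # Decrypt only at read time, for an authorised and audited request.
        raw = payload_store.get(trace_id)
        payload = json.loads(fernet.decrypt(raw)) if raw else None
        return {"trace_id": trace_id, "logs": log_lines, "payload": payload}

    store_payload("abc123", {"customer": "Jane Doe", "order": 42})
    print(join_with_logs("abc123", ["order-service: order received"]))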
· Proactive monitoring: most monitoring is still reactive - an issue occurs on the system, someone is alerted, and further investigation is done. This should evolve towards proactive monitoring, which identifies (potentially helped by an AI model) negative trends before they cause an incident. This can be very challenging, as generating too many false positives leads to a lot of lost time. Furthermore, as systems become more and more dynamic (continuous deployments, dynamic infrastructure, etc.) and usage can be peaky, it becomes increasingly difficult to predict abnormal trends. A simple illustration follows below.
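As a deliberately simplified illustration of such trend detection (nowhere near an AI model), the sketch below flags a metric sample whose z-score against a rolling window exceeds a threshold; the window size and threshold are arbitrary and would need careful tuning to avoid the false positives mentioned above.

    from collections import deque
    from statistics import mean, stdev

    WINDOW = 60          # number of recent samples to compare against
    THRESHOLD = 3.0      # z-score above which a sample is considered anomalous
    history = deque(maxlen=WINDOW)

    def is_anomalous(sample: float) -> bool:
        # Compare the new sample against the rolling mean/stddev of recent history.
        anomalous = False
        if len(history) >= 10 and stdev(history) > 0:
            z = abs(sample - mean(history)) / stdev(history)
            anomalous = z > THRESHOLD
        history.append(sample)
        return anomalous

    # Example: a sudden latency spike after a stable period.
    for value in [100, 102, 98, 101, 99, 103, 100, 97, 102, 99, 100, 450]:
        if is_anomalous(value):
            print("possible anomaly:", value)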
· An engine to automatically trigger actions (easily configurable via a rule engine) based on certain alerts, e.g. automatically provisioning extra servers/disks in case of CPU/disk space issues, automatically shutting down services that consume too many resources (cf. the Circuit Breaker pattern) to avoid one failing service tearing down the whole system, and automatic rate limiting (throttling) in case of degrading performance due to overload. A sketch of such a rule engine follows below.
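A minimal sketch of such a rule engine, assuming alerts arrive as plain dictionaries and remediation actions are simple callables; a real platform would persist the rules and call out to the orchestrator or cloud APIs instead of printing.

    def scale_out(alert: dict) -> None:
        print("provisioning extra capacity for", alert["service"])

    def throttle(alert: dict) -> None:
        print("enabling rate limiting on", alert["service"])

    # Each rule is a (condition, action) pair; names and thresholds are illustrative.
    RULES = [
        (lambda a: a["metric"] == "cpu_pct" and a["value"] > 90, scale_out),
        (lambda a: a["metric"] == "latency_p99_ms" and a["value"] > 2000, throttle),
    ]

    def handle_alert(alert: dict) -> None:
        # Fire every remediation whose condition matches the incoming alert.
        for condition, action in RULES:
            if condition(alert):
                action(alert)

    handle_alert({"service": "order-service", "metric": "cpu_pct", "value": 97})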
· Service Level Agreement management and monitoring: as indicated above, the monitoring platform should measure all KPIs and metrics, but ideally the tool should also allow the configuration of all SLAs which the organisation (at different levels, such as business, application and infrastructure) should honour. This way breaches of SLAs can be alerted on immediately and SLA reporting can be generated easily. A small sketch of such an SLA check follows below.
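A small sketch of how configured SLAs could be evaluated against measured metrics, assuming both are plain in-memory structures; in practice the targets would live in the platform’s configuration and the measurements would come from its metrics store.

    from dataclasses import dataclass

    @dataclass
    class Sla:
        name: str        # e.g. "order API availability"
        metric: str      # which measured metric the target applies to
        target: float    # minimum acceptable value
        level: str       # business / application / infrastructure

    SLAS = [
        Sla("order API availability", "availability_pct", 99.9, "application"),
        Sla("orders settled same day", "same_day_settlement_pct", 98.0, "business"),
    ]

    measured = {"availability_pct": 99.4, "same_day_settlement_pct": 99.1}

    for sla in SLAS:
        value = measured.get(sla.metric)
        if value is not None and value < sla.target:
            print(f"SLA breach ({sla.level}): {sla.name} at {value} < {sla.target}")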
· The monitoring tool should also be able to identify fraud and security breaches (hacking attacks). Such activities typically result in abnormal patterns in certain flows, which can be automatically identified by the monitoring platform. For example, abnormal CPU activity could be a sign of cryptojacking, while large data transfers to an unknown IP address could be an indication of data theft.
· Integration with website and app analytics tools (like Google Analytics, Adobe Analytics and Google Firebase). These tools capture the key information about how the website is used, collecting data such as the number of visitors, where they come from, and which pages they visit. If this information can be automatically combined with the data available in the monitoring platform, much more detailed insights can be obtained. For example, a website analytics tool might indicate that users drop out of a sales funnel at step 3, while the integration with the monitoring tool could help analyse the cause of (part of) the drop-out, such as performance or availability issues.
· Integration with chaos engineering tools (e.g. the Netflix Simian Army or Gremlin): more and more companies are introducing chaos engineering, or resilience testing, directly into the production environment. These deliberately created incidents, to which the system should be able to respond automatically, should be identified and filtered out by the monitoring system, to avoid investigating irrelevant issues. However, if the system did not respond correctly to an introduced fault, this should still be reported.
· Integration with release management tools and CI/CD pipelines (e.g. GitLab, Jenkins, CircleCI, TeamCity, Bamboo, GoCD…): a deep integration of the monitoring platform with these tools can be incredibly useful to:
    · Identify (and filter out) any issues linked to a deployment with known availability impacts.
    · Drill down on a monitoring issue to see all related deployments and even the related source code commits.
    · Automatically roll back a deployment in case of regressions identified in the KPIs/metrics (see the sketch after this item).
    · Automatically roll forward in a gradual (canary) deployment, i.e. increase the target audience of the new version.
    · Easily compare all metrics in an A/B deployment.
    · View historical graphs of usage, availability, performance, etc., and link them with the relevant version changes. This enables users to pinpoint gradual degradations to a past code change.
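A minimal sketch of the automatic-rollback idea, assuming the error rates of the current and previous versions can be read from the metrics store and that the deployment tool exposes a rollback call; both helper functions are hypothetical placeholders.

    MAX_REGRESSION = 2.0   # new version may be at most 2x worse than the old one

    def read_error_rate(service: str, version: str) -> float:
        # Placeholder values standing in for a query against the metrics store.
        return {"v1.4": 0.2, "v1.5": 1.1}.get(version, 0.0)

    def trigger_rollback(service: str, to_version: str) -> None:
        # Placeholder for a call to the CI/CD or deployment tool's API.
        print(f"rolling {service} back to {to_version}")

    def check_canary(service: str, new_version: str, old_version: str) -> None:
        # Roll back automatically if the new version's error rate regresses too far.
        new_rate = read_error_rate(service, new_version)
        old_rate = read_error_rate(service, old_version)
        if old_rate > 0 and new_rate / old_rate > MAX_REGRESSION:
            trigger_rollback(service, to_version=old_version)

    check_canary("order-service", new_version="v1.5", old_version="v1.4")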
· Integration with a feature flag tool (e.g. Rollout, LaunchDarkly, Optimizely…), which allows users to activate or deactivate features for different segments. Just like the integration with release management tools and CI/CD pipelines, this integration could automatically roll back or roll forward the activation of new features based on the monitoring metrics.
· Integration with a defect management system (e.g. JIRA, ALM…), to automatically create defects (with the relevant logging and monitoring information attached automatically) and close them based on the observations in the monitoring tool. A sketch of such automatic defect creation follows below.
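As a sketch of what automatic defect creation could look like, the snippet below posts an issue to Jira’s REST API with the trace ID and a log excerpt embedded in the description; the host, project key and credentials are placeholders.

    import requests

    JIRA = "https://jira.example.internal"      # placeholder host
    AUTH = ("monitoring-bot", "api-token")      # placeholder credentials

    def create_defect(summary: str, trace_id: str, log_excerpt: str) -> str:
        # Create a Jira issue carrying the monitoring context for the defect.
        issue = {
            "fields": {
                "project": {"key": "OPS"},      # placeholder project key
                "issuetype": {"name": "Bug"},
                "summary": summary,
                "description": f"Trace ID: {trace_id}\n\nLogs:\n{log_excerpt}",
            }
        }
        resp = requests.post(f"{JIRA}/rest/api/2/issue", json=issue,
                             auth=AUTH, timeout=10)
        resp.raise_for_status()
        return resp.json()["key"]               # e.g. "OPS-1234"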
· Integration with chat tools and chatbots, allowing not only easy communication of alerts, but also easy retrieval of monitoring information and investigation of incidents by chatting with the chatbot.
· Integration with crash reporting tools, which collect all information on the end-user side (i.e. device, browser and application information) in case of a crash. This information should also be stored in the monitoring platform, so that it can be linked to a trace and to all the other features described above (like the defect management system and the chat integration).
· Easy exposure of business metrics via APIs or widgets (with possibilities for basic look-and-feel customisation), so that monitoring dashboards can be fully integrated into an end-user application.
· Automatic generation of workflow diagrams and documentation, based on the actual execution of processes. This gives users up-to-date documentation of all processes, based on actual flows, rather than a process analyst having to manually update the documentation of how a system is supposed to work (which might differ from its actual behaviour).
Most monitoring players, such as Sensu (www.sensu.io) and Datadog, but also players in the DevOps space (e.g. GitLab), have already understood these needs, and heavy investments are being made in this space. To my knowledge, however, no player in the market yet offers all of these functionalities. Some players provide multiple pre-built integrations with different specialised tools, but as there is often an overlap in functionality and only an integration of data (not of the front-end layer), this does not yet give the ideal user experience of a fully integrated cockpit.
I, for one, am looking forward to a bright future in which being on call will no longer be such a burden. The monitoring platform will not only automatically resolve most standard issues, but when a manual intervention is required, you will have all the information at your fingertips. Having to wake up at 4 o’clock on a Saturday night will never be a dream, but let’s at least no longer make it a nightmare.