The Observable Edition

Jul 09, 2024

Welcome to the latest edition of AWSCQ, and as you may have already guessed, this one is going to be all about observability.

This edition is written by Mark Faiers, with contributions from Muhammed Kaab, RiverSafe’s resident expert on all things observability.

Mark is a former AWS APN Ambassador, re:Invent speaker, and haver of opinions who has worked on strategic Cloud and DevOps projects with many of the biggest household names in the U.K. and beyond. He is currently heading up the DevOps practice at RiverSafe, a growing consultancy who specialise in DevOps, AppSec, and Cyber.

Observability is a critical aspect of understanding and managing modern software systems, particularly in the context of the cloud-native ecosystem. It involves the ability to infer internal states of your systems based on the external outputs. Observability hinges on three key types of telemetry data: logs, metrics, and traces. Logs are detailed, timestamped records of events that occur within a system, capturing everything from errors to user actions, useful for post-mortem analysis. Metrics are numerical values measuring system performance over time, such as CPU usage or request latency, and are essential for monitoring and alerting on trends and patterns. Traces provide an end-to-end view of requests as they travel through various services, helping to identify performance bottlenecks and diagnose latency issues.
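To make the three signal types a little more concrete, here is a minimal sketch in plain Python. It assumes nothing beyond the standard library, and every name in it (log, record_metric, span, handle_checkout) is made up for illustration rather than taken from any real observability SDK. It simply shows how one request can produce a log, a couple of metrics, and a small trace that are tied together by a shared trace ID.

```python
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-ins for a real telemetry backend (illustrative only).
logs = []                    # timestamped event records
metrics = defaultdict(list)  # metric name -> list of numeric samples
spans = []                   # finished trace spans


def log(level, message, **fields):
    """Log: a detailed, timestamped record of a single event."""
    logs.append({"ts": time.time(), "level": level, "message": message, **fields})


def record_metric(name, value):
    """Metric: a numeric sample that can be aggregated over time."""
    metrics[name].append(value)


@contextmanager
def span(name, trace_id, parent_id=None):
    """Trace span: one timed step of a request, linked to its parent step."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        spans.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def handle_checkout(order_id):
    """Hypothetical request handler that emits all three signal types."""
    trace_id = uuid.uuid4().hex
    with span("checkout", trace_id) as root_id:
        log("INFO", "checkout started", order_id=order_id, trace_id=trace_id)
        with span("charge_card", trace_id, parent_id=root_id):
            time.sleep(0.01)  # pretend to call a payment service
        record_metric("checkout.requests", 1)
    # The root span is the last one to finish, so it is last in the list.
    record_metric("checkout.latency_ms", spans[-1]["duration_ms"])
    return trace_id
```

Calling handle_checkout("order-42") leaves one log entry, two metric samples, and two spans behind. Because the log carries the same trace_id as the spans, you can jump from an error message to the exact request path that produced it, which is the kind of correlation real tooling gives you at scale.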

By leveraging these three components, organizations gain a comprehensive view of their system's health and performance. This integrated approach allows for quick identification and resolution of issues, ensuring optimal reliability and efficiency. For a more in-depth discussion of what Observability is, you can also refer to this excellent Splunk article – What is Observability?

Due to the explosion in the use of technology, the emergence of cloud, and the adoption of microservice architectures, the world of Observability has, over the last 20 years, gone from after-the-fact reports and nice-to-have dashboards that no one really uses to tools and systems that stream, analyse, visualise, and react to gigabytes or terabytes of data in real time or near real time. In other words, Observability tools have become vastly more useful, and they have done so out of necessity.

So that covers the ‘Why?’ of observability, but how about the ‘What?’, or more precisely ‘What’s new?’

AWS Updates

AWS has made a number of interesting releases in the Observability space recently.

Firstly, they have provided a managed solution for gaining greater insight into your EKS clusters. It uses Amazon Managed Grafana (AMG) and Amazon Managed Service for Prometheus to help you better understand the health of, and issues with, your clusters, containers, workloads, and control plane. The solution is deployable as a CDK application and provides a number of dashboards that give you a visual overview of anything and everything in your EKS clusters. Container observability. Nice!

Next is the Alarm Context Tool, which uses GenAI (fancy that!), in the form of LLMs available through Amazon Bedrock, to analyse data that is already being produced by other AWS services, such as X-Ray and CloudWatch. It provides summaries of issues, helps with root cause analysis, and aids in troubleshooting problems. I definitely could have saved a lot of time over the years with such a tool! Check out the intro blog here.

And finally, and perhaps most boringly (I kid, version support is important!), AMG now supports Grafana version 10.4, which includes a number of usability updates and improvements that you can read about here.

Other Updates

Speaking of Grafana, they have recently been very focussed on application observability and, in that vein, have introduced new features to their Application Observability offering. These include time period comparison, which lets you visualise two different time periods overlaid on the same graph; automatic baselining, which helps identify deviations from normal application behaviour; and more in-context navigation, which allows richer exploration when investigating issues. Read all about it here (and bonus points for the great short video demos!).
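If the idea of baselining is new to you, the sketch below shows the general principle in a few lines of Python. To be clear, this is not Grafana's algorithm, which isn't described here. It is an assumed, deliberately simple approach (a z-score against a reference period), and the function name find_deviations and its threshold parameter are hypothetical.

```python
from statistics import mean, stdev


def find_deviations(baseline, current, threshold=3.0):
    """Return the indexes of values in `current` that sit more than
    `threshold` standard deviations away from the mean of `baseline`.

    `baseline` is a list of samples from a period of normal behaviour,
    for example request latencies in milliseconds, and needs at least
    two samples for a standard deviation to be computed.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything different is a deviation.
        return [i for i, value in enumerate(current) if value != mu]
    return [
        i for i, value in enumerate(current)
        if abs(value - mu) / sigma > threshold
    ]
```

For example, with a baseline of [100, 102, 98, 101, 99, 100] and current readings of [101, 99, 140, 100], only the third reading (index 2) is flagged. Real products will typically account for seasonality and trends as well, so treat this purely as an illustration of the concept.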

Microsoft recently held their annual developer conference, ‘Build’, at which there were a few Observability updates, including:

Azure Monitor capabilities now available at the edge. Azure have an ambition to be “The Most Observable Cloud” by rearchitecting around OpenTelemetry (OTEL), and now it seems they want to be the most observable edge too (doesn’t quite have the same ring to it, does it?).
The ability for Copilot to analyse Azure Monitor logs. This one is kind of a no-brainer. Copilot will get greater adoption through having more integrations, and logs are basically masses and masses of data that can be difficult to sort and search through without familiarity with a particular product or specialised SQL-like language.
Improved AKS observability using Azure Monitor managed service for Prometheus in Public Preview. Again, anything that helps with Container/Kubernetes observability is very welcome.

Check out all of those updates and more here.

In M&A news, Cloudflare have gotten themselves into the Observability game with their acquisition of Baselime, a cloud observability platform. As have SUSE, who have acquired StackState.

And finally, Honeycomb.io are looking at that often forgotten but nonetheless important part of the application jigsaw, the front end. They have announced an early access program for teams to get insights into how they can best improve UX. Read all about it here.

GenAI

What is a blog, newsletter, conference, or conversation with a family member (just me?) in 2024 if Generative AI doesn’t come up somewhere!?

Now don’t get me wrong, there is a whole lot of hype, with everyone suddenly becoming an expert ‘Prompt Engineer’ and organisations scrambling to build ‘GenAI’ into their offerings somehow! Anyhow!! But underneath that there is a useful tool with some interesting use cases:

Splunk have built a GenAI assistant that helps you write SPL (Search Processing Language) using natural language, which is nice because SPL can be a daunting challenge for those new to Splunk.
Datadog have ‘Bits AI’, a chat-style interface that can be used to get insights into applications and issues, and that can help write queries to get you exactly the data you are looking for.
And Grafana have Auto-Incident Summary which, well, generates a summary of the causes and resolution of an incident. Very handy for when your boss asks you for a report!

These are all, of course, implementations of GenAI to improve observability. But what about the observability of GenAI itself? This might be the most interesting and least discussed topic of all.

In the context of GenAI, observability plays a crucial role in understanding the behaviour and performance of AI models and systems. It can provide insights into the training, validation, and inference stages of model development, thereby helping detect and diagnose issues early. The problem is that organisations want to build GenAI into their products and offerings without truly understanding it in a lot of cases. Of course, observability is only one piece of the puzzle here but is essential if you want to ensure consistency and high quality.

There are now specialised tools and frameworks cropping up in the space, such as OpenLIT. And the nice folk at OpenTelemetry have recently been looking at the how, what, and why of this tool and what good looks like in LLM Observability. Check it out here. And also see what Datadog is doing in the space here.

OpenTelemetry and its Evolution

We’ve discussed OpenTelemetry (OTEL) quite a bit in the newsletter, so I thought it would be good to go into a bit of detail, for those who aren’t familiar, about what it is and where it came from.

OpenTelemetry, or OTEL for short, provides a standardized way to collect, process, and export telemetry data, including traces, metrics, and logs. Its standardized data format and APIs simplify the integration of observability tools and promote interoperability between them, enabling organizations to have a single observability stack and therefore gain better insights into their applications, while reducing context switching.

The OTEL standards came into existence as a result of the merger between two other observability projects: OpenTracing, a CNCF project, and OpenCensus, which originated at Google. These projects had the same goal of simplifying observability in modern software systems and reducing the fragmentation in the observability landscape. OpenTelemetry combined the best features of both projects and created a unified observability framework for cloud-native software, adding new features where they didn’t exist in either project.

OTEL now provides APIs, libraries, and agents to generate and collect telemetry data including metrics, logs, and traces. It has flexibility and extensibility built in and, with a growing ecosystem of exporters and receivers, integrates with a wide range of observability platforms, allowing organizations to choose the tools that best fit their needs. This level of interoperability reduces vendor lock-in and fosters innovation within the observability landscape. OTEL has, over time, become the standard, with the likes of AWS creating their own distribution to enable easy integration with their own Observability stack. And Azure, as we’ve already seen, have re-architected their main monitoring solution, Monitor, around OTEL.
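The receive, process, and export idea is easier to picture with a toy example. The sketch below is plain Python and is not the OpenTelemetry API. The class and function names (Span, Pipeline, InMemoryExporter, ConsoleExporter, drop_health_checks, tag_environment) are all invented for illustration. What it tries to show is why a common data shape plus pluggable exporters reduces lock-in: changing backend means swapping one object, not re-instrumenting your application.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Span:
    """A simplified, vendor-neutral telemetry record (illustrative only)."""
    name: str
    duration_ms: float
    attributes: Dict[str, str] = field(default_factory=dict)


class InMemoryExporter:
    """Pretend backend A: keeps everything it is sent in a list."""
    def __init__(self) -> None:
        self.received: List[Span] = []

    def export(self, spans: List[Span]) -> None:
        self.received.extend(spans)


class ConsoleExporter:
    """Pretend backend B: prints each span instead of storing it."""
    def export(self, spans: List[Span]) -> None:
        for s in spans:
            print(f"{s.name} took {s.duration_ms:.1f} ms {s.attributes}")


# A processor takes a span and returns it (possibly modified), or None to drop it.
Processor = Callable[[Span], Optional[Span]]


def drop_health_checks(span: Span) -> Optional[Span]:
    """Filter out noisy health-check requests."""
    return None if span.name == "GET /health" else span


def tag_environment(span: Span) -> Optional[Span]:
    """Enrich every span with a (hypothetical) environment attribute."""
    span.attributes["environment"] = "demo"
    return span


class Pipeline:
    """Receives spans, runs them through processors, fans out to exporters."""
    def __init__(self, processors: List[Processor], exporters: list) -> None:
        self.processors = processors
        self.exporters = exporters

    def receive(self, spans: List[Span]) -> None:
        processed: List[Span] = []
        for span in spans:
            current: Optional[Span] = span
            for process in self.processors:
                current = process(current)
                if current is None:
                    break  # a processor dropped this span
            if current is not None:
                processed.append(current)
        for exporter in self.exporters:
            exporter.export(processed)
```

As a usage example, Pipeline([drop_health_checks, tag_environment], [InMemoryExporter()]).receive([...]) would filter and enrich the spans before storing them, and replacing InMemoryExporter() with ConsoleExporter(), or listing both, changes where the data ends up without touching the code that produced it.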

If you want to find out more about OTEL and what is coming next, you can find their roadmap here.

And that’s a wrap, come back next time to hear about all the latest goings-on in the AWS world and beyond. Stay magical!

A huge thanks to Mark for this fantastic edition of AWSCQ.

We’ll have another issue and guest editor in the coming weeks.

Before we go though…

AWS Community Summit is back in Manchester this September.

We’ve another packed agenda across three tracks which you can check out now.

Early bird tickets end on 20th July so hit the button below and see what we have in store.

COMSUM 2024