Observability refers to the ability to measure the internal state of a system based on the data it generates, typically by capturing metrics, logs and traces. While it is well established among engineering teams for measuring their infrastructure & services, a new category has recently emerged within it due to the popularisation of LLMs. This category focuses on tools and packages that help measure the cost, quality & security of LLM usage within organisations.
Fun Fact: As per the State of CI/CD Report 2024 by the CD Foundation, AI-assisted coding tools are the 5th most popular category of tools adopted by engineers, on par with IDEs!
User experiences are becoming more intuitive & personalised as LLM usage picks up pace across companies. Large corporations like Google and Meta were already offering these experiences in several of their consumer & business products, but with ChatGPT’s launch in 2022, many more companies now have the tools to build them. Since then, the number of available LLMs, both closed and open source, has increased exponentially.
While the architecture around LLM usage still needs the conventional observability setup, the LLM is usually a separately deployed entity outside your code, accessed in a prompt + response model, and that creates a need for qualitative observability. Every response needs to be checked for hygiene and relevance before it is used. That means setting up logging of your LLM prompts & responses and then running contextual analysis on top of them to fulfil your monitoring purpose.
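At its simplest, this can be a thin wrapper around your LLM calls that records what goes in and out. Below is a minimal sketch assuming the official OpenAI Python client; the model name, logger and log fields are illustrative and not tied to any specific tool.

```python
# Minimal hand-rolled prompt/response logging (illustrative, not from any tool).
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
import json
import logging
import time

from openai import OpenAI

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")
client = OpenAI()

def ask(prompt: str) -> str:
    start = time.time()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    # Log prompt, response, token usage and latency so they can later be
    # analysed for cost, quality and security.
    log.info(json.dumps({
        "prompt": prompt,
        "response": answer,
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,
        "latency_s": round(time.time() - start, 3),
    }))
    return answer
```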
Typically, the instrumentation for LLM Observability is done in one of 2 ways:
1. SDK / native instrumentation - Wrap your LLM calls with the tool's SDK (or your own logging layer) so that every prompt and response is captured from inside your application. The call path stays unchanged, but it requires touching your code.
2. Proxy - Point your LLM calls at a proxy layer which intercepts them and makes those calls on your behalf (a quick sketch of this approach follows the list). This helps in controlling some trivial parameters without you worrying about them, but could lead to application downtime in case the proxy is down.
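As a rough illustration of the proxy approach, most OpenAI-compatible proxies only require changing the client's base URL (and often an extra auth header). The endpoint below is a hypothetical internal proxy, not a real service.

```python
# Sketch of the proxy approach: the application still uses the OpenAI client,
# but requests go to a proxy that logs and forwards them on your behalf.
# "https://llm-proxy.internal.example/v1" is a hypothetical endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-proxy.internal.example/v1",  # proxy instead of api.openai.com
    api_key="UPSTREAM_OR_PROXY_KEY",                   # depends on the proxy's auth model
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```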
Looking at the data collected from your LLM usage, you can make multiple optimisations to improve the cost, quality, user behaviour or security of your product:
1. Cost optimisation for usage (see the cost-tracking sketch after this list):
2. A/B Experimentation on Prompts / Prompt Engineering:
3. Evaluate the right LLM for your use-case: The below screenshot shows an output from empirical.run, an open source project that compares different models and their outputs for given prompts. Such comparisons are generally available within LLM Observability tools and help you accelerate your journey to identifying the best model for your use-case.
4. Improve RAG: RAG (Retrieval-Augmented Generation) is a strategy that leverages existing / private context to help the LLM answer questions better. An LLM Observability tool will also help you iterate efficiently over your context-retrieval queries, w.r.t. prompts, chunking strategy and vector store format.
5. Fine-tuned model performance: Analyse how training the LLM with your data is affecting outputs and whether there is any significant difference in output quality between different versions of your model.
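To make the cost angle concrete, here is a small sketch of deriving per-request cost from the token usage you log; the per-1K-token prices are placeholders to replace with your provider's current rates.

```python
# Illustrative cost calculation from token usage. Prices are placeholders,
# not current rates for any provider.
PRICE_PER_1K = {
    # model: (input $/1K tokens, output $/1K tokens) -- placeholder values
    "gpt-4o-mini": (0.00015, 0.0006),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    in_price, out_price = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * in_price + (completion_tokens / 1000) * out_price

# e.g. a response that used 1,200 prompt tokens and 300 completion tokens
print(round(request_cost("gpt-4o-mini", 1200, 300), 6))
```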
Langsmith is the LLM Observability platform from Langchain, one of the fastest growing projects for LLM orchestration in the early days of the space. Langsmith was launched in July 2023 and so far has over 100K users, one of the largest communities for a tool in the LLM space.
While Langchain is open source, we could not find any Github repo for Langsmith apart from its SDKs. It offers only a cloud SaaS solution with a free-tier offering of 5K traces per month. Self-hosting is available only as an add-on in the Enterprise plan.
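For reference, instrumentation with the LangSmith Python SDK typically looks like the sketch below; the model and project names are illustrative.

```python
# Hedged sketch of tracing a function with the LangSmith SDK.
# Assumes `pip install langsmith openai` and these environment variables:
#   LANGCHAIN_TRACING_V2=true
#   LANGCHAIN_API_KEY=<your LangSmith key>
#   LANGCHAIN_PROJECT=my-llm-app   # optional project name
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable  # records inputs, outputs and latency as a LangSmith trace
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(answer("What is LLM Observability?"))
```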
Langfuse is a leading observability tool for tracing, evaluation, logging and prompt management.
Langfuse is one of the most popular open-source LLM Observability platforms as per their public usage statistics.
Cloud platform with freemium plans -- read about them here: https://langfuse.com/pricing
Github Repo: https://github.com/langfuse/langfuse
Website: https://langfuse.com/
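Getting traces into Langfuse can be as small as swapping the import for its OpenAI drop-in wrapper, as sketched below; the model name and prompt are illustrative.

```python
# Hedged sketch using Langfuse's drop-in wrapper for the OpenAI client.
# Assumes `pip install langfuse openai` and these environment variables:
#   LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
from langfuse.openai import openai  # instrumented drop-in for the openai module

completion = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise our refund policy."}],
)
print(completion.choices[0].message.content)
# The prompt, response, token usage and latency show up as a trace in Langfuse.
```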
Helicone is an Open Source LLM Observability startup, part of the YCombinator W23 cohort.
- They have a generous free-tier of 50K monthly logs.
- They are also open-source with an MIT License.
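Helicone follows the proxy pattern described earlier: per its docs, you point the OpenAI client at its gateway and pass a Helicone key in a header. A sketch, with placeholder keys:

```python
# Hedged sketch: routing OpenAI traffic through Helicone's proxy endpoint.
# Assumes `pip install openai`, an OpenAI key in OPENAI_API_KEY and a
# Helicone API key; values in angle brackets are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # Helicone proxy in front of OpenAI
    default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```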
It is available in open source with Apache 2.0 licence. Its free-tier allows 1K daily events only.
Arize is an ML Observability platform, catering to evaluation, observability and analytics for all ML / LLM models.
Freemium model
(At Doctor Droid, we use Portkey.) Portkey originally got popular for its Open Source LLM Gateway, which helps abstract 100+ LLM endpoints behind one API. After that, they started working on their LLM Observability tool, which is their current focus area.
Its free-tier offers 10K monthly requests.
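Calls through Portkey's gateway with its Python SDK tend to look like the sketch below; the keys are placeholders, and the virtual key is assumed to map to an upstream provider configured in the Portkey dashboard.

```python
# Hedged sketch: calling a model through Portkey's gateway using its Python SDK.
# Assumes `pip install portkey-ai`; keys below are placeholders.
from portkey_ai import Portkey

portkey = Portkey(
    api_key="<PORTKEY_API_KEY>",
    virtual_key="<PROVIDER_VIRTUAL_KEY>",  # maps to an upstream LLM provider
)

resp = portkey.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```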
Traceloop is another YC W23 batch startup that helps with monitoring LLM models. Their SDK, OpenLLMetry, helps teams forward LLM Observability logs to 10+ different tools instead of staying locked into any one tool.
It offers a backend for receiving these traces as well. The free-tier offering includes 10K monthly traces. It is also open source with an Apache 2.0 License.
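OpenLLMetry is built on OpenTelemetry, so setup is essentially one init call after which supported LLM clients are auto-instrumented; a sketch, with an illustrative app name:

```python
# Hedged sketch: initialising OpenLLMetry (Traceloop's SDK).
# Assumes `pip install traceloop-sdk openai` and a TRACELOOP_API_KEY (or
# another OTLP destination) configured in the environment.
from traceloop.sdk import Traceloop
from openai import OpenAI

Traceloop.init(app_name="my-llm-service")  # auto-instruments supported LLM clients

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```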
It is only available in Python.
It is open source with MIT License. The cloud offering is not self-serve.
Datadog is a popular infrastructure & application monitoring platform which has extended some of its integrations into the space of LLMs and related tooling. Its out-of-the-box dashboards for LLM Observability are interesting to see, yet it’s clear that there is a lot more one would expect from a brand like Datadog here.
Currently, it only supports the OpenAI integration in depth. Like the standard Datadog offering, it is cloud-only.
Pricing is based on Metrics / Traces usage within Datadog.
Details on how the setup works - https://docs.datadoghq.com/integrations/openai/
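Per the integration docs linked above, the setup relies on ddtrace auto-instrumentation; a sketch of what that usually looks like in Python, assuming a running Datadog Agent:

```python
# Hedged sketch: enabling Datadog's OpenAI integration via ddtrace.
# Assumes `pip install ddtrace openai` and a running Datadog Agent; the common
# alternative is launching the app with `ddtrace-run python app.py`.
from ddtrace import patch_all
patch_all()  # patches supported libraries, including the openai client

from openai import OpenAI

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```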
While we tried to give you a detailed overview of some of the top tools in the category, here’s an additional list of tools:
Commercial
Like any other developer tool category, there is a large overlap of features amongst these tools. Here is our take: