
OpenLLMetry data

OpenLLMetry data refers to the telemetry and observability information collected from applications that integrate large language models (LLMs). OpenLLMetry itself is an open-source extension of OpenTelemetry, designed to provide comprehensive insight into the performance, behavior, and interactions of LLMs within a larger software system. This data is crucial for debugging, optimizing, and ensuring the reliability and ethical alignment of LLM-powered applications.

Unlike general application observability, OpenLLMetry focuses on capturing metrics and traces that are unique to LLM operations (the sketch after this list shows how several of these surface as span attributes), such as:

  • Prompt and completion data: The actual input prompts sent to the LLM and the generated responses (completions)
  • Token usage: The number of input and output tokens consumed by the LLM for each request, which is often directly related to cost and performance
  • Model attributes: Details about the specific LLM model used (for example, model name, version, temperature parameter)
  • Latency and response times: The time taken for LLM calls to complete
  • Error rates: Instances where LLM interactions fail or produce unexpected results
  • User feedback: Data collected from user interactions that can be used to improve model performance
  • Agent and tool monitoring: Insights into the workflows of autonomous systems and the tools they utilize
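
To make these concrete, here is a minimal sketch of recording such signals as attributes on an OpenTelemetry span using the opentelemetry-api Python package. The gen_ai.* attribute keys follow the OpenTelemetry GenAI semantic conventions that OpenLLMetry builds on, but exact key names vary between versions, so treat them as illustrative; the model name, parameter values, and token counts below are placeholders.

```python
# A minimal sketch: recording LLM-specific telemetry on an OpenTelemetry span.
# Attribute keys follow the OTel GenAI semantic conventions; exact names
# vary by version, so treat them as illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("llm-demo")

def call_llm(prompt: str) -> str:
    with tracer.start_as_current_span("chat gpt-4o") as span:
        # Model attributes: which model and parameters were used
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        span.set_attribute("gen_ai.request.temperature", 0.2)

        completion = "..."  # placeholder for the actual LLM call

        # Token usage: directly related to cost and performance
        span.set_attribute("gen_ai.usage.input_tokens", 42)
        span.set_attribute("gen_ai.usage.output_tokens", 128)
        return completion
```

In practice, OpenLLMetry's auto-instrumentation sets these attributes for you when you call a supported LLM client library; manual spans like this are mainly useful for custom or unsupported integrations.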

This data is standardized using the OpenTelemetry protocol, so it can be integrated seamlessly with the Splunk platform to provide a holistic view of the AI deployment stack.
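
As a hedged illustration of that integration, the sketch below initializes the OpenLLMetry (Traceloop) SDK and points its export at an OpenTelemetry Collector that forwards to the Splunk platform. The endpoint URL and application name are placeholders, and the environment variable reflects the OpenLLMetry documentation at the time of writing; verify both against the current OpenLLMetry and Splunk documentation.

```python
# A minimal sketch: sending OpenLLMetry traces toward the Splunk platform
# through an OpenTelemetry Collector. Endpoint and app name are placeholders.
import os

from traceloop.sdk import Traceloop

# Point the SDK at your Collector, which in turn exports to Splunk.
# TRACELOOP_BASE_URL is the variable documented by OpenLLMetry; confirm
# it for the SDK version you use.
os.environ.setdefault("TRACELOOP_BASE_URL", "http://localhost:4318")

# Automatically instruments supported LLM client libraries
# (OpenAI, Anthropic, and others).
Traceloop.init(app_name="chatbot-service")
```

Once initialized, supported LLM client libraries are instrumented automatically, so the prompt, completion, token, and latency data described above is captured without further code changes.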

Examples of OpenLLMetry data include:

  • Chatbot application performance: For each user query, OpenLLMetry captures the user's prompt, the LLM's generated response, the number of tokens used for both, and the latency of the LLM's processing. If the chatbot uses multiple LLMs or external tools (for example, a knowledge base lookup), the data also includes traces of these interactions. This data helps identify whether certain types of queries lead to higher token usage or longer response times, or whether specific LLM models are underperforming. It can also highlight issues with prompt engineering or tool integration.
  • Content generation platform: When a user requests an article or summary, OpenLLMetry logs the specific parameters used for the generation (for example, desired length or tone), the input prompt, the generated content, and the associated token count. It can also track different versions of prompts used over time. This data allows developers to compare the efficiency and quality of different prompt versions, optimize for token usage, and identify patterns in content generation that might indicate biases or errors.
  • Code generation assistant: For a code generation request, OpenLLMetry records the user's natural language description, the generated code snippet, the LLM model used, and the success/failure rate of the generated code when tested. This helps in understanding which types of code generation requests are most successful, which models perform best for specific programming languages, and where the LLM might be generating incorrect or inefficient code.
  • Vector database interactions in a RAG system: In a Retrieval-Augmented Generation (RAG) system, OpenLLMetry traces calls to vector databases (like Pinecone or Chroma), capturing the query sent to the database, the retrieved documents, and the latency of the database lookup (see the sketch after this list). This data is vital for optimizing the retrieval phase of a RAG system, ensuring that relevant information is efficiently retrieved before being passed to the LLM for generation.
  • Sentiment analysis microservice: For each text input, OpenLLMetry captures the input text, the sentiment score or category predicted by the LLM, and any confidence scores. This data helps monitor the accuracy and consistency of the sentiment analysis over time, identify edge cases where the LLM struggles, and track the performance of the microservice itself.
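
For the RAG scenario above, the following is a minimal sketch of grouping the retrieval and generation steps into a single trace with OpenLLMetry's workflow and task decorators from the traceloop-sdk package. The function names and bodies are hypothetical stand-ins for a real vector database lookup and LLM call; verify the import path against the SDK version you use.

```python
# A minimal sketch of tracing a RAG pipeline with OpenLLMetry decorators.
# Function names and bodies are hypothetical; only the decorator pattern
# reflects the traceloop-sdk API.
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task, workflow

Traceloop.init(app_name="rag-service")

@task(name="retrieve_documents")
def retrieve_documents(query: str) -> list[str]:
    # Stand-in for a vector database lookup (Pinecone, Chroma, etc.);
    # the query, retrieved documents, and lookup latency land on this span.
    return ["doc-1 text", "doc-2 text"]

@task(name="generate_answer")
def generate_answer(query: str, docs: list[str]) -> str:
    # Stand-in for the LLM call; auto-instrumentation records the prompt,
    # completion, and token usage on the resulting child span.
    return f"Answer to {query!r} based on {len(docs)} documents"

@workflow(name="rag_query")
def rag_query(query: str) -> str:
    # Parent span tying retrieval and generation into one trace.
    return generate_answer(query, retrieve_documents(query))

print(rag_query("What is OpenLLMetry?"))
```

The decorators create a parent workflow span with one child span per task, which is what lets you attribute latency or token cost to the retrieval step versus the generation step.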

Before looking at documentation for specific data sources, review the Splunk Docs information on general data ingestion: