Instrumenting LLM applications with OpenLLMetry and Splunk

 

You've most likely heard of ChatGPT, the popular chatbot developed by OpenAI. Unlike earlier chatbots you may have interacted with, ChatGPT is based on a large language model (LLM) and can answer questions in natural language, with results that are much better than you might expect.

Given the availability of powerful new LLMs like GPT-4, organizations have started building LLM applications of their own, leveraging generative AI capabilities to improve productivity for their employees and customers. In this article, we'll demonstrate how OpenTelemetry and Splunk Observability Cloud can be used to ensure LLM applications are observable.

OpenLLMetry is described as “Open-source observability for your LLM application”. It includes OpenTelemetry instrumentation for the most popular LLM providers, including OpenAI, which we’ve used for our sample application in this article.

If you are familiar with LLMs, you can skip the next few sections of background information and go straight to the application portion of this article.

Definitions

Before going further, let’s take a moment to define some key concepts used in this article.

Generative AI

Generative AI is defined as “artificial intelligence (AI) that creates different types of content, such as text, images, audio, videos and 3D models.”

Large Language Models (LLMs)

LLMs are a type of generative AI, and are defined as: “deep learning algorithms that can recognize, summarize, translate, predict, and generate content using very large datasets.”

For more information on large language models, see Large Language Models Explained.

LLM Applications

LLM applications are built on top of LLM providers such as OpenAI's GPT-4. For example, an organization could build a chatbot using the OpenAI Assistants API and host it on its website to help customers with various questions and tasks.

Why is observability important for LLM applications?

LLM applications are similar to other applications in that they can experience slow performance or errors. These issues can be caused by unexpected user input, unhealthy infrastructure, or problems with the downstream APIs that provide the LLM capabilities.

Just like any other application, observability is critical to ensure that LLMs remain performant and provide an excellent user experience.

LLM applications also have unique characteristics that need to be considered when building a comprehensive observability strategy. For example, monitoring API calls to the underlying LLM is highly important given that they are not free and can end up being a significant cost driver. Any observability solution should deliver visibility into these API calls, along with the tokens associated with them, in order to ensure actionable cost control.

What does an LLM application look like?

There are numerous LLM providers that can be used to build LLM applications, including OpenAI, which we use for the sample application in this article.

Roadmap for the remainder of this article

Now that we understand what generative AI, LLMs, and LLM applications are all about, here's the roadmap for the remainder of this article: we'll run a sample LLM application, mock its OpenAI API responses to avoid costs during development, instrument it with the Splunk Distribution of OpenTelemetry Python, enhance that instrumentation with OpenLLMetry, and then explore the resulting data in Splunk Observability Cloud.

Running our sample LLM application

For this article, we've built a simple Python application that uses OpenAI's GPT-3.5 Turbo model. It includes a single endpoint that allows callers of the service to ask a question; the service then uses the OpenAI API to submit the question and return the model's response.

The initial source code includes the following content, and is stored in a file named app.py:

from openai import OpenAI
from flask import Flask, request

app = Flask(__name__)

# The OpenAI client reads the OPENAI_API_KEY environment variable by default
client = OpenAI()

@app.route("/askquestion", methods=['POST'])
def ask_question():
    # Extract the user type and question from the JSON request body
    data = request.json
    user_type = data.get('userType')
    question = data.get('question')

    # Ask the question using the GPT-3.5 Turbo model
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question}
        ]
    )

    return completion.choices[0].message.content

To use the OpenAI API, we'll need an API key, which can be created in the OpenAI platform. After we have the key, we'll create an environment variable to store its value as follows:

export OPENAI_API_KEY='your-api-key-here'

We’ll then create a virtual environment to run the application:

python3 -m venv openai-env
source openai-env/bin/activate

And install the openai and flask modules:

pip3 install --upgrade openai
pip3 install flask

Finally, we can run the application as follows:

flask run -p 8080

To test the application, we'll create a file named question.json containing the question and user type:

{
 "userType": "gold",
 "question":"Hello, world"
}

Next, we can test the application’s endpoint as follows:

curl -d "@question.json" -H "Content-Type: application/json" -X POST http://localhost:8080/askquestion

It will respond with something like:

Hello! How can I assist you today?

Mocking OpenAI API responses

Because there’s a cost to use OpenAI’s API, we’ll use MockGPT for developing and testing our application. MockGPT simulates calls to the OpenAI API and provides mock responses.

Let's update our application as follows to use MockGPT. The updated code creates the client by passing in a MockGPT API key and base URL, rather than using the default OpenAI API key and endpoint:

import os
from openai import OpenAI
from flask import Flask, request

app = Flask(__name__)

# use this config for the MockGPT API
client = OpenAI(
    api_key=os.environ.get("MOCK_GPT_API_KEY"),
    base_url="https://mockgpt.wiremockapi.cloud/v1"
)

@app.route("/askquestion", methods=['POST'])
def ask_question():
    data = request.json
    user_type = data.get('userType')
    question = data.get('question')

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question}
        ]
    )

    return completion.choices[0].message.content

Before running this version of the application, we’ll need to provide our API key for MockGPT:

export MOCK_GPT_API_KEY='your-api-key-here'

Then we can run the updated application as follows:

flask run -p 8080

Because the mock endpoint is invoked this time, it will respond differently:

Hello! 

This is the default MockGPT response. 

Create your own version in WireMock Cloud to fully customise this mock API.

Excellent. Now we can continue developing our application without worrying about cost.

How do we instrument an LLM application using OpenTelemetry?

Since this is a Python application, we can instrument it using the Splunk Distribution of OpenTelemetry Python, following the steps described in Instrument your Python application for Splunk Observability Cloud.

This section assumes that you have an OpenTelemetry collector already running on your host. If not, please refer to Get started: Understand and use the Collector to set one up.
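For example, on a Linux host, one common approach is the installer script documented for the Splunk Distribution of the OpenTelemetry Collector; as a rough sketch, where the realm and access token values are placeholders you'd replace with your own:

curl -sSL https://dl.signalfx.com/splunk-otel-collector.sh > /tmp/splunk-otel-collector.sh
sudo sh /tmp/splunk-otel-collector.sh --realm <your-realm> -- <your-access-token>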

We’ll start by installing the splunk-opentelemetry[all] module, which includes all of the dependencies we need to instrument our Python application:

pip install "splunk-opentelemetry[all]"

Then we’ll run the bootstrap script to install instrumentation for every supported package in our environment:

splunk-py-trace-bootstrap

Next, we'll provide a name for our service, as well as the deployment environment. This is a best practice, as it makes it easier to find our service in Splunk Observability Cloud:

export OTEL_SERVICE_NAME=openai-test
export OTEL_RESOURCE_ATTRIBUTES='deployment.environment=test'

Now we can run our application as follows:

splunk-py-trace flask run -p 8080

Exercise the application a few times to generate some traffic using the same curl command as before.

curl -d "@question.json" -H "Content-Type: application/json" -X POST http://localhost:8080/askquestion

Viewing the data in Splunk Observability Cloud

After a minute or so, we should see a service map for our application appear in Splunk Observability Cloud.

[Screenshot: service map in Splunk Observability Cloud]

In the service map, we can see that we’re connecting to the MockGPT endpoint (rather than the actual OpenAI endpoint), and how long those API calls are taking. We can also see that traces have been collected.

[Screenshot: trace waterfall]

Notice that we didn't have to make any code changes to capture these traces with OpenTelemetry.

These traces provide insight into how our application is performing, and specifically, how well calls to the OpenAI API (or in our case, calls to the MockGPT API) are performing.

This is a great start. In the next section we’ll show how to get a deeper level of instrumentation.

How do we improve the instrumentation of our LLM application?

To improve the instrumentation of our LLM application, we could use the OpenTelemetry SDK for Python to capture additional spans, span attributes, and metrics. This manual instrumentation approach provides full control over what observability data is collected from the application. But it requires a number of code changes, so for our example, we’re going to demonstrate how OpenLLMetry can be used instead.
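For context, a manually instrumented version of the OpenAI call might look something like the sketch below; the span name and attribute names are illustrative choices rather than a standard, and the client object is the one created earlier in app.py:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def ask_openai(question):
    # Wrap the OpenAI call in a custom span and record a few illustrative attributes
    with tracer.start_as_current_span("openai.chat.completion") as span:
        span.set_attribute("llm.request.model", "gpt-3.5-turbo")
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": question}]
        )
        # Record token usage on the span when the API returns it
        if completion.usage is not None:
            span.set_attribute("llm.usage.prompt_tokens", completion.usage.prompt_tokens)
            span.set_attribute("llm.usage.completion_tokens", completion.usage.completion_tokens)
        return completion.choices[0].message.content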

As noted earlier, OpenLLMetry is described as "Open-source observability for your LLM application". It includes OpenTelemetry instrumentation for the most popular LLM providers, including OpenAI, which we've used for our sample application.

We can add OpenAI instrumentation by installing the following module:

pip install opentelemetry-instrumentation-openai 

We then invoke the OpenAIInstrumentor by adding the following code, with the import at the top of app.py and the instrument() call just after the Flask app is created:

from opentelemetry.instrumentation.openai import OpenAIInstrumentor
app = Flask(__name__)
OpenAIInstrumentor().instrument()

Let’s run our application as before:

splunk-py-trace flask run -p 8080

And then exercise the application a few times to generate more traffic using the same curl command as before.

After a minute or so, we should see updated traces in Splunk Observability Cloud.

[Screenshot: trace waterfall with the openai.chat span]

On the surface, the trace looks almost identical to the ones we captured earlier. One notable difference is the second span, which refers to an operation named openai.chat. If we click on this span, we can see that a number of tags have been captured with it.

[Screenshot: tags captured on the openai.chat span]

This demonstrates the power of using OpenLLMetry, which added a tremendous amount of context to our trace with only a single line of code.

Note that the question and the response from the OpenAI endpoint are captured by the instrumentation by default. While this can be helpful for debugging issues, it should be disabled in production by setting the TRACELOOP_TRACE_CONTENT environment variable to false.
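For example, before starting the application in a production environment:

export TRACELOOP_TRACE_CONTENT=false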

Scrolling down further, we can see that the number of tokens utilized for this request have been captured as well.

[Screenshot: token usage tags on the openai.chat span]

In addition to the extra spans and span attributes, OpenLLMetry also captures custom metrics. For example, we can search in Metric Finder for a metric named llm.openai.chat_completions.tokens:

[Screenshot: the llm.openai.chat_completions.tokens metric in Metric Finder]

Here we can see that 9 tokens were used for the prompt (the question we asked GPT-3.5) and 12 tokens were used for the completion (the response from GPT-3.5).

Since we're using the mock interface, these token values will be the same for each call. But in a real application, it's extremely important to see how many tokens are being used over time. It might even be helpful to see which users, or types of users, are consuming the most tokens, and how those tokens equate to cost. The sketch below illustrates the cost idea; we'll explore per-user tracking in the next section.
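A minimal cost-estimation helper might look like this; the per-1K-token prices are placeholders rather than actual OpenAI pricing, and the 9 and 12 token counts are the values we saw above:

# Hypothetical per-1K-token prices (placeholders, not actual OpenAI pricing)
PROMPT_PRICE_PER_1K = 0.0005
COMPLETION_PRICE_PER_1K = 0.0015

def estimate_cost(prompt_tokens, completion_tokens):
    # Estimate the cost of a single chat completion from its token counts
    return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K + \
           (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K

# For example, the 9 prompt tokens and 12 completion tokens seen above
print(f"Estimated cost: ${estimate_cost(9, 12):.6f}")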

How can we track performance by user type?

We’re already tracking the number of tokens utilized with each OpenAI API call. Let’s enhance our solution by also tracking the type of user associated with each request to our application’s endpoint. We can do this by updating the application code as follows to capture a new span attribute called “user.type”:

import os
from openai import OpenAI
from flask import Flask, request
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace

app = Flask(__name__)

OpenAIInstrumentor().instrument()

# use this config for the MockGPT API
client = OpenAI(
    api_key=os.environ.get("MOCK_GPT_API_KEY"),
    base_url="https://mockgpt.wiremockapi.cloud/v1"
)

@app.route("/askquestion", methods=['POST'])
def ask_question():
    current_span = trace.get_current_span()  # <-- get a reference to the current span

    data = request.json
    user_type = data.get('userType')
    question = data.get('question')

    # track the type of user that makes each request
    current_span.set_attribute("user.type", user_type)

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question}
        ]
    )

    return completion.choices[0].message.content

Let’s restart the application and generate some traffic as we did earlier. Now, when we look at the tags collected with the first span in the trace, we can see that the user type has been captured.

[Screenshot: trace with the user.type tag]

Let’s take this one step further, and create a Monitoring MetricSet (MMS) for the user.type tag.

[Screenshot: creating a Monitoring MetricSet for the user.type tag]

Creating a Monitoring MetricSet allows us to access additional, powerful capabilities in Splunk Observability Cloud. To learn more about these capabilities, see Up Your Observability Game With Attributes.

For example, we can use Tag Spotlight to determine whether any particular types of users are getting a higher error rate or slower response times than others. In this case, we can see that bronze users are getting a higher error rate (with 100 percent of requests resulting in an error).

[Screenshot: Tag Spotlight showing error rate by user type]

We can also see that silver users have a slower response time on average than other user types.

[Screenshot: Tag Spotlight showing latency by user type]

We can also use the breakdown feature on dynamic service maps to visualize performance by user type.

[Screenshot: service map breakdown by user type]

These powerful capabilities make it easy to understand exactly how our LLM application performs for each user type. In addition, the traces collected with OpenTelemetry provide us with detailed contextual information required to solve problems quickly.

Summary

In this article, we provided an overview of AI-related concepts such as Generative AI and Large Language Models (LLMs).

We then showed how a simple LLM application written in Python can be instrumented with OpenTelemetry, and how OpenLLMetry further enhances the instrumentation by capturing additional span attributes and metrics.

Finally, we showed how this data can be used in Splunk Observability Cloud, and how its powerful features can leverage these tags to understand exactly how our LLM application is performing and to quickly solve issues when something goes wrong.

To get started instrumenting your own LLM application with OpenTelemetry and Splunk Observability Cloud today, see Instrument back-end applications to send spans to Splunk APM and select your desired language (Python, Node.js, etc.).

For more help, ask a Splunk Expert.