Monitoring Gen AI apps with NVIDIA GPUs
Organizations are increasingly building applications that leverage Large Language Models (LLMs) and generative AI to solve real business problems. Some characteristics of these applications include the following (a brief sketch of such an application follows the list):
- Many are written using Python, and they use commercial LLMs such as OpenAI’s gpt-4o and Anthropic’s Claude 3.5 Haiku.
- They might also utilize open-source models such as Meta’s Llama.
- They run on NVIDIA GPU hardware, typically in Kubernetes.
- They also use frameworks such as LangChain to abstract the details of an LLM’s API.
- And they use vector databases such as Pinecone and Chroma to store embeddings for Retrieval Augmented Generation (RAG).
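To make this concrete, here’s a minimal, hypothetical sketch of such an application: it calls a commercial LLM directly through the openai Python client. The function name and prompt are illustrative, and in a real RAG application a framework like LangChain and a vector database would typically sit in front of this call:

# Hypothetical example; assumes the openai package is installed and
# OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def answer_question(question: str, context: str) -> str:
    """Send a prompt, augmented with retrieved context, to the LLM."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content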
As with any other application, observability is critical to ensure that these LLM-based applications remain performant and provide an optimal user experience. In this article, we’ll outline the steps you can take to start monitoring Gen AI applications with Splunk Observability Cloud.
Observability and generative AI
OpenTelemetry is quickly gaining traction in the Gen AI space. Several open-source projects have instrumented the most common Gen AI application components with OpenTelemetry; one example is OpenLIT, which we’ll use later in this article.
Splunk Observability Cloud is OpenTelemetry native, which makes it the perfect choice for monitoring Gen AI applications.
Let’s walk through how you can add observability to your own Gen AI applications with Splunk Observability Cloud.
Step 1: Get access to Splunk Observability Cloud
To begin, you’ll need a Splunk Observability Cloud organization to send your metric and trace data to. If you’re not currently a Splunk Observability Cloud customer, you can sign up for a free 14-day trial here; no credit card is required.
Step 2: Deploy the Splunk Distribution of the OpenTelemetry Collector
You can install the Splunk Distribution of the OpenTelemetry Collector for Kubernetes using Helm, as explained here. Helm charts make it easy to define, install, and upgrade Kubernetes applications.
The command to install the collector using Helm will look something like the following:
helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart

helm repo update

helm install splunk-otel-collector \
--set="splunkObservability.realm=$REALM" \
--set="splunkObservability.accessToken=$ACCESS_TOKEN" \
--set="splunkObservability.profilingEnabled=true" \
--set="clusterName=$CLUSTER_NAME" \
--set="environment=$ENVIRONMENT" \
--set="splunkPlatform.token=$HEC_TOKEN" \
--set="splunkPlatform.endpoint=$HEC_URL" \
--set="splunkPlatform.index=$INDEX" \
-f ./values.yaml \
splunk-otel-collector-chart/splunk-otel-collector
Note that the above command references a file named values.yaml, which is used to further customize the configuration of the collector. It also uses environment variables to define settings such as the access token and HEC URL.
After the collector is running, you should see data coming in when you navigate to Infrastructure -> Kubernetes -> Kubernetes Cluster and then search for your cluster name.
Step 3: Capture NVIDIA GPU metrics
The NVIDIA GPU Operator is typically deployed in Kubernetes clusters with NVIDIA GPU hardware to simplify the process of making this specialized hardware available to pods that require it.
The operator includes a /metrics endpoint that you can scrape with the Prometheus receiver running in your OpenTelemetry Collector to capture metrics.
The Prometheus receiver can be added by using a values.yaml file like the one found here.
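As a rough sketch (not the exact file linked above), the addition to values.yaml might look something like the following. The DCGM exporter service name, namespace, and port are assumptions that you should confirm for your cluster, and the processors and exporter listed in the pipeline mirror the chart’s defaults:

agent:
  config:
    receivers:
      # Hypothetical scrape job for the GPU operator's DCGM exporter
      prometheus/nvidia-gpu:
        config:
          scrape_configs:
            - job_name: nvidia-gpu-metrics
              scrape_interval: 10s
              static_configs:
                - targets:
                    # Assumed service name, namespace, and port; adjust for your cluster
                    - nvidia-dcgm-exporter.gpu-operator.svc.cluster.local:9400
    service:
      pipelines:
        metrics/nvidia-gpu:
          receivers: [prometheus/nvidia-gpu]
          processors: [memory_limiter, batch, resourcedetection, resource]
          exporters: [signalfx]

With this in place, GPU utilization and GPU memory metrics reported by the DCGM exporter flow into Splunk Observability Cloud alongside the rest of your Kubernetes metrics.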
Apply the changes with a command such as the following:
helm upgrade splunk-otel-collector \
--set="splunkObservability.realm=$REALM" \
--set="splunkObservability.accessToken=$ACCESS_TOKEN" \
--set="splunkObservability.profilingEnabled=true" \
--set="clusterName=$CLUSTER_NAME" \
--set="environment=$ENVIRONMENT" \
--set="splunkPlatform.token=$HEC_TOKEN" \
--set="splunkPlatform.endpoint=$HEC_URL" \
--set="splunkPlatform.index=$INDEX" \
-f ./values.yaml \
splunk-otel-collector-chart/splunk-otel-collector
The resulting metrics provide a wealth of information about your GPU infrastructure. They give you insight into whether the hardware is being used efficiently and whether you’re at risk of running out of GPU capacity, so you can take action before end users are negatively impacted.
Step 4: Instrument your Python application for Splunk Observability Cloud
The Python agent from the Splunk Distribution of OpenTelemetry Python can automatically instrument Python applications by dynamically patching supported libraries. Follow the steps here to start collecting metrics and traces from your Python-based Gen AI applications.
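As a rough sketch of what those steps look like, instrumentation typically involves installing the distribution, bootstrapping the instrumentation packages, and launching the app through the distribution’s wrapper. The command names below are based on the 1.x releases of splunk-opentelemetry and may differ in your version, so treat them as an assumption and follow the linked documentation:

# Install the Splunk Distribution of OpenTelemetry Python, then install
# instrumentation packages for the libraries your application uses
pip install "splunk-opentelemetry[all]"
splunk-py-trace-bootstrap

# Run the application through the wrapper so supported libraries
# are patched automatically at startup
splunk-py-trace python app.py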
Activate AlwaysOn Profiling if you’d like to capture CPU call stacks for your application as well.
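If the application is deployed to Kubernetes, AlwaysOn Profiling for the Python agent can be switched on with an environment variable in the deployment manifest. The variable name below reflects our understanding of the agent’s settings; confirm it against the profiling documentation:

env:
  # Assumed setting to enable AlwaysOn Profiling for the Python agent
  - name: SPLUNK_PROFILER_ENABLED
    value: "true"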
After the application has been instrumented, the service map shows the interaction amongst the various application components.
Step 5: Enhance instrumentation with OpenLIT
The metric and trace data captured by the Splunk Distribution of OpenTelemetry Python can be enhanced with an open-source solution, such as OpenLIT.
Doing this requires only two steps. First, install the openlit package:
pip install openlit
Second, import the openlit package in your Python code, and then initialize it:
import openlit

…

# Initialize OpenLIT instrumentation
openlit.init()
You'll need to set a few environment variables in the Kubernetes manifest file used to deploy the application to ensure that OpenLIT knows where to export the data:
env:
  - name: OTEL_SERVICE_NAME
    value: "my-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=my-environment"
  - name: SPLUNK_OTEL_AGENT
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(SPLUNK_OTEL_AGENT):4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
After this is deployed, you'll see more detailed trace data, including all the calls to commercial and open-source LLMs, as well as interactions with vector databases.
If you click on a span that calls an LLM, you’ll see additional data such as the number of tokens utilized, the cost, and even the full prompt and response.
While the prompt and response details are extremely helpful for debugging, they should be disabled in production environments to avoid capturing personally identifiable information (PII). To do this, modify the code as follows (and as explained here):
openlit.init(trace_content=False)
If you enabled AlwaysOn Profiling, you will also see CPU call stacks captured for some of the spans in the trace, which tells you exactly where CPU time is being spent.
Step 6: Start using the data
Now that you’re collecting detailed metric and trace data, you can use it to understand how your application is performing. When latency or errors are discovered, you can use features such as Tag Spotlight to quickly understand the root cause.
For example, if your application makes calls to OpenAI, you might encounter openai.RateLimitError, which occurs when your application has sent too many requests to the OpenAI API in a given period of time. Now that you’re aware of this error, you can use the guidance from OpenAI to reduce or eliminate it.
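OpenAI’s guidance centers on retrying rate-limited requests with exponential backoff. Here’s a minimal sketch of that pattern, assuming the openai Python client and a hypothetical chat_with_backoff helper:

import time

import openai
from openai import OpenAI

client = OpenAI()

def chat_with_backoff(messages, model="gpt-4o", max_retries=5):
    """Retry rate-limited chat completions with exponential backoff."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= 2  # double the wait before retrying

Each retried call still produces a span, so you can also see in Splunk Observability Cloud how often the backoff path is being exercised.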
Summary
In this article, we provided an overview of the technologies organizations are using to build applications that leverage LLMs and Generative AI.
We then provided a list of steps that can be followed to begin monitoring Gen AI applications using Splunk Observability Cloud:
- Get access to Splunk Observability Cloud.
- Deploy the Splunk Distribution of the OpenTelemetry Collector.
- Capture NVIDIA GPU metrics.
- Instrument your Python application for Splunk Observability Cloud.
- Enhance instrumentation with OpenLIT.
- Start using the data.
If you encounter any challenges with these steps, please ask a Splunk Expert for help.