
Creating, monitoring, and optimizing LLM retrieval augmented generation patterns

 

The promise of generative AI and large language models (LLMs) is immense, offering transformative capabilities from automating customer support to streamlining internal knowledge management. However, bringing these AI capabilities into an enterprise environment comes with a unique set of challenges. Out-of-the-box LLMs lack the specific, trusted context of your business data. This can lead to hallucinations, where the AI confidently provides incorrect or irrelevant information, or to a simple inability to answer questions relevant to your operations. Sending vast amounts of data to an LLM can also become inefficient and costly because of token limits.

Retrieval augmented generation (RAG) addresses these problems by efficiently grounding LLMs in your proprietary data sources, helping ensure their responses are accurate, relevant, and aligned with your business context. You can use RAG to do things like:

  • Provide customer service agents with instant access to product manuals, FAQs, and past resolutions.
  • Create intelligent internal knowledge bases from HR documents, engineering specifications, or transcribed meeting notes.
  • Build compliance assistants that can quickly reference legal documents and internal policies.
  • Develop smart search capabilities over archives of internal reports and research.

But building these applications is only the first step. After they've been deployed, you'll need observability to ensure they perform optimally, control costs, and deliver high-quality, reliable answers.

This article walks you through the process of building a RAG pattern for your LLM applications using Python and LangChain. You'll then learn how to instrument these applications with OpenTelemetry and use Splunk Observability Cloud to gain actionable insights, troubleshoot issues, and continuously optimize their performance, cost-efficiency, and reliability in production.

You can also review the GitHub repo for the example application used in this article.

How to use Splunk software for this use case

Step 1: Build your application

This section outlines the core components of a RAG application, using Python and LangChain.

1.1 Load your data

Load your data into your application. LangChain provides a variety of document loaders. We'll use a PDF textbook as the example throughout this article, but you can use any data source and file format you choose. See the LangChain documentation for information on the document loaders available for other data sources and file formats.

from langchain_community.document_loaders import PyPDFLoader

# Replace your_textbook.pdf with your actual file path
loader = PyPDFLoader(
    "your_textbook.pdf",
    mode="page"
)

textbook_pages = loader.load()

1.2 Create embeddings and store in a vector database

Convert your document pages into numerical representations called embeddings. These embeddings capture the semantic meaning of the text. Store these embeddings in a vector database for efficient similarity searches. In this example, we're using a vector database called Chroma, but there are many others you could use.

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")

db = Chroma.from_documents(
    textbook_pages,
    embedding=embeddings_model,
    persist_directory="./chroma_db"
)

1.3 Initialize the vector database and LLM

Set up your application to interact with the vector database and your chosen LLM. In this example, we're using OpenAI's GPT-4o mini, but you can use any LLM that you choose.

from langchain.chat_models import init_chat_model
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings_model
)

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

1.4 Define your LLM prompt template

A well-defined prompt is important for instructing the LLM to use the provided context and avoid making up answers. In this example, we're explicitly telling the LLM to answer the question using only the retrieved context, rather than relying on its own general knowledge, and to say it doesn't know the answer if the context doesn't contain the information needed.

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "You are an assistant for question-answering tasks. Use "
    + "the following pieces of retrieved context to answer the question. If you don't know the answer, "
    + "just say that you don't know. Use three sentences maximum and keep the answer concise.\n"
    + "Question: {question}\n"
    + "Context: {context}\n"
    + "Answer:"
)

1.5 Define functions to retrieve related documents and generate a response using the LLM

These functions handle finding relevant documents and generating the LLM's response. In the example below, we're creating two functions for the application to use:

  • the retrieve function, which takes the user's question and performs a similarity search in the Chroma vector store to look for related documents
  • the generate function to return a response from the LLM that uses the related documents and the prompt template

from langchain_core.documents import Document
from typing_extensions import List, TypedDict

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

1.6 Build a graph and invoke it

In addition to LangChain, this example uses LangGraph, a framework well suited to building agentic AI applications. Create a graph with LangGraph from the two functions defined in the previous step, then invoke it with a question. The LLM will respond with an answer based on the information you've provided.

from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

response = graph.invoke({"question": "Enter your question here"})
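
To see the generated answer and confirm that retrieval ran, print fields from the returned state. The returned state carries the question, context, and answer fields defined in the State class above:

# The final state returned by the graph includes the retrieved context and the answer.
print(f"Retrieved {len(response['context'])} context documents")
print(response["answer"])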

Step 2: Instrument your application with OpenTelemetry and Splunk Observability Cloud

Now that your RAG application is built, let's instrument it to send logs, metrics, and traces to Splunk Observability Cloud.

2.1 Install the Splunk Distribution of the OpenTelemetry Collector

The Splunk Distribution of the OpenTelemetry Collector acts as an agent, receiving telemetry data from your application and forwarding it to Splunk Observability Cloud. In the example below, replace $SPLUNK_REALM, $SPLUNK_MEMORY_TOTAL_MIB, and $SPLUNK_ACCESS_TOKEN with your specific Splunk Observability Cloud details.

curl -sSL https://dl.signalfx.com/splunk-otel-collector.sh > /tmp/splunk-otel-collector.sh;
sudo sh /tmp/splunk-otel-collector.sh \
--realm $SPLUNK_REALM \
--memory $SPLUNK_MEMORY_TOTAL_MIB \
-- $SPLUNK_ACCESS_TOKEN

2.2 Instrument your Python application

Use the Splunk Distribution of OpenTelemetry Python to automatically instrument your application and its dependencies. In the example below, the service name is back-to-school-with-gen-ai and the environment is deployment.environment=test. Replace these with values appropriate for your own environment.

# install the package
pip install "splunk-opentelemetry[all]"

# install instrumentation for packages used by the app
opentelemetry-bootstrap -a install

# use environment variables to tell OpenTelemetry how to report data
export OTEL_SERVICE_NAME=back-to-school-with-gen-ai
export OTEL_RESOURCE_ATTRIBUTES='deployment.environment=test'

# start the application with instrumentation
opentelemetry-instrument python app.py


Step 3: View data in Splunk Observability Cloud

After your application is running with OpenTelemetry instrumentation, telemetry data will flow into Splunk Observability Cloud, providing deep insights into its behavior.

3.1 View Service Map

In Splunk Observability Cloud, navigate to the Service Map. You'll see your service, along with its dependencies, such as OpenAI (for embeddings and LLM calls) and your Chroma vector database (represented as a SQLite database). This visual representation helps you understand your application's architecture and dependencies at a glance.

3.2 Analyze traces

Go to the Traces view in Splunk Observability Cloud. For each request to your RAG application, you'll see a detailed trace.

You'll see the following information in this area:

  • AI interaction icons: Splunk Observability Cloud automatically identifies and highlights interactions with LLM components. You'll see icons for:
    • embeddings: When your application calculates embeddings for the user's question.
    • vectordb: When your application performs a similarity search.
    • chat: When your application calls the LLM for a response.
  • Detailed spans: Click on individual spans within the trace to see granular details:
    • Vector database queries: For vector database interactions, you can see the exact SQL query used for the similarity search, even if generated by LangChain. This provides visibility into underlying operations you didn't explicitly write.
    • LLM request details: For chat interactions, you'll find information such as:
      • Token usage: The total number of input and output tokens used for that specific LLM request. This is a direct indicator of cost.
      • Cost: The estimated cost of the LLM interaction.
      • Full prompt and context: The exact question and the retrieved context sent to the LLM, allowing you to verify the quality of the input.
      • LLM response: The answer returned by the LLM.

By examining these traces, you can understand the flow, identify bottlenecks, and pinpoint exactly what information is being sent to and received from your LLM.
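
Auto-instrumentation captures the embeddings, vectordb, and chat spans for you. If you also want application-specific details on the trace, you can add a custom span around the graph invocation with the OpenTelemetry API. The sketch below is a minimal example; the span name rag.answer_question and the attributes rag.question_length and rag.context_docs are illustrative choices for this article, not standard conventions.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def answer_question(question: str):
    # Wrap the RAG invocation in a custom span so application-specific details
    # appear alongside the auto-instrumented embeddings, vectordb, and chat spans.
    with tracer.start_as_current_span("rag.answer_question") as span:
        span.set_attribute("rag.question_length", len(question))
        response = graph.invoke({"question": question})
        span.set_attribute("rag.context_docs", len(response["context"]))
        return response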

Step 4: Optimize LLM application performance

With insights from Splunk Observability Cloud, you can now identify and implement optimizations. A common area for optimization in RAG applications is managing the amount of context sent to the LLM.

4.1 Identify optimization opportunities

From your traces, you might notice that LLM requests are using a high number of tokens. High token usage directly translates to higher costs and potentially longer response times.

4.2 Adjust the number of retrieved documents (k value)

The similarity_search function in LangChain (and similar frameworks) often retrieves a default number of related documents (for example, k=4). If your data source is relatively simple, you might not need that many documents to answer a question accurately.

Modify your retrieve function to reduce the k value:

def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(
        query=state["question"],
        k=2  # Reduce the number of retrieved documents from the default (for example, 4) to 2
    )
    return {"context": retrieved_docs}

4.3 Verify improvements in Splunk Observability Cloud

After deploying this change, look at the new traces in Splunk Observability Cloud. You should see:

  • Reduced token usage: The token count for LLM interactions will decrease significantly (for example, from 925 to 583 tokens).
  • Lower cost: The estimated cost per LLM request will drop.
  • Improved latency: LLM response times might decrease, making your application faster.

While reducing k can save costs and improve performance, it's a trade-off. For more complex subject matter, you might need to increase k to ensure the LLM has enough context to provide an accurate answer. Use Splunk Observability Cloud to monitor the impact of these changes on both performance and response quality.
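
One way to keep this trade-off easy to revisit is to read k from configuration rather than hard-coding it, then compare traces before and after each change. The sketch below assumes an environment variable named RAG_RETRIEVAL_K, which is an illustrative name rather than a setting recognized by LangChain or OpenTelemetry.

import os

# Read the number of retrieved documents from an environment variable so it can be
# tuned per deployment without a code change. RAG_RETRIEVAL_K is an illustrative name.
RETRIEVAL_K = int(os.environ.get("RAG_RETRIEVAL_K", "2"))

def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(
        query=state["question"],
        k=RETRIEVAL_K
    )
    return {"context": retrieved_docs}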

Next steps

The content in this article comes from the .conf25 talk, OBS1107 - Observability for Gen AI: Monitoring LLM Applications with OpenTelemetry and Splunk, one of the thousands of Splunk resources available to help users succeed. In addition, these resources might help you understand and implement this guidance: