Automating alert investigations by integrating LLMs with the Splunk platform and Confluence

 

As a Splunk user, you know that data is the key to everything. But you also know that the data you need is often spread across multiple systems. Your IT tickets are in ServiceNow, your runbooks are in Confluence, your team chatter is in Slack, and your critical operational data is right here in the Splunk platform.

For an IT or security analyst, this context switching is a major drag on efficiency. Investigating even a simple alert can mean bouncing between three or four browser tabs, manually copying and pasting information, and trying to connect the dots. This slows down your mean time to resolution (MTTR) and adds friction to your workflow.

But what if you could bring your tools together into a single, unified interface? In this use case, we'll show you how the Splunk Model Context Protocol (MCP) server for Splunk Cloud Platform helps you do exactly that. We'll walk through a practical example of how connecting a Large Language Model (LLM) to both the Splunk platform and Confluence can transform your incident response process, turning a multi-step manual investigation into an automated, conversational workflow.

Data required

Atlassian: Confluence

Scenario

Let's imagine you're an IT operations analyst at an e-commerce company. Your "StockSavvy" SaaS application is hosted in AWS, and you rely on the Splunk platform for monitoring and Confluence for your operational runbooks.

You've just received a "database connection pool exhaustion" alert. This is a critical issue that could impact customers, so you need to act fast.

The old way would involve:

  • Opening the alert in the Splunk platform.
  • Logging into Confluence to find the "StockSavvy Production Runbook."
  • Reading the runbook to figure out which Splunk queries to run.
  • Switching to the Splunk platform to start your investigation.

Let's look at a better way using a chat application integrated with both Atlassian and the Splunk platform via the Model Context Protocol (MCP).

1. Add the Atlassian MCP to your LLM

If you haven't already, follow the configuration guidance in Leveraging Splunk MCP and AI for enhanced IT operations and security investigations. Then, to add the Atlassian MCP to a desktop LLM application that supports MCP, add the following JSON to your configuration file:

{
  "mcpServers": {
    "atlassian": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.atlassian.com/v1/sse"]
    }
  }
}
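If you already configured the Splunk MCP server by following the guidance linked above, both servers sit side by side under the same mcpServers object in this file. Here's a minimal sketch of the combined configuration; the splunk entry below is only a placeholder, so substitute the exact command and arguments from that article:

{
  "mcpServers": {
    "splunk": {
      "command": "<command from the Splunk MCP configuration guidance>",
      "args": ["<arguments from the Splunk MCP configuration guidance>"]
    },
    "atlassian": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.atlassian.com/v1/sse"]
    }
  }
}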

The first time your LLM calls an Atlassian tool, it will route you through a one-time authentication to ensure you have the correct permissions.

2. Prompt the LLM

To ensure the LLM follows our official runbook, we provide this prompt:

I am a support engineer that needs to be certain the StockSavvy SaaS application runs well. I just received a "database connection pool exhaustion" alert. I believe there is a runbook hosted in Confluence that outlines the procedure to diagnose and resolve it. Assuming that exists, please review it, and then use what you learn there to conduct an initial investigation using Splunk. If you have questions or get confused, stop and ask me for guidance.

Notice that we didn't prescribe specific steps to the LLM, but we did include key context hints to guide it toward our MCP servers. Mentioning Confluence suggests that it should call tools from the Atlassian MCP, and mentioning the Splunk platform does the same for the Splunk MCP. The prompt also includes key terms, such as the name of the alert and the type of page to find in Confluence (a runbook).

3. Search Confluence for the runbook

The LLM uses the Atlassian MCP to find and read the runbook page. It then synthesizes that information into an actionable plan.

The runbook provides clear steps for diagnosing database connection pool exhaustion. I'll use Splunk to investigate, starting by searching for symptoms like "503 Service Unavailable" or "Too many connections" errors. First, I'll get information about the Splunk instance.

The Atlassian tools that it uses to achieve this include:

  • getAccessibleAtlassianResources: An initial tool call that returns instance information used in subsequent calls.
  • searchConfluenceUsingCql: The tool searches for pages containing the phrase "database connection pool exhaustion" and finds the runbook page (see the sketch after this list).
  • getConfluencePage: The tool fetches the page, finds the information relevant to our specific alert, reports back, and pivots to the Splunk platform.
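Under the hood, each of these is a standard MCP tools/call request from the chat application to the Atlassian MCP server. As a rough sketch, the CQL search might look something like the following; the exact argument names (such as cloudId and cql) depend on the Atlassian MCP's tool schema, so treat this as illustrative rather than a captured payload:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/call",
  "params": {
    "name": "searchConfluenceUsingCql",
    "arguments": {
      "cloudId": "<cloud ID returned by getAccessibleAtlassianResources>",
      "cql": "type = page AND text ~ \"database connection pool exhaustion\""
    }
  }
}

Your chat application builds and sends these requests for you; the value of seeing the shape is that it makes clear why the instance lookup has to happen before the search.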

4. Use the Splunk platform to identify root cause

This is where the power of integrating multiple tools becomes clear. The LLM doesn't just read back the procedure it found in Confluence; it also begins an investigation by calling tools from the Splunk MCP.

Here's a play-by-play of the LLM's thought process:

  • Run initial search: Based on the Confluence runbook, the LLM knows it should look for common web errors. It makes its first tool call, run_splunk_query, looking for errors such as 503 Service Unavailable or Too many connections.
  • Discover available data: It runs a new tool, get_indexes, to see which Splunk indexes are available. This is the same discovery an analyst would run to understand the data landscape. It finds the stocksavvy index, then runs get_metadata to learn which source types are associated with the index.
  • Adapt when a query fails: The initial queries against the source type come back empty; simple keyword searches aren't effective. The LLM reasons that the errors mentioned might show up as spikes in connection requests, so it tries to generate a timechart (sketched after the workflow summary below). When that also fails, it expands the time window and begins to see results. On one particular day, there is a large spike in AUTH_FAILURE errors associated with the database.
  • Dig deeper: After finding the failures, it looks at a different source type and finds more information that supports its hypothesis.

The LLM demonstrates an intelligent, adaptive workflow: Plan (Confluence) > Run (Splunk) > Adapt (Get more Splunk data) > Re-run (Splunk).
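To make the Adapt step concrete, here's a rough sketch of what the adapted run_splunk_query call might look like once the LLM switches from keyword matching to a timechart over a wider time window. The argument name and the SPL itself are illustrative assumptions (your source types and field names will differ), not payloads captured from the Splunk MCP:

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "run_splunk_query",
    "arguments": {
      "query": "search index=stocksavvy sourcetype=database_audit earliest=-30d | timechart span=1d count by event_type"
    }
  }
}

In a result set shaped like this, a single day with an outsized AUTH_FAILURE count stands out immediately, which is exactly the kind of signal the LLM latched onto here.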

5. Explain findings and present a resolution plan

After gathering this information, the LLM synthesizes it into a summary report and resolution plan that the analyst can follow or ask the LLM to run.

This empowers a junior analyst to follow an expert-level investigative process, all guided by the LLM.

The final report is comprehensive and precise:

I have completed the initial investigation based on the Confluence runbook and my analysis of the Splunk logs. Here is a summary of my findings and recommended next steps.

Investigation Summary
My investigation of the Splunk logs has revealed strong evidence that the order-service is the primary cause of the "database connection pool exhaustion" alert.

Here's a timeline of the events I uncovered from the logs on July 9th:

  1. High Latency: The recommendation-service began experiencing high latency, with P99 latency breaching the 500ms SLO. This was likely an early symptom of the downstream database issues.
  2. Connection Timeouts: The order-service started logging "Failed to get a connection from the pool" and "Timeout waiting for a connection from the pool" errors. This is the "smoking gun" that directly points to connection pool exhaustion.
  3. Authentication Failures: Simultaneously, the database_audit logs show a spike in "AUTH_FAILURE" events. These are likely a direct result of the connection pool being exhausted, causing the database to reject new connection attempts.
  4. Order Processing Failures: The order-service began logging messages indicating that it was failing to process orders and adding them to a retry queue.
  5. High Queue Depth: The queue-monitoring-service logged a warning that the order_retry queue depth was high, confirming that the order-service was unable to keep up with its workload.

Next Steps from the Runbook
Based on the Confluence runbook, here are the recommended next steps to resolve this issue:

  1. Restart the order-service: This will release any stale connections and should provide immediate relief.
  2. Temporarily increase the database connection pool limit: This will provide more headroom and prevent the issue from recurring in the short term.
  3. Investigate and fix the root cause: This is a long-term solution that will require a deeper dive into the order-service's code. The investigation should focus on:
    • Optimizing database queries: Look for inefficient queries that may be holding connections for too long.
    • Implementing connection pooling: If not already in use, a connection pooling library (like pgbouncer, as mentioned in the runbook) should be implemented to manage connections more efficiently.

I have completed my initial investigation and am ready for your guidance on how to proceed. Please let me know if you would like me to take any of the actions listed above, or if you have any other questions.

Next steps

Now that you've completed this use case, check out Leveraging LLM reasoning and ML capabilities for alert investigations to see how you can use machine learning capabilities within LLMs for faster investigations and resolutions.