Monitoring Cisco network devices using gRPC


gRPC is a widely supported protocol for streaming data from network devices such as the Cisco Nexus 9000 series running NX-OS. This protocol allows you to stream various types of metric data, depending on your device configuration. gRPC efficiently handles large-scale data transmission, including operation requests and telemetry. Additionally, gRPC with gNMI supports network device manipulations, offering a modern alternative to older protocols like NETCONF or RESTCONF.

Design considerations for gRPC

Collecting metric data via gRPC can generate large volumes of granular information. To manage this volume, collect and prepare the data with an intermediary before it is ingested into the Splunk platform. Telegraf can be configured to prepare and send this data to the Splunk platform, enabling correlation with other machine data for use cases in security, infrastructure, or service monitoring.

Cisco devices use the YANG data model to define and expose metrics and events, so data modeling in the Splunk platform can be performed on the remote procedure calls that invoke network elements. This model is also used with the NETCONF protocol and is familiar to many network administrators.

For more information on gRPC and related models, refer to the resources listed under Next steps at the end of this article.

Implementing gRPC

This article outlines the steps to configure Cisco devices, Telegraf, and the Splunk platform for gRPC data ingestion. You'll follow these high-level steps:

Configure gRPC on Cisco devices.
Configure a Telegraf instance to input Cisco telemetry and connect to the Splunk platform via either the HTTP Event Collector or the Universal Forwarder.
Use the Splunk platform to correlate data.

Step 1: Configure gRPC on Cisco devices

On Cisco Nexus network devices, you configure gRPC by enabling dial-out (streaming) telemetry, which is available through the OpenConfig support in the Cisco configuration. A device that supports "Native" mode can accept pull requests for data (dial-in), while OpenConfig enables data streaming (dial-out). After you create your Telegraf server component with the appropriate input, configure the Cisco devices to send data to that Telegraf input.

Determine whether your device has OpenConfig support using the Cisco Feature Navigator tool, as shown in the following screenshot. This example uses the dropdown menus to look at a Cisco 9000 router running IOS version 7.11.2, where you can see the support details for OpenConfig.
If OpenConfig is not installed already, download the OpenConfig package from the Cisco repository. For more details on ensuring OpenConfig is installed and functional, see this guidance.

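The OpenConfig package contains an RPM file with a name like “mtx-openconfig-all-2.0.0.0-10.5.1.lib32_64_n9000.rpm”. Log in to the command line of your device, install the RPM file, and verify the installation. The following example was captured on an NX-OS device and uses an earlier package version: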

n9300v-telemetry# install add mtx-openconfig-all-1.0.0.182-9.3.5.lib32_n9000.rpm activate 
Adding the patch (/mtx-openconfig-all-1.0.0.182-9.3.5.lib32_n9000.rpm)
[####################] 100%
Install operation 1 completed successfully 

Activating the patch (/mtx-openconfig-all-1.0.0.182-9.3.5.lib32_n9000.rpm)
[####################] 100%
Install operation 2 completed successfully
n9300v-telemetry# show version
Active Package(s):
 mtx-openconfig-all-1.0.0.182-9.3.5.lib32_n9000
n9300v-telemetry#

After installing the package on all of the relevant Cisco devices, use show commands to view the gRPC configuration and confirm that the gRPC feature is enabled:

n9300v-telemetry# show run grpc
!Command: show running-config grpc
!No configuration change since last restart
!Time: Tue Jul 14 16:56:37 2020
version 9.3(5) Bios:version 
feature grpc
grpc gnmi max-concurrent-calls 16
grpc use-vrf default
grpc certificate gnmicert
n9300v-telemetry#
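
If you need to check many devices, the output above can be inspected with a small script. The following Python sketch is only an illustration: the function name parse_grpc_config is hypothetical, and it assumes you have already captured the text of show run grpc by some other means (it does not connect to a device).

```python
def parse_grpc_config(show_output):
    """Extract gRPC settings from the text of 'show run grpc'."""
    settings = {"feature_enabled": False}
    for line in show_output.splitlines():
        line = line.strip()
        if line == "feature grpc":
            settings["feature_enabled"] = True
        elif line.startswith("grpc "):
            # e.g. "grpc use-vrf default" -> key "use-vrf", value "default"
            parts = line.split()
            key = " ".join(parts[1:-1])
            settings[key] = parts[-1]
    return settings


# Configuration lines copied from the example output above.
sample = """\
feature grpc
grpc gnmi max-concurrent-calls 16
grpc use-vrf default
grpc certificate gnmicert
"""

config = parse_grpc_config(sample)
print(config)
```

A device whose result has feature_enabled set to False would need the gRPC feature enabled before Telegraf can receive data from it.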

Step 2: Configure Telegraf to input Cisco telemetry and connect to the Splunk platform over HTTP Event Collector (HEC) or Universal Forwarder (UF)

Telegraf is an open source server agent that is used for collecting metrics and events from your Cisco devices. Compiled with the appropriate plugins, Telegraf can subscribe to a Cisco device that will stream telemetry data over gRPC.

First, you'll need to install Telegraf on your server or use an existing instance. Follow the Telegraf installation guide to do this. In brief, the steps you'll perform are:

Download Telegraf to your server instance.
Generate a custom configuration file that includes the appropriate inputs configuration from Cisco devices, and outputs configuration to send the data to the Splunk platform.
Compile the Telegraf binary that is then installed on the server instance.

Within these installation steps, you'll need to use the inputs configuration for Cisco devices, shown in the following screenshot. This configuration has many settings to configure the type of data that is supported, and how it is handled by Telegraf:

[Screenshot: Telegraf inputs configuration for Cisco devices]

Here is a portion of the telegraf.conf file showing the configuration of inputs to gather useful metrics:

[global_tags]
[agent]
 interval = "30s"
 round_interval = true
 metric_batch_size = 1000
 metric_buffer_limit = 10000
 collection_jitter = "0s"
 flush_interval = "30s"
 flush_jitter = "0s"
 precision = ""
 hostname = "cisco"
 omit_hostname = true

[[outputs.http]]
 url = "http://198.18.133.23:8088/services/collector"
 data_format = "splunkmetric"
 splunkmetric_hec_routing = true
 [outputs.http.headers]
  Content-Type = "application/json"
  Authorization = "Splunk abcd1234"
[[outputs.file]]
 files = ["stdout", "/tmp/gnmi.out"]
 rotation_max_size = "5MB"
 rotation_max_archives = 3
 data_format = "json"

[[inputs.gnmi]]
 addresses = ["xr-1:57400","xr-2:57400","xr-5:57400","xr-6:57400","xr-7:57400","xr-8:57400"]
 ## define credentials
 username = "cisco"
 password = "cisco123"
 ## redial in case of failures after
 redial = "10s"

#*###########################################################################
#* IF-MIB TO GNMI
#*##########################################################################
 [[inputs.gnmi.subscription]]
  name = "infra-statistics"                                  
  origin = "Cisco-IOS-XR-infra-statsd-oper"
  path = "infra-statistics/interfaces/interface/latest/generic-counters"
  subscription_mode = "sample"
  sample_interval = "30s"
 [[inputs.gnmi.subscription]]
  name = "if-statistics"
  origin = "Cisco-IOS-XR-pfi-im-cmd-oper"
  path = "interfaces/interface-xr/interface/interface-statistics/full-interface-stats/"
  subscription_mode = "sample"
  sample_interval = "30s"
#*###########################################################################
#* CISCO PROCESS MIB TO GNMI
#*###########################################################################
 [[inputs.gnmi.subscription]]
  name = "CPUMetricData"
  origin = "Cisco-IOS-XR-wdsysmon-fd-oper"
  path = "system-monitoring/cpu-utilization/total-cpu-one-minute"
  subscription_mode = "sample"
  sample_interval = "30s"
 [[inputs.gnmi.subscription]]
  name = "CPUMetricData"
  origin = "Cisco-IOS-XR-wdsysmon-fd-oper"
  path = "system-monitoring/cpu-utilization/total-cpu-five-minute"
  subscription_mode = "sample"
  sample_interval = "30s"
 [[inputs.gnmi.subscription]]
  name = "CPUMetricData"
  origin = "Cisco-IOS-XR-wdsysmon-fd-oper"
  path = "system-monitoring/cpu-utilization/total-cpu-fifteen-minute"
  subscription_mode = "sample"
  sample_interval = "30s"
#*###########################################################################
#* BGP4-MIB / CISCO-BGP4-MIB TO GNMI
#*###########################################################################
 [[inputs.gnmi.subscription]]
  name = "instanceSpecificBGPData-update-messages-in"
  origin = "Cisco-IOS-XR-ipv4-bgp-oper"
  path = "bgp/instances/instance/instance-active/default-vrf/afs/af/neighbor-af-table/neighbor/update-messages-in"
  subscription_mode = "sample"
  sample_interval = "30s"
 [[inputs.gnmi.subscription]]
  name = "instanceSpecificBGPData-update-messages-out"
  origin = "Cisco-IOS-XR-ipv4-bgp-oper"
  path = "bgp/instances/instance/instance-active/default-vrf/afs/af/neighbor-af-table/neighbor/update-messages-out"
  subscription_mode = "sample"
  sample_interval = "30s"
 [[inputs.gnmi.subscription]]
  name = "instanceSpecificBGPData-connection-established-time"
  origin = "Cisco-IOS-XR-ipv4-bgp-oper"
  path = "bgp/instances/instance/instance-active/default-vrf/afs/af/neighbor-af-table/neighbor/connection-established-time"
  subscription_mode = "sample"
  sample_interval = "30s"
 [[inputs.gnmi.subscription]]
  name = "instanceSpecificBGPData-prefixes-advertised"
  origin = "Cisco-IOS-XR-ipv4-bgp-oper"
  path = "bgp/instances/instance/instance-active/default-vrf/afs/af/neighbor-af-table/neighbor/af-data/prefixes-advertised"
  subscription_mode = "sample"
  sample_interval = "30s"
 [[inputs.gnmi.subscription]]
  name = "instanceSpecificBGPData-prefixes-accepted"
  origin = "Cisco-IOS-XR-ipv4-bgp-oper"
  path = "bgp/instances/instance/instance-active/default-vrf/afs/af/neighbor-af-table/neighbor/af-data/prefixes-accepted"
  subscription_mode = "sample"
  sample_interval = "30s"
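
The subscription stanzas above repeat the same structure with only the name and path changing, so they can be generated rather than typed by hand. The following Python sketch is one possible way to do that; render_subscription is a hypothetical helper, not part of Telegraf, and it reuses the CPU origin and paths from the configuration above.

```python
def render_subscription(name, origin, path, interval="30s"):
    """Return one [[inputs.gnmi.subscription]] stanza as TOML text."""
    return "\n".join([
        " [[inputs.gnmi.subscription]]",
        f'  name = "{name}"',
        f'  origin = "{origin}"',
        f'  path = "{path}"',
        '  subscription_mode = "sample"',
        f'  sample_interval = "{interval}"',
    ])


# Values taken from the CPU section of the telegraf.conf example above.
CPU_ORIGIN = "Cisco-IOS-XR-wdsysmon-fd-oper"
CPU_BASE = "system-monitoring/cpu-utilization"
windows = ["one", "five", "fifteen"]

stanzas = [
    render_subscription("CPUMetricData", CPU_ORIGIN, f"{CPU_BASE}/total-cpu-{w}-minute")
    for w in windows
]
print("\n".join(stanzas))
```

The printed text can be pasted under the [[inputs.gnmi]] section. Generating stanzas this way keeps the sample interval consistent across subscriptions.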

At this stage, you'll need to decide whether to send metrics to your Splunk platform instance using the HTTP Event Collector (HEC), or using the Universal Forwarder.

Option 1: Using HEC

Use the outputs configuration for HTTP to send metrics to your Splunk platform instance over the HTTP Event Collector, as shown in the following screenshot:

[Screenshot: Telegraf HTTP outputs configuration for the HTTP Event Collector]

Here is an example telegraf.conf outputs configuration file that sends this data to the Splunk platform over HEC:

[global_tags]
 # dc = "us-east-1" # will tag all metrics with dc=us-east-1
 # rack = "1a"
 ## Environment variables can be used as tags, and throughout the config
 #user = "telegraf"
 index = "main"

[agent]
 interval = "30s"
 round_interval = true
 metric_batch_size = 1000
 metric_buffer_limit = 10000
 collection_jitter = "0s"
 flush_interval = "10s"
 flush_jitter = "0s"
 precision = ""
 debug = false
 quiet = false
 logtarget = "file"
 logfile = "/var/log/telegraf/telegraf.log"
 logfile_rotation_interval = "0d"
 logfile_rotation_max_size = "1MB"
 logfile_rotation_max_archives = 5
 hostname = ""
 omit_hostname = false

[[outputs.http]]
  ## URL is the address to send metrics to
  url = "https://my-splunk-instance:8088/services/collector"

  ## HTTP method, one of: "POST" or "PUT"
  method = "POST"
 
  ## Skip TLS certificate verification. Set to true in development environments only.
  insecure_skip_verify = false

  data_format = "splunkmetric"
  splunkmetric_hec_routing = true

  ## Additional HTTP headers
  [outputs.http.headers]
   Content-Type = "application/json"
   Authorization = "Splunk use-your-splunk-token"
   X-Splunk-Request-Channel = "use-your-splunk-token"
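
To make the HEC output easier to reason about, the following Python sketch builds a single metric event by hand and prepares the HTTP request that would carry it. It is a minimal illustration under stated assumptions: the payload follows the single-metric HEC layout (event set to "metric", with metric_name and _value under fields), the function names are hypothetical, and the interface name is an invented example value. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_hec_metric(name, value, timestamp, host, dimensions=None):
    """Build one metric event in the single-metric HEC JSON layout."""
    fields = {"metric_name": name, "_value": value}
    fields.update(dimensions or {})
    return {"time": timestamp, "event": "metric", "host": host, "fields": fields}


def build_hec_request(url, token, payload):
    """Prepare (but do not send) a POST request to the HEC endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Splunk {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


payload = build_hec_metric(
    "infra-statistics.packets_received",
    1234,
    1724408053.045,
    "xr-1",
    {"interface_name": "GigabitEthernet0/0/0/0"},  # hypothetical dimension value
)
request = build_hec_request(
    "https://my-splunk-instance:8088/services/collector",
    "use-your-splunk-token",
    payload,
)
# urllib.request.urlopen(request) would send the event; it is omitted here
# because the URL and token are placeholders.
print(json.dumps(payload, indent=2))
```

Sending a hand-built event like this can help confirm that the token and index are set up correctly before you point Telegraf at the endpoint.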

Option 2: Using the Universal Forwarder

You can configure your Telegraf instance to log the data locally then use the Universal Forwarder (UF) to forward that data to the Splunk platform using traditional UF configuration principles, in the same way you would for other log monitoring.

To do this, use the Telegraf file output plugin. Configure the local file logging output like this:

# Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["/tmp/metrics.out"]
  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "splunkmetric"
  splunkmetric_hec_routing = false

Configure a props.conf file in the same way you would configure one for other logs you might be gathering with the UF to ready the data for ingestion:

[telegraf]
category = Metrics
description = Telegraf Metrics
pulldown_type = 1
DATETIME_CONFIG =
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = true
disabled = false
INDEXED_EXTRACTIONS = json
KV_MODE = none
TIMESTAMP_FIELDS = time
TIME_FORMAT = %s.%3N
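
Before forwarding the file, it can be useful to confirm that each line is valid JSON with an epoch timestamp in the time field, since the props.conf above relies on both. The following Python sketch shows one way to check a line. It assumes the flat JSON layout that the splunkmetric format writes when HEC routing is off; the sample line, including its metric name and source value, is made up for illustration.

```python
import json
from datetime import datetime, timezone


def parse_metric_line(line):
    """Split one JSON metric line into its UTC timestamp and remaining fields."""
    record = json.loads(line)
    # The "time" field is epoch seconds with a fractional part (%s.%3N).
    epoch = float(record.pop("time"))
    when = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return when, record


# Hypothetical line in the shape expected in /tmp/metrics.out.
line = (
    '{"_value": 42.0, "metric_name": "CPUMetricData.total_cpu_one_minute", '
    '"source": "xr-1", "time": 1724408053.045}'
)

when, fields = parse_metric_line(line)
print(when.isoformat(), fields)
```

A line that fails to parse here is likely to fail timestamp or field extraction in the Splunk platform as well.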

Step 3: Use the Splunk platform to view gRPC data

After your network devices are configured to send data to Telegraf and then to the Splunk platform, you can correlate this data with other important parts of your indexed machine data to allow you to make better decisions and drive actions in your environment.

You can use Search Processing Language (SPL) to analyze the data. Adjust the following SPL to fit your environment:

| mpreview index="metrics_data"
| search "metric_name:infra-statistics.packets_received"=*

The following screenshot shows a search on the data gathered from the telegraf.conf file shown previously:

[Screenshot: search results for the data gathered by the telegraf.conf file]

After running this search and expanding an event containing its metric, you'll see this view:

[Screenshot: expanded event showing its metric]

With the metric data residing in a Splunk platform metrics index, you can run SPL queries on this data. The following query calculates the average number of packets received across all your devices over a one-minute interval, allowing you to monitor network performance:

| mstats avg("infra-statistics.packets_received") AS "packets_received" WHERE index="metrics_data" span=1m
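
To clarify what the one-minute span does, the following Python sketch reproduces the same idea outside the Splunk platform: it groups samples from all devices into 60-second buckets and averages each bucket. It is an illustration of the calculation only, not how mstats is implemented, and the sample points are invented.

```python
from collections import defaultdict


def average_per_span(points, span_seconds=60):
    """Average values from all sources within each fixed time span."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for _source, ts, value in points:
        bucket = int(ts // span_seconds) * span_seconds
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {bucket: total / count for bucket, (total, count) in sorted(sums.items())}


# Hypothetical (source, epoch seconds, packets_received) samples.
points = [
    ("xr-1", 0, 100),
    ("xr-2", 30, 300),
    ("xr-1", 60, 200),
    ("xr-2", 90, 400),
]
print(average_per_span(points))
```

Because the source is ignored when bucketing, the result is a single averaged series across devices, which matches a query that has no BY clause.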

This results in a table view:

[Screenshot: table view of average packets received per minute]

You can switch to a Visualization view to visualize these metrics in dashboards to gain insights into your network's performance:

[Screenshot: Visualization view of the packets received metric]

The following query retrieves the packets received as before, but as a per-second rate, rounded to two decimal places so the values are easier to read and act on. It also groups the results by source and interface name, which helps identify where in your Cisco infrastructure the traffic arrives and from which source:

| mstats rate_avg("infra-statistics.packets_received") AS "packets_received/s" WHERE index="metrics_data" BY source, interface_name span=1m
| eval "packets_received/s" = round('packets_received/s', 2)
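
Interface counters only ever increase, which is why the query works with a rate rather than the raw value. The following Python sketch approximates the calculation for each source and interface pair: within every span it takes the latest counter value minus the earliest and divides by the elapsed seconds, then rounds to two decimal places. This is a simplified reading of the rate semantics, stated as an assumption, and the sample data and interface names are invented.

```python
from collections import defaultdict


def rate_per_series(points, span_seconds=60):
    """Approximate the per-second counter rate for each (series, span) bucket."""
    grouped = defaultdict(list)
    for series, ts, value in points:
        bucket = int(ts // span_seconds) * span_seconds
        grouped[(series, bucket)].append((ts, value))

    rates = {}
    for key, samples in grouped.items():
        samples.sort()
        (t0, v0), (t1, v1) = samples[0], samples[-1]
        if t1 > t0:  # a rate needs two samples taken at different times
            rates[key] = round((v1 - v0) / (t1 - t0), 2)
    return rates


# Hypothetical ((source, interface_name), epoch seconds, counter) samples.
points = [
    (("xr-1", "Gi0/0/0/0"), 0, 1000),
    (("xr-1", "Gi0/0/0/0"), 30, 1100),
    (("xr-2", "Gi0/0/0/1"), 0, 500),
    (("xr-2", "Gi0/0/0/1"), 30, 1250),
]
print(rate_per_series(points))
```

With a 30-second sample interval, each one-minute span holds two samples per series, which is the minimum needed to compute a rate in this sketch.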

This results in this table view:

[Screenshot: table view of packets received per second by source and interface name]

As well as developing your own SPL queries, you can navigate to the Analytics tab of your Splunk platform instance and use its interface to view and visualize this metric data. In the Analytics tab, identify the index that contains the metric data, then select from the data present in that index. You don't have to know the metric names beforehand.

[Screenshot: Analytics tab with the metrics index selected]

After selecting the index and dimensions, you are shown all of the possible metrics and can select what you want to visualize:

[Screenshot: list of available metrics for the selected index and dimensions]

You are then shown that metric data in a timeseries format without having to develop your own SPL:

[Screenshot: selected metric displayed as a time series]

Next steps

These resources might help you understand and implement this guidance:

Cisco Blogs: Which YANG model to use
Cisco Docs: gNMI configuration guide
Cisco White Paper: Cisco Nexus 9000 white paper
GitHub: gNMI specification