
Deleting data from an index

 

You have data on a Splunk indexer that you want to get rid of. Some of it contains personally identifiable information and some contains passwords.

You recently learned that the delete command doesn’t really delete data. The delete command actually marks specific data to be omitted from search results. The data is still on disk; it just won’t show up in results.

This isn't good enough. You want the data really deleted, without a single trace left behind on disk. You might be tempted to go onto the system and delete the data yourself. Don't do this. This will cause problems, especially if you delete specific events.

So what can you do?

Solution

The best way to get rid of data is to age it out. 

The collect command is usually used to write to a summary index. Instead, we’re going to use it to copy events over to a new index.

The key thing to remember is that we are using output_format=hec in order to preserve the source, source type, and host. If you use the default, the source type of every copied event becomes stash, and that’s not very useful.
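As a rough sketch (the index names, sourcetype, and time range below are placeholders, not part of the walk-through), the copy search looks something like this:

index=old_index sourcetype=my_sourcetype earliest=-30d@d latest=-7d@d
| collect index=new_index output_format=hec testmode=true

With testmode=true, nothing is written; collect only shows you what it would send, so you can safely iterate on the filters and time boundaries before committing.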

 

  1. This process is search powered. If one of the indexers has issues returning data or the search errors out, you might end up missing some data in the new index.
  2. This works well for smaller indexes, but you will not be moving multiple TBs of data with a single search. For one thing, you will hit role-based search quotas, run the dispatch directory out of disk space, and so on. For large datasets, you might need to run multiple searches over bounded time ranges. Changing the index will also impact searches, dashboards, and more.
  3. Some customers did a double move: copy old_index to new_index, purge the data from old_index, then move the data from new_index back to old_index. As you can imagine, this takes time.

 

These steps are for reference only. Each customer environment is different and has its own unique requirements.

Step 1 - Prepare

Questions to ask before going through the remaining steps:

  • Do I have data actively being written to this index?
    • If yes, we need to make sure that the incoming data is properly timestamped so that it doesn’t accidentally get caught up in the move.
    • The more specific the search used to copy the data, the better off you are. Setting time boundaries is one example of specificity.
  • Is it easier to pause ingest or to point it to a new index?
    • This can be accomplished in a couple of different ways: for example, using a deployment server to update the inputs, or creating a custom app that rewrites the index on incoming data (see the sketch after this list). For more information on how to do this, see Route and filter data.
  • Do I have a lot of dashboards or searches depending on the index?
    • This is a bit trickier. There would be a time period where these searches and dashboards would fail (about 24 - 48 hours) as we wait for the data to roll off.
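If you go the custom app route, a minimal sketch of the parsing-time configuration looks like the following. The stanza names, the sourcetype, and the destination index are hypothetical placeholders; adjust them for your environment, and see Route and filter data for the full details.

props.conf:

[my_sourcetype]
TRANSFORMS-route_to_new_index = route_to_new_index

transforms.conf:

[route_to_new_index]
# match every event for this sourcetype and rewrite its destination index
REGEX = .
DEST_KEY = _MetaData:Index
FORMAT = new_index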

Step 2 - Search

The best search defines what needs to be in the final index, excluding all the bad stuff you don’t want in it.

Make sure you get result counts from before and after, just to be safe.

For example, in my walk-through I used the following search to get a baseline:

index=bosley_test 
| stats count AS surveys BY CSAT_Score

Then I ran the following search:

index=bosley_test CSAT_Score>=4 
| collect testmode=true output_format=hec index=bosley_new_index

Alternatively, you could just exclude what you don’t want:

index=bosley_test NOT CSAT_Score<=3
| collect testmode=true output_format=hec index=bosley_new_index

Step 3 - Run

  1. Get the counts from before the move and make sure you know how many events should end up in the new index.
  2. Ensure that the new or temp index has been created.
  3. Depending on how you answered the questions in Step 1, stop the flow of data to that index, and point it to the new index.
  4. Update any searches and dashboards to use the new index.
  5. Take the search you created ahead of time and change the collect command from testmode=true to testmode=false.
  6. Run the search.
  7. When it is done, start validating what is in the new index. Make sure that the bad data isn’t there and that the counts match what you expected (for example, with the searches shown after this list). Ensure that the source, source type, and host are the same.
  8. Set the retention on the old index to one day.
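For example, re-running the baseline search from Step 2 against the new index lets you compare the counts directly, and a quick split by the metadata fields confirms that the source, source type, and host survived the move (these searches use the walk-through’s index names):

index=bosley_new_index
| stats count AS surveys BY CSAT_Score

index=bosley_new_index
| stats count BY sourcetype, source, host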

Step 4 - Wait

Wait until the retention period has passed.
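How you set that retention depends on your environment. In Splunk Cloud Platform it is done on the Indexes page; on Splunk Enterprise, one day of retention on the old index can be expressed in indexes.conf roughly like this (the stanza name is the walk-through’s old index):

indexes.conf:

[bosley_test]
# one day, in seconds; buckets older than this are frozen, which means deleted unless an archive destination is configured
frozenTimePeriodInSecs = 86400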

Step 5 - Clean Up

After the data has been aged out of the index, you can delete the old index or, using Step 3 and a modified collect search, you can move the cleaned data from the new index back to the old index.

Keep in mind that you’ll have to undo any modifications you made to ingest or to searches and dashboards.

Exporting data

The process described above has another valuable use case. Imagine this: auditors have shown up. They want an export of all the data about a specific set of transactions, and you have to export it outside your cloud environment.

Instead of wasting time with retention on the old index, create a temp index. Write the appropriate collect search to get the data you want. Send it to the temp index using the output_format=hec setting.
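A sketch of that search might look like the following, where temp_export_index and the transaction_id values are hypothetical stand-ins for whatever the auditors asked for; flip testmode to false for the real run:

index=bosley_test transaction_id IN (1001, 1002, 1003)
| collect index=temp_export_index output_format=hec testmode=true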

This is where the magic happens:  

Set up the temp index to write to a DDSS (Dynamic Data Self Storage) bucket. Set the retention to one day. Wait, again.

The data will be moved to the DDSS bucket, and can then be imported into an on-premises Splunk instance and exported in any format the auditors desire.
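As a rough illustration of the on-premises side (the bucket name and paths here are illustrative only), the exported buckets are typically copied into the destination index’s thaweddb directory and rebuilt so they become searchable:

# copy an exported DDSS bucket into the thawed path of a local index
cp -r db_1548702116_1548702116_1001 $SPLUNK_HOME/var/lib/splunk/bosley_test/thaweddb/
# rebuild the bucket so the on-premises instance can search and export it
$SPLUNK_HOME/bin/splunk rebuild $SPLUNK_HOME/var/lib/splunk/bosley_test/thaweddb/db_1548702116_1548702116_1001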

Next steps

These additional Splunk resources might help you understand and implement this product tip: