
Defining data retention policies

 

Defining clear data retention policies is essential for complying with regulations and meeting specific business needs. This section guides you through crafting effective data retention policies in Splunk Enterprise or Splunk Cloud Platform.

The following steps help you strike a balance between data availability, compliance requirements, and storage efficiency.

  1. Understanding compliance requirements and business needs
  2. Categorizing data types
  3. Determining appropriate retention periods
  4. Setting bucket roll behavior
  5. Implementing automatic data archival
  6. Testing and validating
  7. Communicating and documenting

Understanding compliance requirements and business needs

Begin by researching the relevant industry regulations and legal obligations that apply to your organization. Here are the steps you should take:

  1. Identify Applicable Regulations: The first step is to identify the relevant regulations that apply to your industry and region. Depending on your organization's sector, you might be subject to specific data protection laws, such as GDPR (General Data Protection Regulation) in the European Union, HIPAA (Health Insurance Portability and Accountability Act) in the healthcare industry, or PCI DSS (Payment Card Industry Data Security Standard) for credit card processing. Understand the specific requirements and data retention obligations outlined in these regulations.
  2. Involve Key Stakeholders: Collaborate with key stakeholders from relevant departments, including legal, compliance, IT, security, finance, and data governance. Each department will have its own unique data needs and retention requirements. Engaging these stakeholders early on will ensure that the data retention policies align with both regulatory demands and your organization's overall objectives.
  3. Determine Sensitive Data: Identify and classify sensitive data elements within your organization. This might include personally identifiable information (PII), financial records, proprietary information, trade secrets, and any other data that requires special protection. Categorize these sensitive data types separately, as they might have more stringent retention requirements.
  4. Assess Data Usage Patterns: Understand how different data types are used within your organization. Some data might be accessed frequently for operational purposes, while other data might be required only for historical analysis or compliance audits. Analyzing data usage patterns will help you tailor retention policies to optimize data availability while minimizing storage costs. For example, the following search looks at audit data to analyze how different indexes are used in user-initiated searches. It filters out automated, system, and certain other types of searches to focus on actual index usage by users, and it sorts the results by the indexes that are searched most often. A second example, shown after this list, estimates how much data each index ingests per day.
    index=_audit action=search search=* info=completed NOT "search_id='scheduler" NOT "search='|history" NOT "user=splunk-system-user" NOT "search='typeahead" NOT "search='| metadata type=* | search totalCount>0"
    | rex field=search "index=(?P<search_index>[^ ]+)"
    | stats count by search_index
    | sort - count
  5. Define Business Needs: Work closely with business units to identify their specific data retention needs. For example, the marketing team might need customer data for a longer duration to analyze campaign effectiveness, while the HR department might have specific retention requirements for employee records. Understanding these business needs will ensure that data retention policies are practical and serve your organization's day-to-day operations effectively.
  6. Risk Assessment: Perform a risk assessment to identify potential data security and privacy risks associated with data retention. Consider the consequences of retaining data for extended periods, such as exposure to data breaches or unauthorized access. This assessment will help you strike a balance between retention requirements and data protection.
  7. Document Compliance Requirements and Business Needs: Document all the information gathered during this phase, including applicable regulations, stakeholders' inputs, sensitive data types, usage patterns, business needs, and risk assessment results. This documentation will serve as a foundation for developing comprehensive and well-informed data retention policies.
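
To complement the search-usage analysis in step 4, the following search estimates how many gigabytes each index ingests per day, which helps when weighing retention periods against storage costs. This is an illustrative example that assumes the license usage log in the _internal index is available on the instance where you run it (typically the license manager or monitoring console); adjust the time range as needed.

index=_internal source=*license_usage.log* type=Usage
| timechart span=1d sum(eval(b/1024/1024/1024)) as GB by idx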

Categorizing data types

Organize your data into distinct categories based on importance, sensitivity, and usage patterns. Typical data types include customer data, financial records, operational logs, and more. Assigning each data type to a specific retention category will make it easier to set retention periods later; the configuration sketch after this list shows one way to map categories to indexes. Common categories might include:

  • Critical Data: This category includes highly sensitive data, such as PII (Personally Identifiable Information), financial records, intellectual property, and other confidential information. Critical data often requires the longest retention periods to meet regulatory requirements and support legal compliance.
  • Operational Data: This category includes data essential for day-to-day operations, like logs, performance metrics, and system status information. Operational data might have shorter retention periods, as it is usually required for immediate troubleshooting and analysis.
  • Analytical Data: This category encompasses data used for long-term trend analysis, business intelligence, and reporting. The retention period for analytical data might vary based on your organization's specific needs and the insights derived from historical data.
  • Temporary Data: This category includes transient data that serves a short-term purpose, such as temporary caches, or temporary storage for intermediate results. Temporary data typically has the shortest retention periods, often measured in days or hours.
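
One practical way to reflect these categories in the Splunk platform is to route each category to its own index so that a different retention period can be applied to each. The following indexes.conf fragment is a minimal, illustrative sketch; the index names and retention values are hypothetical and should be replaced with ones that match your own categories and policies (the retention settings themselves are explained in the sections that follow).

# Hypothetical example: one index stanza per retention category
[critical_financial]
# Approximately 7 years, for regulated financial records
frozenTimePeriodInSecs = 220752000

[operational_logs]
# Approximately 90 days, for day-to-day troubleshooting data
frozenTimePeriodInSecs = 7776000

[temporary_scratch]
# 7 days, for short-lived intermediate data
frozenTimePeriodInSecs = 604800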

Determining appropriate retention periods and policies

After you categorize data types, assess the optimal retention period for each category. For instance, financial data might require more extended retention periods for compliance, while temporary logs might only need to be retained for a shorter duration. Strive to strike a balance between regulatory requirements and storage costs. Here are some different considerations for specific data categories:

  • Regulatory Requirements: Consider the data retention requirements mandated by relevant industry regulations and legal frameworks. Ensure that retention periods comply with these obligations to avoid potential penalties or legal consequences.
  • Business Needs: Refer back to the information gathered while assessing compliance requirements and business needs, specifically the stakeholder inputs and the risk assessment. Align the retention periods with the business needs and usage patterns identified earlier.
  • Data Usage Frequency: Analyze how frequently each data category is accessed and for what purposes. Frequent access might necessitate longer retention periods, while infrequently accessed data could have shorter retention periods.
  • Storage Cost Considerations: Longer retention periods can result in increased storage costs. Strive to strike a balance between regulatory compliance and managing storage expenses effectively. Additionally, implementing a tiered storage strategy for less frequently accessed data can also be a cost-effective way to manage large volumes of data.
  • Data Sensitivity: Highly sensitive data might require extended retention periods for forensic purposes, while less sensitive data might be aged out more quickly to minimize exposure to potential security risks.

Based on the assessment, create clear and well-defined data retention policies for each data category. Document the retention periods, the rationale behind each policy, and any exceptions or special considerations.

Setting bucket roll behavior

After you've defined your data retention periods, the next step is configuring bucket roll behavior in the Splunk platform. This ensures your data storage practices align with your retention strategies.

To determine when data rolls from one bucket stage to another, modify the maxTotalDataSizeMB, frozenTimePeriodInSecs, and maxVolumeDataSizeMB attributes in the indexes.conf file.

  • maxTotalDataSizeMB determines the maximum total size of an index, in megabytes, on a single indexer. When this limit is reached, the oldest buckets are rolled to frozen (archived or deleted, depending on your configuration) until the index is back under the limit. This parameter helps control the overall storage an index consumes, preventing it from becoming overloaded with data.
  • frozenTimePeriodInSecs sets the time period, in seconds, after which buckets roll to frozen. When a bucket is frozen, it is removed from the index: by default the data is deleted, or it is archived if you configure an archive location or script (see the next section). This parameter lets you age out data by time, so that older, rarely accessed data no longer consumes space in the hot, warm, and cold tiers.
  • maxVolumeDataSizeMB specifies the maximum size of a volume in megabytes. In the Splunk platform, volumes are logical groupings of storage capacity, and this parameter controls how much data can be stored on a single volume. When a volume reaches its maximum size, the oldest buckets on that volume are frozen to bring it back under the limit.

Volumes are key to managing data storage in the Splunk platform. You can configure data retention policies for different volumes based on the frequency of access and business requirements. For instance, you might create separate volumes for hot, warm, and cold data based on their usage patterns. This allows you to apply specific policies to each type of data, optimizing storage usage and access speed.

By leveraging different volumes and these three parameters, you can achieve a balanced approach to data retention and management in the Splunk platform. The maxTotalDataSizeMB and frozenTimePeriodInSecs parameters help control storage capacity and optimize cold data storage, while the concept of volumes enhances the efficiency of data storage and retrieval, ensuring that your Splunk environment remains performant and resource-efficient over time.
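
The following indexes.conf fragment is a minimal sketch of how these settings can be combined with volumes; the paths, index name, and values are hypothetical and should be adapted to your hardware and retention policies.

# Hypothetical volume for hot and warm buckets on fast storage
[volume:hotwarm]
path = /fast_storage/splunk
# Cap this volume at roughly 1 TB
maxVolumeDataSizeMB = 1000000

# Hypothetical volume for cold buckets on cheaper bulk storage
[volume:cold]
path = /bulk_storage/splunk
# Cap this volume at roughly 5 TB
maxVolumeDataSizeMB = 5000000

[my_operational_index]
homePath = volume:hotwarm/my_operational_index/db
coldPath = volume:cold/my_operational_index/colddb
# thawedPath cannot reference a volume
thawedPath = $SPLUNK_DB/my_operational_index/thaweddb
# Freeze (archive or delete) data older than 90 days
frozenTimePeriodInSecs = 7776000
# Cap the total size of this index at roughly 500 GB
maxTotalDataSizeMB = 500000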

Implementing automatic data archival

Use data lifecycle policies in the Splunk platform to automatically manage data retention and deletion. Splunk Enterprise and Splunk Cloud Platform allow you to specify the retention period for each data category and automatically delete (or archive) data once the specified duration has passed. This helps to maintain compliance and keep storage space in check, avoiding unnecessary data buildup.

To let the indexer handle data archiving automatically, you can use the coldToFrozenDir attribute in indexes.conf. This attribute specifies the location where frozen data is archived. Add the following stanza to $SPLUNK_HOME/etc/apps/<your_app>/local/indexes.conf:

[<index>]
coldToFrozenDir = <path to frozen archive>

Replace <index> with the index containing the data to archive and <path to frozen archive> with the directory where the archived buckets will be stored. Splunk Web also allows you to specify a frozen archive path when creating a new index.

If the coldToFrozenDir attribute is not specified in the indexes.conf configuration file, the default behavior in Splunk Enterprise or Splunk Cloud Platform is to delete frozen data from the index when data reaches the frozen state.

Specifying an archiving script

If you need more control over the archiving process or want to perform custom actions during archiving, use the coldToFrozenScript attribute in indexes.conf. This attribute allows you to specify a user-supplied script that the indexer will run just before erasing the frozen data from the index. The script could perform archiving, data transfer, or other actions as needed.

Add the following stanza to $SPLUNK_HOME/etc/apps/<your_app>/local/indexes.conf:

[<index>]
coldToFrozenScript = ["<path to program that runs script>"] "<path to script>"


Replace <index> with the index containing the data to archive, <path to script> with the path to your custom archiving script located in $SPLUNK_HOME/bin or its subdirectories, and <path to program that runs script> (optional) if your script requires a specific program to run it.

Example

[myindex]
coldToFrozenScript = "$SPLUNK_HOME/bin/python" "$SPLUNK_HOME/bin/myColdToFrozen.py"

You can read more in the Splunk documentation on archiving indexed data.
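
For reference, the sketch below shows one possible shape for such a script. The indexer invokes the script with the full path of the bucket being frozen as its only argument; everything else here (the archive destination, the index-name derivation, and the script name) is a hypothetical example that you should adapt and test for your own environment.

#!/usr/bin/env python3
# Hypothetical coldToFrozen script: copy the bucket being frozen to an archive
# location before the indexer removes it from the index.
# The indexer passes the bucket directory path as the script's only argument.
import os
import shutil
import sys

# Hypothetical archive destination; replace with your own storage location.
ARCHIVE_DIR = "/opt/splunk_frozen_archive"

def main():
    if len(sys.argv) != 2:
        sys.exit("usage: myColdToFrozen.py <bucket_directory>")
    bucket_dir = sys.argv[1].rstrip(os.sep)
    if not os.path.isdir(bucket_dir):
        sys.exit("not a directory: " + bucket_dir)
    # Derive the index name from the usual .../<index>/colddb/<bucket> layout
    # so that archived buckets from different indexes do not collide.
    index_name = os.path.basename(os.path.dirname(os.path.dirname(bucket_dir)))
    destination = os.path.join(ARCHIVE_DIR, index_name, os.path.basename(bucket_dir))
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.copytree(bucket_dir, destination)
    # A zero exit status tells the indexer the bucket was archived successfully;
    # a non-zero status indicates failure, so the indexer does not remove the bucket.
    sys.exit(0)

if __name__ == "__main__":
    main()

To use a script like this, place it in $SPLUNK_HOME/bin (or a subdirectory), make sure it is executable by the user running splunkd, and reference it with coldToFrozenScript as shown above.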

Managing archiving in clusters

Managing archiving in clusters requires careful planning to maintain data consistency and avoid conflicts. In an indexer cluster with data replication, enabling archiving on multiple peer nodes can produce multiple copies of the same archived data. If peer nodes archive to shared storage, make sure each node writes to its own directory to avoid name collisions.

Using the timePeriodInSecBeforeTsidxReduction parameter

The timePeriodInSecBeforeTsidxReduction parameter specifies how old a bucket must be, in seconds, before the Splunk platform reduces its tsidx files. Tsidx reduction replaces a bucket's full tsidx files (the index files that make searches fast) with much smaller versions, freeing disk space at the cost of slower searches against that bucket.

When to Use timePeriodInSecBeforeTsidxReduction:

  1. Disk Space Versus Performance: The decision to use timePeriodInSecBeforeTsidxReduction depends on your organization's priorities. If you want to free up disk space more quickly, you can reduce the timePeriodInSecBeforeTsidxReduction value. On the other hand, if performance is a higher concern, a longer time period might be preferred to avoid unnecessary tsidx reduction operations during periods of high search activity.
  2. Predictable Search Patterns: Consider your organization's search patterns. If you notice that certain data becomes less relevant or is no longer frequently searched after a specific time, you can set a time period that aligns with the decreasing relevance of that data. For instance, if you know that data older than a month is rarely queried, you could set the timePeriodInSecBeforeTsidxReduction accordingly.
  3. Indexing Rate and Volume: The indexing rate and data volume play a role in determining the appropriate value for timePeriodInSecBeforeTsidxReduction. If your environment has a high indexing rate and generates a substantial amount of data, you might need more frequent tsidx reduction to manage disk space.
  4. Resource Constraints: If your system has limited disk space and you want to manage it more aggressively, you can consider decreasing the timePeriodInSecBeforeTsidxReduction value.

To configure timePeriodInSecBeforeTsidxReduction, locate the relevant index stanza in indexes.conf, enable tsidx reduction for that index, and set the desired time period in seconds. For example:

[<index>]
enableTsidxReduction = true
# Reduce tsidx files for buckets older than one week (604800 seconds)
timePeriodInSecBeforeTsidxReduction = 604800

After making the configuration change, monitor the impact on disk space usage, search performance, and tsidx reduction operations. Keep in mind that finding the right balance might require adjustments and testing based on your specific environment and use cases.
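
As one way to monitor disk usage, the following search is an illustrative example that uses the dbinspect command to summarize bucket counts and on-disk sizes by bucket state for a given index; comparing the results before and after the change shows how much space the reduction recovers. Replace <index> with the index you modified.

| dbinspect index=<index>
| stats count as buckets, sum(sizeOnDiskMB) as total_size_mb by state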

Testing and validating

Before deploying your data retention policies into a production environment, conduct thorough testing and validation. Ensure that the automatic data archival works as expected without causing unintended data loss. Run simulations or use a test environment to verify the impact on data accessibility and performance. Here are the steps to take:

  1. Create a Test Environment: Before implementing data archiving in a production environment, set up a test environment that closely resembles the production setup. This includes using a similar dataset and data volumes to simulate real-world conditions.
  2. Test the Archiving Script: If you are using a custom archiving script (specified by the coldToFrozenScript attribute), thoroughly test the script in the test environment. Ensure the script performs the archiving process efficiently and handles potential errors gracefully. The script should copy or transfer data to the designated archive location correctly and not cause any data corruption.
  3. Verify Data Restoration (Thawing): If your archiving process involves data restoration ("thawing") at a later stage, verify that the restoration process works as expected. Test the script or method for restoring archived data and ensure that the data is accessible and usable after restoration.
  4. Monitor and Log: Implement monitoring and logging mechanisms to track archiving activities in the test environment. Monitor disk space usage, archiving duration, and any potential issues that might arise during the archiving process. Enable appropriate log levels to capture relevant information for troubleshooting.
  5. Test Edge Cases: Test the archiving process under various scenarios, including edge cases. For example, test the script's behavior when archiving large volumes of data, when disk space is limited, or when multiple archiving operations are running concurrently.
  6. Check Data Integrity: After archiving data, conduct data integrity checks to ensure that the archived data matches the original data in the index. Compare checksums or hashes of the archived data with the original data to verify accuracy.
  7. Test Backup and Restore: In parallel with the archiving process, perform backup and restore tests to ensure that archived data can be reliably restored in case of any disasters or system failures.
  8. Test Performance: Measure the performance impact of the archiving process on the overall system. Monitor CPU usage, disk I/O, and memory consumption during archiving to assess its effect on system resources.
  9. Document Results: Keep detailed records of the testing process, including the configurations used, test results, any issues encountered, and their resolutions. Document the archiving script's behavior and any modifications made to the script during testing.
  10. Review and Iteration: Based on the test results, review the archiving process and script for any improvements or optimizations. Address any issues found during testing and make necessary adjustments to ensure a robust and reliable archiving mechanism.
  11. User Acceptance Testing (UAT): After the archiving process has been thoroughly tested and validated in the test environment, consider conducting UAT with a subset of end-users in the production-like environment. This will help gather feedback from users and validate that the archiving process aligns with their requirements.

By conducting rigorous testing and validation of data archiving, you can ensure a smooth and reliable implementation of the archiving process in your production environment. Regularly review and update archiving practices as your data and system requirements evolve, and maintain proper monitoring and auditing to ensure ongoing effectiveness and compliance with data retention policies.

Communicating and documenting

Communicate the new data retention policies to all relevant stakeholders within your organization. Document the policies clearly and provide accessible guidelines for employees to follow. Ensure that everyone understands the rationale behind the policies and their roles in adhering to them.

  • Communication to Relevant Stakeholders: After the new data retention policies are established, it is crucial to communicate them effectively to all relevant stakeholders within your organization. This includes data owners, data custodians, IT personnel, legal and compliance teams, and other key individuals involved in data management. Hold meetings, workshops, or presentations to disseminate the information and address any questions or concerns.
  • Rationale Behind Policies: When communicating the data retention policies, provide a clear explanation of the rationale behind them. Help stakeholders understand the reasons for implementing these policies, such as regulatory compliance, data protection, storage optimization, and improved data accessibility. Emphasize the benefits of adhering to these policies, including reduced risks, streamlined operations, and better data governance.
  • Roles and Responsibilities: Clearly define the roles and responsibilities of different stakeholders in adhering to the data retention policies. Ensure that each individual understands their specific responsibilities regarding data retention, archiving, and deletion. This might include data owners being responsible for defining retention periods, IT personnel implementing archiving procedures, and legal teams ensuring compliance with relevant regulations.
  • Documentation of Policies: Document the data retention policies in detail, outlining the specific rules and guidelines for each type of data and corresponding retention periods. Use clear and straightforward language to make the policies easily understandable to all employees. Organize the documentation in a structured manner, dividing it into sections to address various data categories and retention requirements.
  • Accessible Guidelines: Make the data retention policies easily accessible to all employees by sharing the documentation through appropriate channels. Consider storing in a centralized repository, such as an intranet site or a knowledge base, where employees can access the policies whenever needed. Provide links to relevant documents and resources for further clarification.
  • Periodic Review and Updates: Data retention needs might evolve over time due to changing business requirements or regulatory updates. Plan for periodic reviews of the data retention policies to ensure their continued relevance and effectiveness as your organization's data landscape, business needs, and regulatory requirements evolve. You should also stay up-to-date with changes in compliance regulations to ensure ongoing adherence to best practices. Keep all stakeholders informed about any updates and changes to the policies.
  • Consistent Enforcement: Enforce the data retention policies consistently across your organization to ensure uniform data management practices. Monitor compliance and address any instances of non-compliance promptly. Implement appropriate measures for continuous improvement and to address any challenges faced during policy implementation.

By effectively communicating and documenting the new data retention policies, organizations can create a transparent and accountable approach to data management. Ensuring that all employees understand their roles in adhering to these policies will foster a data-centric culture and promote responsible data practices throughout your organization.

Helpful resources

This article is part of the Splunk Outcome Path, Optimizing storage. Click into that path to find more ways to develop a systematic approach to managing capacity, as well as strategies for data retention and data lifecycle management.

In addition, these resources might help you implement the guidance provided in this article:

Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their license support plan. Engage the ODS team at ondemand@splunk.com if you would like assistance.