
Establishing data retention policies

 

Data retention, at its core, refers to the policies and practices that determine how long data is stored and retained within the Splunk platform. However, retention isn't just about storage. It's about striking the right balance between resource utilization, data availability for analysis, and compliance with internal and external regulations.

The Splunk platform is often flooded with vast amounts of data. In this context, data retention isn't a mere feature; it's an integral part of data management, ensuring that only relevant data is stored while outdated or unnecessary data is systematically purged.

This section covers the following data retention topics:

  1. Defining the lifecycle of data in the Splunk platform
  2. Identifying stakeholders for data retention discussions
  3. Defining data retention policies
  4. Implementing data retention in the Splunk platform
  5. Securely deleting data
  6. Monitoring and auditing data retention practices

Defining the lifecycle of data in the Splunk platform

  • Ingestion: The journey of data in the Splunk platform begins with ingestion. As data streams in, the Splunk platform indexes it, making the raw data searchable and analyzable. During this phase, you must ensure data is categorized correctly and the right meta-information is associated with it, as this impacts retention decisions later on.
  • Storage and Analysis: After it is ingested, data resides in the indexes and is available for analysis. In the indexes, data can be subjected to numerous operations: searches, reports, alerts, and more. Data retention policies during this phase revolve around optimizing storage, ensuring that frequently accessed data is readily available, while less critical data might be moved to slower, more cost-effective storage.
  • Aging and Rolling: As data ages and passes the threshold defined by the retention policies, it transitions between several buckets within the Splunk platform. From "hot" to "warm," then "cold," and possibly even to "frozen," each stage represents a different storage and accessibility status. This systematic aging and rolling process, driven by the Splunk platform's configurations, ensures efficient storage use.
  • Deletion or Archival: The final stage in the data lifecycle is its removal from active storage. Based on the retention policy, after data surpasses its defined lifespan, the Splunk platform either deletes it or moves it to a "frozen" state, which could mean archival for long-term storage or complete removal.

This lifecycle underscores the dynamic nature of data within the Splunk platform. By understanding each stage, from the point of ingestion to eventual deletion or archival, you can better tailor your data retention policies, ensuring they are both compliant with regulations and efficient in terms of storage resource management.
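To see where an index's data currently sits in this lifecycle, you can inspect its buckets directly. The following search is a minimal sketch that uses the dbinspect command to summarize bucket counts and sizes by state for a hypothetical index named myindex; substitute your own index name and time range.

| dbinspect index=myindex
| stats count AS buckets, sum(sizeOnDiskMB) AS total_mb BY state
| sort state

The state field reports hot, warm, or cold; frozen data no longer appears here because it has already been archived or deleted.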

Identifying stakeholders for data retention discussions

Misalignment among stakeholders can result in inefficiencies, increased costs, or even legal repercussions. Here's why identifying stakeholders from multiple business units is essential:

  • Consistency: Ensuring that retention policies are uniform across different datasets and departments avoids confusion and potential errors.
  • Compliance: With the combined expertise of IT and legal teams, your organization can ensure its policies are both technically feasible and legally compliant.
  • Optimization: By understanding the specific data needs of various business units, IT can optimize storage, ensuring critical data is readily available while archiving or deleting outdated information.

Key stakeholders typically include:

  • IT Teams: Responsible for the technical implementation and maintenance of the Splunk platform, they possess firsthand knowledge of the platform's capabilities, limitations, and current configurations.
  • Legal and Compliance Units: They ensure that data retention practices adhere to regulatory requirements and organizational legal obligations. Their input is crucial in setting minimum and maximum retention periods.
  • Business Units: Different departments might have varying needs concerning data accessibility and retention. For instance, the marketing team's requirements might differ considerably from those of the finance department.

By involving the right stakeholders and fostering open communication, organizations can craft data retention policies in the Splunk platform that are both effective and reflective of their unique operational and compliance needs.

Defining data retention policies

Data retention policies provide a structured approach to managing the duration for which data is stored, ensuring it's available when needed but removed when no longer necessary. Here's a detailed look into the elements to consider when defining these policies.

  • Regulatory and Compliance Requirements: Different industries and regions have distinct laws and regulations dictating the minimum or maximum periods for retaining certain data types. For instance, financial transaction data might have a different retention period than consumer interaction data due to financial regulations. Regularly review industry-specific guidelines and regional data protection laws to ensure that your policies remain compliant.
  • Business Needs and Operational Requirements: Not all data has the same value or utility over time. While some datasets might be crucial for long-term trend analysis, others might lose their relevance quickly. Collaborate with various departments to learn how often they access older data and the implications of not having that data available.
  • Storage Constraints and Costs: While retaining data can be beneficial, be mindful of the associated costs, both in terms of physical storage and potential performance implications. Analyze the growth rate of your data, project future storage needs, and factor in costs to determine an economically feasible retention period.
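To ground the storage and cost discussion in real numbers, it helps to measure how quickly your data is actually growing. The following search is an illustrative sketch that charts daily indexing volume per index from the license usage log; it assumes the search runs where the license manager's _internal data is available, and the time span and grouping can be adjusted to suit your environment.

index=_internal source=*license_usage.log type="Usage"
| eval GB = b / 1024 / 1024 / 1024
| timechart span=1d sum(GB) AS daily_GB BY idx

Projecting these daily volumes forward against your hot, cold, and archive storage costs gives a concrete basis for choosing economically feasible retention periods.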

Workshops and sessions to gather requirements

To capture the diverse needs and considerations for data retention, conduct the following requirements gathering sessions:

  • Initial Workshops: Begin with broad sessions, introducing the goals of the data retention initiative and gathering preliminary insights from different teams.
  • Focused Discussions: Conduct more in-depth discussions with individual departments or teams, diving into their specific data needs, access frequencies, and retention requirements.
  • Feedback Loops: After drafting an initial retention policy, share it with stakeholders for feedback. This iterative process helps refine the policy, ensuring it's both comprehensive and practical.

Documenting policies clearly and comprehensively

After you've settled on a data retention policy, document it in a clear and accessible manner. This documentation should:

  • Define the different data categories.
  • Specify the retention period for each category.
  • Detail the procedures for data deletion or archival.
  • Highlight any exceptions or special cases.

Regularly review and update this documentation, especially when there are changes in business operations, regulatory environments, or technological infrastructures.

Implementing data retention in the Splunk platform

Effective data management within the Splunk platform revolves around understanding the lifecycle of data and configuring the platform to respect your established retention policies. To accomplish this, you must understand the various data bucket types in the Splunk platform and the tools available to tailor retention settings.

Configuring index settings for retention

The Splunk platform offers granular controls to define the behavior of each bucket type. Use the indexes.conf file to set parameters like maxHotSpanSecs for hot buckets or frozenTimePeriodInSecs to determine the age at which data transitions to the frozen state.

Note that maxHotSpanSecs is an advanced setting that should be configured with care and with an understanding of the characteristics of your data.

Remember to consider both the size and age of data when defining retention settings. Depending on your policies, either parameter might trigger a transition between bucket states.

Example Scenario

An organization has identified a specific use case that requires a custom Splunk platform index configuration. The index, named myindex, will receive a data volume of approximately 5GB/day. The retention requirements for this index are specific:

  • Hot Buckets: Data should only stay in the hot bucket for a maximum duration of 24 hours before transitioning to the warm bucket.
  • Warm Buckets: After leaving the hot bucket, data should reside in the warm bucket for 30 days.
  • Cold Buckets: After the warm bucket phase, the data should transition into cold storage, where it will remain for another 30 days.
  • Frozen Data: After its time in cold storage, the data should be archived rather than deleted, in case it is needed in the future.

To achieve this, the organization needs to customize the Splunk platform indexes.conf file for myindex:

[myindex]
homePath = $SPLUNK_DB/myindex/db
coldPath = $SPLUNK_DB/myindex/colddb
thawedPath = $SPLUNK_DB/myindex/thaweddb
# Maximum hot/warm bucket size in MB; roughly one day of data at ~5GB/day
maxDataSize = 5000
# 60 days total before freezing (30 days in warm + 30 days in cold)
frozenTimePeriodInSecs = 5184000
# Roll hot buckets after 24 hours
maxHotSpanSecs = 86400
# At ~5GB/day and ~5GB per bucket, 30 warm buckets is roughly 30 days in warm storage
maxWarmDBCount = 30
# To archive rather than delete frozen data, set ONE of the following
# (if both are set, coldToFrozenDir takes precedence):
coldToFrozenDir = /path/to/my/archive/location
# coldToFrozenScript = /path/to/my/archive/script

Securely deleting data

Deleting data securely isn't just about making space or managing storage; it's about ensuring that sensitive information doesn't fall into the wrong hands or get misused post-deletion.

The imperative of secure data deletion

Every piece of data that flows into the Splunk platform might contain information of varying sensitivity levels. Whether it's personally identifiable information (PII), business intellectual property, or operational insights, after the decision to delete it is made, its removal should be thorough and irreversible. Secure data deletion is:

  • Mandatory for Compliance: Many regulatory standards, like GDPR, HIPAA, and others, mandate secure data erasure as part of their compliance requirements.
  • Critical for Privacy: In the case of data breaches, securely deleted data offers no value to malicious actors, ensuring further data safety.
  • Essential for Trust: Customers, stakeholders, and partners are more likely to trust organizations that take every facet of data security, including its deletion, seriously.

Data deletion in the Splunk platform

The Splunk platform offers a set of tools and configurations that assist in data management, including its deletion.

  • Bucket Deletion: Data in the Splunk platform resides in buckets based on its age and status (hot, warm, cold, etc.). Deleting buckets is one method to remove data, but this doesn't ensure secure erasure.
  • Expire and Delete: Configurations in indexes.conf allow for automatic data aging out and deletion after a set period, defined by the frozenTimePeriodInSecs setting.
  • Manual Deletion: The Splunk platform provides SPL commands like delete to remove indexed data. However, remember that the delete command doesn't physically remove the data from disk; it only marks it as non-searchable.
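As an illustration of manual deletion, the sketch below assumes a hypothetical index myindex containing events of a sourcetype, legacy_app, that should no longer be searchable. Running the delete command requires a role with the can_delete capability, and because the data is only marked as non-searchable, it does not free disk space.

index=myindex sourcetype=legacy_app
| delete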

Ensuring secure erasure

While the Splunk platform provides tools for data deletion, ensuring the data is irrecoverable post-deletion requires additional steps:

  • Overwriting Data: Secure deletion tools work by overwriting the storage space occupied by the data multiple times with random patterns, ensuring that data recovery tools cannot retrieve the original data.
  • Physical Destruction: For highly sensitive data, after secure deletion, organizations sometimes opt for physical destruction of storage devices.
  • Encryption: Data encrypted at rest ensures that even if someone could retrieve deleted data, they wouldn't decipher its content without the encryption key.

By understanding the tools and techniques available within the Splunk platform, coupled with industry best practices, organizations can confidently manage and, when necessary, erase data securely.

Monitoring and auditing data retention practices

Properly setting up data retention practices in the Splunk platform is the first step in a broader journey. To ensure the ongoing efficacy and compliance of these practices, active monitoring and periodic auditing are a must.

Alerts and reports on data age

To actively manage and oversee data retention, you can:

  • Set Up Alerts: Create alerts in the Splunk platform that fire when data in an index reaches, or is about to reach, its maximum age limit. This proactive measure allows administrators to take timely action, be it data backup, transfer, or deletion.
  • Generate Age Reports: Create periodic reports that offer insights into the age of data across various indexes. These reports can help identify anomalies or deviations from the retention policy.
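As a starting point for an age report, the sketch below uses the dbinspect command, which exposes the earliest event time in each bucket, to estimate the oldest data held per index. The 90-day threshold is only an example; replace it with the limits defined in your retention policy.

| dbinspect index=*
| eval age_days = round((now() - startEpoch) / 86400, 1)
| stats max(age_days) AS oldest_data_days BY index
| where oldest_data_days > 90

Saving this search as a scheduled report, or as an alert on the threshold condition, covers both of the practices above.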

Periodic review of retention settings

Data retention is not a static operation, and as organizational needs evolve, so too should retention practices. You should employ the following review processes:

  • Scheduled Reviews: Organize regular reviews of your Splunk platform's indexes.conf and other retention-related configurations to ensure they align with the current policy and operational needs.
  • Change Management: Any changes to retention settings should be documented, with reasons for the change, the individuals involved, and the date of the change. This aids in audit trails and provides clarity for future reviews.
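For scheduled reviews, you can pull the effective retention settings for every index directly from the platform instead of reading indexes.conf files by hand. The following search is a sketch using the data/indexes REST endpoint; the columns shown are examples of settings worth comparing against the documented policy, and in a distributed environment you might add splunk_server=* to include the indexers.

| rest /services/data/indexes
| table title frozenTimePeriodInSecs maxTotalDataSizeMB maxWarmDBCount coldToFrozenDir
| sort title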

The act of retaining data in the Splunk platform should be as deliberate and standardized as the act of deleting it. With the tools and strategies at hand, administrators can ensure that data retention practices are transparent, traceable, and in alignment with both policy directives and operational necessities.

Helpful resources

This article is part of the Splunk Outcome Path, Enhancing data management and governance. Click into that path to find more ways to ensure data consistency, privacy, accuracy, and compliance, ultimately boosting overall operational effectiveness. 

In addition, these resources might help you implement the guidance provided in this article:

Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their license support plan. Engage the ODS team at ondemand@splunk.com if you would like assistance.