De-identifying PII consistently with hashing in Edge Processor


The collection and use of Personally Identifiable Information (PII) can be both helpful and, at times, a bit risky. While this data can be used to fuel innovation and provide tailored services to customers, the increasing volume and sensitivity of PII in recent years comes with a number of privacy risks. At the same time, the regulatory landscape surrounding data privacy continues to grow more and more complex, with increasingly strict requirements being imposed on modern data handling practices.

These requirements create a challenge for many organizations: how can customer data be used for operational and analytical purposes without compromising individual privacy? The key lies in finding ways to de-identify PII, ensuring it remains useful for data analytics and business processes while reducing the risk of exposure and compliance breaches. This problem calls for innovative solutions bridging the gap between data utility and privacy, which, fortunately, Splunk is well-positioned to help with.

Solution

Splunk Edge Processor, combined with cryptographic hash functions, plays a crucial role in safeguarding PII while preserving its utility. A hashing algorithm transforms sensitive data into a fixed-size string. The transformation is deterministic, so identical inputs always produce the same output, and one-way by design, so the output is impractical to reverse-engineer. Together, these properties keep the data obscured yet analytically valuable: you can still aggregate and correlate records by their hashed values without exposing the identities behind them.
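The consistency property described above can be illustrated outside of Edge Processor with a minimal Python sketch. It uses only the standard `hashlib` module; the `sha256_hex` helper and the sample events are hypothetical and exist only for illustration.

```python
import hashlib
from collections import Counter


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as lowercase hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical events: two belong to the same patient, one to another.
events = [
    {"medical_record_number": "MRN123456", "device_id": "device_01"},
    {"medical_record_number": "MRN123456", "device_id": "device_02"},
    {"medical_record_number": "MRN654321", "device_id": "device_01"},
]

# Replace each identifier with its digest, then count events per pseudonym.
hashed = [sha256_hex(e["medical_record_number"]) for e in events]
events_per_patient = Counter(hashed)

print(len(hashed[0]))                       # 64 hex characters for SHA-256
print(sorted(events_per_patient.values()))  # [1, 2]
```

Because identical inputs always map to the same digest, per-patient counts are unchanged even though the record numbers themselves are no longer visible.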

Exercise caution with older algorithms like MD5 and SHA1. Both have known collision weaknesses, and precomputed lookup tables (rainbow tables) for them are widely available, which makes it easier for attackers to recover the originally-hashed data. Instead, it's advisable to use more robust algorithms like SHA256 or SHA512, which remain collision-resistant. Keep in mind that no unsalted hash fully protects values drawn from a small set of possibilities, such as birthdays, because an attacker can hash every candidate value and compare the results.
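As a rough illustration of how these algorithms differ in output size, the following Python sketch compares their digest lengths using the standard `hashlib` module. It runs outside of Edge Processor, and the sample value is arbitrary. Digest length is only one factor: MD5 and SHA1 are discouraged mainly because of their known collision weaknesses.

```python
import hashlib

value = "MRN123456".encode("utf-8")

# Hex digest length per algorithm; each hex character encodes 4 bits.
digest_lengths = {
    name: len(hashlib.new(name, value).hexdigest())
    for name in ("md5", "sha1", "sha256", "sha512")
}

for name, length in digest_lengths.items():
    print(f"{name}: {length} hex characters ({length * 4} bits)")
```

The SHA512 digest is twice as long as the SHA256 digest, which is worth weighing against storage and license volume when every sensitive field in an event is hashed.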

One major benefit of Splunk Edge Processor is that it allows for the real-time application of cryptographic functions to your data as it flows within your on-premises environment. This means that PII can be obfuscated at the point of ingestion, significantly reducing the window of vulnerability to data breaches and/or unauthorized access. Processing data at the edge also minimizes the need to transmit sensitive information across networks, further enhancing security.

This approach aligns with privacy-by-design principles and helps comply with global data protection regulations such as GDPR and CCPA. By ensuring that PII is de-identified before storage or analysis, you can mitigate the legal and reputational risks associated with data privacy breaches while also retaining the usefulness of the data at hand.

To provide a more concrete example of what this process might look like, let’s take a closer look at a relatively common use case.

Use case: Enhancing patient privacy in healthcare analytics

Suppose a healthcare organization uses the Splunk platform to monitor and analyze logs that are generated from a range of medical devices. These logs are filled with sensitive patient information like names, medical record numbers, and treatment details. In order to meet Health Insurance Portability and Accountability Act (HIPAA) requirements, the organization must de-identify or anonymize this data before it's persisted and used for analysis.

For simplicity’s sake, we’ll be looking at only one type of log. It’s important, however, to note that the process outlined in the following steps can serve as a general guide for obfuscating PII across any and all ingested data. Here’s an example of a log that contains PII:

{
    "log_id": "123456789",
    "timestamp": "2024-01-01T12:00:00Z",
    "device_id": "device_01",
    "patient_info": {
        "name": "John Doe",
        "birthday": "2000-01-28",
        "medical_record_number": "MRN123456",
        "treatment_details": {
            "diagnosis": "Type 2 Diabetes",
            "treatment_plan": "Insulin Therapy",
            "prescription": "Insulin Glargine"
        }
    },
    "system_info": {
        "system_id": "system_05",
        "location": "Ward A - Room 101"
    }
}

In this instance, the log contains a patient's name, birthday, and medical record number, all of which must be de-identified. Additionally, although the log’s treatment details are not considered direct PII, they are related to the patient’s health status and are categorized as Protected Health Information (PHI) under regulations like HIPAA. This means that the diagnosis, treatment plan, and prescription must also be de-identified. The rest of the log, however, does not contain any relevant PII and can remain as-is.

Now, let's explore the specifics of how you can use Splunk Edge Processor alongside cryptographic functions to process this log data.

To properly follow the steps below, you should have already installed and configured an Edge Processor instance to ingest, transform, and route logs to the desired target destination. This setup can be achieved by following the steps outlined in the first time setup instructions, then walking through the quick start guide as needed.
Also, you should understand the potential repercussions of modifying data before attempting to do so, as dependent tools and technologies—like searches and dashboards involving the unmodified data—may break in the process.

1. Create a pipeline for real-time data obfuscation

First, select the Pipelines tab on the leftmost side of the web UI, then click New Pipeline in the top-right corner. Depending on the fields present in the ingested data, the pipeline’s partition will be defined by either source, source type, or host. It doesn’t necessarily matter which of these is selected, though it’s critical that data sent through the Splunk Edge Processor contains whichever partition is chosen; otherwise, the pipeline will not be used to transform and route the data.

2. Use cryptographic functions to mask PII

After the new pipeline has been set up, use json_set alongside sha256 to obscure the sensitive fields identified previously (name, birthday, medical record number, and treatment details). For reference, your SPL2 should look something like this:

$pipeline = | from $source
  | eval patient=_raw.patient_info
  | eval treatment=patient.treatment_details
  | eval _raw=json_set(_raw,
      "patient_info", json_set(patient,
        "name", sha256(patient.name),
        "birthday", sha256(patient.birthday),
        "medical_record_number", sha256(patient.medical_record_number),
        "treatment_details", json_set(treatment,
          "diagnosis", sha256(treatment.diagnosis),
          "treatment_plan", sha256(treatment.treatment_plan),
          "prescription", sha256(treatment.prescription)
        )
      )
    )
  | fields - patient, treatment
  | into $destination;

Search explanation

The following table explains what each part of this pipeline does. You can adjust the pipeline based on the specifics of your environment.

Splunk Search	Explanation
`$pipeline = | from $source`	Specifies the data source from which the pipeline will read. This can be a dataset, a stream of events, logs, or any other data input that SPL2 supports. In this case, we’re dealing with the medical device logs mentioned previously.
`| eval patient=_raw.patient_info` `| eval treatment=patient.treatment_details`	Creates temporary references to the log’s `patient_info` and `treatment_details` data. This intermediate step isn’t technically necessary, though it makes the de-identification process easier to parse in the following lines.
`| eval _raw=json_set(_raw,` `"patient_info", json_set(patient,` `"name", sha256(patient.name),` `"birthday", sha256(patient.birthday),` `"medical_record_number", sha256(patient.medical_record_number),` `"treatment_details", json_set(treatment,` `"diagnosis", sha256(treatment.diagnosis),` `"treatment_plan", sha256(treatment.treatment_plan),` `"prescription", sha256(treatment.prescription)` `)` `)` `)`	Obfuscates the previously-identified PII, using nested `json_set` calls to substitute each sensitive field with its hashed equivalent. In this case, we’ve opted to use the `sha256` function in place of `md5` or `sha1` due to its enhanced security properties and collision-resistant nature. `sha512` was also a good candidate; however, the longer hash it produces takes up more space than this use case needs. The output of `json_set` must be reassigned to the `_raw` field; otherwise, the plaintext fields will still be present in the event, since `json_set` doesn’t perform in-place modification.
`| fields - patient, treatment`	Removes the temporary fields created previously from the outgoing event data. If this command is omitted from the pipeline, the plaintext information contained within these fields will be routed to the target destination along with the hashed event.
`| into $destination;`	Specifies the data’s target destination after it has been processed by the pipeline. In this case, the modified log is routed to `$destination`, which can be any destination configured for your Edge Processor, for example a Splunk platform deployment or an Amazon S3 bucket.
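To sanity-check what the pipeline should emit, it can help to reproduce the same transformation offline. The following Python sketch mirrors the steps in the table using only the standard library. It is an approximation rather than a reproduction of Edge Processor internals: the `deidentify` and `sha256_hex` helpers are hypothetical names, and the sketch assumes that the SPL2 `sha256` function returns the lowercase hexadecimal digest of the UTF-8 encoded field value, as Python's `hashlib` does.

```python
import copy
import hashlib
import json


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as lowercase hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Fields identified as PII or PHI in the sample log.
PATIENT_FIELDS = ("name", "birthday", "medical_record_number")
TREATMENT_FIELDS = ("diagnosis", "treatment_plan", "prescription")


def deidentify(event: dict) -> dict:
    """Return a copy of the event with sensitive fields replaced by digests."""
    result = copy.deepcopy(event)  # leave the original event untouched
    patient = result["patient_info"]
    treatment = patient["treatment_details"]
    for field in PATIENT_FIELDS:
        patient[field] = sha256_hex(patient[field])
    for field in TREATMENT_FIELDS:
        treatment[field] = sha256_hex(treatment[field])
    return result


sample = json.loads("""
{
    "log_id": "123456789",
    "timestamp": "2024-01-01T12:00:00Z",
    "device_id": "device_01",
    "patient_info": {
        "name": "John Doe",
        "birthday": "2000-01-28",
        "medical_record_number": "MRN123456",
        "treatment_details": {
            "diagnosis": "Type 2 Diabetes",
            "treatment_plan": "Insulin Therapy",
            "prescription": "Insulin Glargine"
        }
    },
    "system_info": {
        "system_id": "system_05",
        "location": "Ward A - Room 101"
    }
}
""")

masked = deidentify(sample)
print(json.dumps(masked, indent=4))
```

In the printed event, the six sensitive fields are replaced by 64-character digests, while `log_id`, `timestamp`, `device_id`, and `system_info` are unchanged. Under the same assumption, `sha256_hex` can also compute the digest of a known identifier when you need to update a search or dashboard that previously matched on the plaintext value.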

And that’s it! In just a few lines, you’ve de-identified the sensitive fields in your logs, supporting your regulatory obligations while retaining the utility of the data in question.

Next steps

The combination of Splunk Edge Processor and cryptographic functions offers a robust solution to the problem of balancing data utility with privacy concerns. It enables you to leverage the full potential of your data for innovation and decision making without sacrificing the privacy of individuals. This method not only strengthens data security but also builds trust with customers and stakeholders by demonstrating a commitment to privacy and regulatory compliance.

Join the #edge-processor Slack channel for direct support with Splunk Edge Processor (request access: http://splk.it/slack). Then, review the additional resources below to help you better understand and implement this use case:

Resource: Edge Processor resource hub
Docs: About the Splunk Edge Processor solution
Docs: Splunk Edge Processor pipeline syntax
Blog: Data preparation made easy: SPL2 for Splunk Edge Processor
Tech Talk: Introducing Splunk Edge Processor

You should also review these additional use cases on Splunk Lantern:

Implementing use cases in Splunk Edge Processor (including how to filter Kubernetes data over HEC, mask sensitive information, and modify raw events to remove fields)
Enriching data via real-time threat detection with KV Store lookups in Edge Processor
Reducing PAN and Cisco security firewall logs with Splunk Edge Processor
Routing root user events to a special index
Masking IP addresses from a specific range