Data Privacy and Security: A Developer's Guide to Handling Sensitive Data With DuckDB
Master sensitive data handling with DuckDB in this guide to implementing data privacy techniques such as PII masking (includes practical Python code examples).
Understanding DuckDB for Data Privacy and Security
Data privacy and security have become critical for all organizations across the globe. Organizations often need to identify, mask, or remove sensitive information from their datasets while maintaining data utility. This article explores how to leverage DuckDB, an in-process analytical database, for efficient sensitive data remediation.
Why DuckDB? (And Why Should You Care?)
Think of DuckDB as SQLite's analytically gifted cousin. It's an embedded database that runs right in your process, but it's specifically designed for handling analytical workloads. What makes it perfect for data remediation? Well, imagine being able to process large datasets with lightning speed, without setting up a complicated database server. Sounds good, right?
Here's what makes DuckDB particularly awesome for our use case:
- It's blazing fast thanks to its column-oriented storage.
- You can run it right in your existing Python environment.
- It handles multiple file formats like it's no big deal.
- It plays nicely with cloud storage (more on that later).
In this guide, I'll be using Python along with DuckDB. DuckDB supports other languages, too, as mentioned in their documentation.
Getting Started With DuckDB for Data Privacy
Prerequisites
- Python 3.9 or higher installed
- Prior knowledge of setting up Python projects and virtual environments or Conda environments
Install DuckDB (plus pandas, which the examples use to display results as DataFrames) inside a virtual environment by running the following command:
pip install --upgrade duckdb pandas
Now that you have installed DuckDB, let's create a DuckDB connection:
import duckdb
import pandas as pd
# Create a DuckDB connection - it's this simple!
conn = duckdb.connect(database=':memory:')
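Before moving on, a quick sanity check never hurts. The minimal sketch below confirms the connection works and hints at one of the file-format perks mentioned earlier; customers.csv is just a placeholder path, so that part is left commented out:

import duckdb

conn = duckdb.connect(database=':memory:')

# A throwaway query to confirm DuckDB is wired up correctly
print(conn.execute("SELECT 'DuckDB is ready' AS status").fetchall())
# -> [('DuckDB is ready',)]

# DuckDB can also query CSV and Parquet files in place -- no loading step needed.
# 'customers.csv' is a placeholder path; read_parquet works the same way.
# rows = conn.execute("SELECT * FROM read_csv_auto('customers.csv') LIMIT 5").fetchall()
# print(rows)

conn.close()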
Advanced PII Data Masking Techniques
Let's say you've got a dataset full of customer information that needs to be cleaned up. Here's how to implement robust PII (Personally Identifiable Information) masking for the most common scenarios.
Let's create sample data:
CREATE TABLE customer_data AS
SELECT
'John Doe' as name,
'123-45-6789' as ssn,
'john.doe@email.com' as email,
'123-456-7890' as phone;
- This creates a table called customer_data with one row of sample sensitive data.
- The data includes a name, SSN, email, and phone number.
The second part applies the masking patterns using regexp_replace:
-- Implement PII masking patterns
CREATE TABLE masked_data AS
SELECT
    regexp_replace(name, '[a-zA-Z]', 'X', 'g') as masked_name,
    regexp_replace(ssn, '[0-9]', '*', 'g') as masked_ssn,
    regexp_replace(email, '(^[^@]+)(@.*$)', '****\2') as masked_email,
    regexp_replace(phone, '[0-9]', '#', 'g') as masked_phone
FROM customer_data;
Let me walk you through what the above SQL code does.
- regexp_replace(name, '[a-zA-Z]', 'X', 'g') replaces all letters (both uppercase and lowercase) with 'X'; the 'g' option makes the replacement global instead of stopping at the first match. Example: "John Doe" becomes "XXXX XXX".
- regexp_replace(ssn, '[0-9]', '*', 'g') replaces all digits with '*'. Example: "123-45-6789" becomes "***-**-****".
- regexp_replace(email, '(^[^@]+)(@.*$)', '****\2') captures everything before the @ symbol in the first group and the @ plus everything after it in the second group, then replaces the first part with '****' and keeps the domain. Example: "john.doe@email.com" becomes "****@email.com".
- regexp_replace(phone, '[0-9]', '#', 'g') replaces all digits with '#'. Example: "123-456-7890" becomes "###-###-####".
So your data is transformed as shown below:
- Original data:
name: John Doe
ssn: 123-45-6789
email: john.doe@email.com
phone: 123-456-7890
- Masked data:
masked_name: XXXX XXX
masked_ssn: ***-**-****
masked_email: ****@email.com
masked_phone: ###-###-####
Python Implementation
import duckdb
import pandas as pd

def mask_pii_data():
    # Create a DuckDB connection in memory
    conn = duckdb.connect(database=':memory:')
    try:
        # Create and populate sample data
        conn.execute("""
            CREATE TABLE customer_data AS
            SELECT
                'John Doe' as name,
                '123-45-6789' as ssn,
                'john.doe@email.com' as email,
                '123-456-7890' as phone
        """)

        # Implement PII masking (the 'g' option makes each replacement global)
        conn.execute(r"""
            CREATE TABLE masked_data AS
            SELECT
                regexp_replace(name, '[a-zA-Z]', 'X', 'g') as masked_name,
                regexp_replace(ssn, '[0-9]', '*', 'g') as masked_ssn,
                regexp_replace(email, '(^[^@]+)(@.*$)', '****\2') as masked_email,
                regexp_replace(phone, '[0-9]', '#', 'g') as masked_phone
            FROM customer_data
        """)

        # Fetch and display original data
        print("Original Data:")
        original_data = conn.execute("SELECT * FROM customer_data").fetchdf()
        print(original_data)
        print("\n")

        # Fetch and display masked data
        print("Masked Data:")
        masked_data = conn.execute("SELECT * FROM masked_data").fetchdf()
        print(masked_data)

        return original_data, masked_data
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None, None
    finally:
        # Close the connection
        conn.close()
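Once the masked table exists, it's worth double-checking that nothing slipped through. Below is a minimal validation sketch, assuming the mask_pii_data() function above is available in the same module; it simply asserts that no raw digits or email local parts survive in the masked columns:

# Minimal validation sketch -- assumes mask_pii_data() from above is in scope
original_df, masked_df = mask_pii_data()

if masked_df is not None:
    # No digits should survive in the masked SSN or phone columns
    assert not masked_df['masked_ssn'].str.contains(r'\d').any(), "Unmasked digits found in SSN"
    assert not masked_df['masked_phone'].str.contains(r'\d').any(), "Unmasked digits found in phone"

    # The local part of every email should be fully masked
    assert masked_df['masked_email'].str.startswith('****').all(), "Email local part not fully masked"

    print("Masking validation passed")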
Data Redaction Based on Rules
Let me explain data redaction in simple terms before diving into its technical aspects.
Data redaction is the process of hiding or removing sensitive information from documents or databases while preserving the overall structure and non-sensitive content. Think of it like using a black marker to hide confidential information on a printed document, but in digital form.
Let's now implement data redaction with DuckDB and Python. I've added comments to the code snippet below so you can easily follow along.
import duckdb
import pandas as pd

def demonstrate_data_redaction():
    # Create a connection
    conn = duckdb.connect(':memory:')

    # Create sample data with various sensitive information
    conn.execute("""
        CREATE TABLE sensitive_info AS SELECT * FROM (
            VALUES
                ('John Doe', 'john.doe@email.com', 'CC: 4532-1234-5678-9012', 'Normal text'),
                ('Jane Smith', 'jane123@email.com', 'SSN: 123-45-6789', 'Some notes'),
                ('Bob Wilson', 'bob@email.com', 'Password: SecretPass123!', 'Regular info'),
                ('Alice Brown', 'alice.brown@email.com', 'API_KEY=abc123xyz', 'Basic text')
        ) AS t(name, email, sensitive_field, normal_text);
    """)

    # Define redaction rules
    redaction_rules = {
        'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',  # Email pattern
        'sensitive_field': r'(CC:\s*\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}|SSN:\s*\d{3}-\d{2}-\d{4}|Password:\s*\S+|API_KEY=\S+)',  # Various sensitive patterns
        'name': r'[A-Z][a-z]+ [A-Z][a-z]+'  # Full name pattern
    }

    # Show original data
    print("Original Data:")
    print(conn.execute("SELECT * FROM sensitive_info").fetchdf())

    # Apply redaction
    redact_sensitive_data(conn, 'sensitive_info', redaction_rules)

    # Show redacted data
    print("\nRedacted Data:")
    print(conn.execute("SELECT * FROM redacted_data").fetchdf())

    return conn

def redact_sensitive_data(conn, table_name, rules):
    """
    Redact sensitive data based on specified patterns.

    Parameters:
    - conn: DuckDB connection
    - table_name: Name of the table containing sensitive data
    - rules: Dictionary of column names and their corresponding regex patterns to match sensitive data
    """
    redaction_cases = []

    # Build a CASE statement for each column:
    # if the pattern matches, the value is redacted; if not, the original value is kept
    for column, pattern in rules.items():
        redaction_cases.append(f"""
            CASE
                WHEN regexp_matches({column}, '{pattern}')
                THEN '(REDACTED)'
                ELSE {column}
            END as {column}
        """)

    query = f"""
        CREATE TABLE redacted_data AS
        SELECT
            {', '.join(redaction_cases)}
        FROM {table_name};
    """
    conn.execute(query)
# Example with custom redaction patterns
def demonstrate_custom_redaction():
    conn = duckdb.connect(':memory:')

    # Create sample data
    conn.execute("""
        CREATE TABLE customer_data AS SELECT * FROM (
            VALUES
                ('John Doe', '123-45-6789', 'ACC#12345', '$5000'),
                ('Jane Smith', '987-65-4321', 'ACC#67890', '$3000'),
                ('Bob Wilson', '456-78-9012', 'ACC#11111', '$7500')
        ) AS t(name, ssn, account, balance);
    """)

    # Define custom redaction rules: each rule pairs a pattern with a replacement string.
    # Backreferences such as \1 keep parts of the match (here, the initials of a name).
    custom_rules = {
        'name': {
            'pattern': r'([A-Z])[a-z]+ ([A-Z])[a-z]+',
            'replacement': r'\1*** \2***'  # keep initials, mask the rest
        },
        'ssn': {
            'pattern': r'\d{3}-\d{2}-\d{4}',
            'replacement': 'XXX-XX-XXXX'
        },
        'account': {
            'pattern': r'ACC#\d{5}',
            'replacement': 'ACC#*****'
        }
    }

    def apply_custom_redaction(conn, table_name, rules):
        redaction_cases = []
        # If the pattern matches, apply the custom replacement; otherwise keep the value
        for column, rule in rules.items():
            redaction_cases.append(f"""
                CASE
                    WHEN regexp_matches({column}, '{rule['pattern']}')
                    THEN regexp_replace({column}, '{rule['pattern']}', '{rule['replacement']}', 'g')
                    ELSE {column}
                END as {column}
            """)

        query = f"""
            CREATE TABLE custom_redacted AS
            SELECT
                {', '.join(redaction_cases)},
                balance  -- Keep this column unchanged
            FROM {table_name};
        """
        conn.execute(query)

    # Show original data
    print("\nOriginal Customer Data:")
    print(conn.execute("SELECT * FROM customer_data").fetchdf())

    # Apply custom redaction
    apply_custom_redaction(conn, 'customer_data', custom_rules)

    # Show results
    print("\nCustom Redacted Data:")
    print(conn.execute("SELECT * FROM custom_redacted").fetchdf())

# Run demonstrations
print("=== Basic Redaction Demo ===")
demonstrate_data_redaction()

print("\n=== Custom Redaction Demo ===")
demonstrate_custom_redaction()
Sample Results
Before redaction:

name        email                 sensitive_field
John Doe    john.doe@email.com    CC: 4532-1234-5678-9012

After redaction:

name          email         sensitive_field
(REDACTED)    (REDACTED)    (REDACTED)
Conclusion
DuckDB is a simple yet powerful in-process analytical database that can help with sensitive data remediation.
Remember to always:
- Validate your masked data.
- Use parallel processing for large datasets.
- Take advantage of DuckDB's S3 integration for cloud data (see the sketch after this list).
- Keep an eye on your memory usage when processing large files.
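On those last three points, here's a rough sketch of what they can look like in practice. The bucket path, credentials, region, and resource limits are placeholders you'd replace with your own values, and DuckDB's newer CREATE SECRET syntax is an alternative to the SET calls shown here:

import duckdb

conn = duckdb.connect(database=':memory:')

# Illustrative resource settings -- tune these for your machine
conn.execute("SET threads = 4;")
conn.execute("SET memory_limit = '4GB';")

# httpfs adds S3/HTTP support (downloaded on first INSTALL)
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

# Placeholder credentials and region -- use your own, or rely on an IAM role
conn.execute("SET s3_region = 'us-east-1';")
conn.execute("SET s3_access_key_id = 'YOUR_KEY_ID';")
conn.execute("SET s3_secret_access_key = 'YOUR_SECRET';")

# Mask PII straight out of a Parquet file in S3 -- bucket and file are placeholders
masked = conn.execute("""
    SELECT
        regexp_replace(ssn, '[0-9]', '*', 'g') AS masked_ssn,
        regexp_replace(phone, '[0-9]', '#', 'g') AS masked_phone
    FROM read_parquet('s3://your-bucket/customers.parquet')
""").fetchdf()
print(masked.head())

conn.close()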