19.08.2025
18.08.2025
Postgresql
Software

Tradeoffs of Anonymising Production Data

Michał Łęcicki
Ruby Developer

At some point, in every Rails project, someone says: 'Hey, our testing dataset is not relevant anymore, maybe we could anonymise production data and use it instead?' That's when a tough journey begins: ensuring data privacy, updating scripts, and securing infrastructure. Let's explore the challenges you might face if you decide to anonymise your production data.

What even is data anonymisation?

Data anonymisation means stripping your dataset of all Personally Identifiable Information (PII). Under the GDPR, personal data includes names, addresses, dates, and financial information that can be used to identify a person. Not every piece of information is equally sensitive: a credit card number, for example, is more sensitive than a company name. As a rule of thumb, if someone steals your anonymised database and can't trace the original records from it, you are good. That means the anonymisation has to be irreversible. If the data can be mapped back to the original, it's called pseudonymisation. Pseudonymised data still counts as personal data under the GDPR, so you'd rather avoid it.
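To make the difference between reversible and irreversible replacement concrete, here is a minimal Ruby sketch. The method names `pseudonymise` and `anonymise` and the sample e-mail addresses are made up for illustration, and treating an unsalted hash as the reversible case is an assumption chosen because it is a common shortcut, not something a particular project does.

```ruby
require "digest"
require "securerandom"

# Reversible in practice: the same input always gives the same hash,
# so anyone with a list of candidate e-mails can link a hash back to a person.
def pseudonymise(email)
  Digest::SHA256.hexdigest(email.downcase)
end

# Irreversible: the replacement is random and keeps no link to the original value.
def anonymise(_email)
  "user-#{SecureRandom.hex(8)}@example.com"
end

leaked_value = pseudonymise("jane.doe@example.com")
guesses = ["john.smith@example.com", "jane.doe@example.com"]

# An attacker simply hashes the guesses and compares them with the leaked value.
recovered = guesses.find { |guess| pseudonymise(guess) == leaked_value }

puts "Recovered from the hash: #{recovered.inspect}"
puts "Random replacement: #{anonymise("jane.doe@example.com")}"
```

The random replacement breaks the link to the original record, but it also breaks anything that relied on the real value, such as duplicate detection by e-mail.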

Reasons to anonymise

It all starts with identifying your actual goal.

  • Are some unreproducible bugs happening in production?
  • Do you need to run performance tests on a production-like dataset?
  • Or do you want to test your data migrations without any complications?

The list of reasons can be long, and each project has its unique conditions. Often, data anonymisation is just one of several possible strategies. Taking one step back might help you find simpler alternatives.

For instance, monitoring tools like New Relic can help with checking performance. Debugging data issues in production is much the same: using Active Admin or QuickSight for observability can remove the need for an anonymised database. With tools like Active Admin, you can inspect live production data directly in a controlled environment - no need to download or copy it elsewhere. Dummy-data generators, in turn, can solve the problem of a low volume of data. Data anonymisation is only one of the options, and usually not the best one. Why consider alternatives to data anonymisation? Because of:

The cost of data anonymisation

Data anonymisation is expensive on different levels. Let's reveal them.

First: data security and privacy. You need to make sure anonymised data does not contain any PII. You need to remove all addresses, real names, birthdates, bank accounts, and more from each table with great attention. But consider also less obvious fields: record created_at or updated_at timestamps, links to files, API keys, internal notes, and slugs. If you want truly anonymised data, the script needs to be extremely precise - that's hard to achieve.
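As a rough illustration of what such a script has to cover, the sketch below anonymises a single row represented as a plain Hash. All column names, the `RULES` and `KEEP` constants, and the `anonymise_row` method are hypothetical. Coarsening `created_at` to the first day of the month is only one possible way to reduce timestamp precision, and whether it is enough depends on your data.

```ruby
require "securerandom"

# Hypothetical rules: one lambda per sensitive column.
RULES = {
  "first_name"     => ->(_value) { "User" },
  "last_name"      => ->(_value) { SecureRandom.hex(4) },
  "email"          => ->(_value) { "#{SecureRandom.hex(6)}@example.com" },
  "birthdate"      => ->(_value) { nil },
  # Less obvious fields need rules too.
  "api_key"        => ->(_value) { nil },
  "internal_notes" => ->(_value) { "" },
  "slug"           => ->(_value) { SecureRandom.hex(6) },
  # Keep only the year and month of the timestamp.
  "created_at"     => ->(value) { value && Time.utc(value.year, value.month, 1) }
}.freeze

# Columns explicitly reviewed and considered safe to copy as-is.
KEEP = %w[id status].freeze

def anonymise_row(row)
  row.to_h do |column, value|
    if RULES.key?(column)
      [column, RULES.fetch(column).call(value)]
    elsif KEEP.include?(column)
      [column, value]
    else
      # Fail loudly instead of silently copying a column nobody reviewed.
      raise ArgumentError, "no anonymisation rule for column #{column.inspect}"
    end
  end
end

row = {
  "id" => 1,
  "status" => "active",
  "email" => "jane.doe@example.com",
  "slug" => "jane-doe",
  "created_at" => Time.utc(2024, 3, 17, 9, 30)
}
puts anonymise_row(row).inspect
```

Raising on unknown columns is a deliberate choice in this sketch: a column that nobody classified stops the run rather than leaking into the copy.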

Second: extra complexity in your infrastructure. Data anonymisation means copying the real production database. The process must run in the production ecosystem - downloading raw data to anonymise it locally would defeat the entire purpose. It’s a sensitive operation that requires extra care. It also adds another brick to your infrastructure - someone has to maintain it and ensure it stays secure.

Third: maintenance and updates. You have a perfect script and elegant infrastructure to invoke it, but your work is not done. Every update to the data structure requires checking if the script should also be updated! This step might be partially automated with CI, but the anonymisation script is still something to remember after any code changes. Extra effort and a huge risk of mistakes.
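One way to automate part of this in CI is to compare the columns in the schema with the columns the anonymisation script knows about, and fail the build when they drift apart. The sketch below works on plain Hashes to stay self-contained. The `uncovered_columns` method and the sample tables are hypothetical, and in a Rails app the schema side could be built from something like `ActiveRecord::Base.connection.columns(table)`.

```ruby
# Returns "table.column" names that exist in the schema
# but are not handled by the anonymisation script.
def uncovered_columns(schema, covered)
  schema.flat_map do |table, columns|
    missing = columns - covered.fetch(table, [])
    missing.map { |column| "#{table}.#{column}" }
  end
end

# Hypothetical schema after a migration added users.phone and a new table.
schema = {
  "users"    => %w[id email phone created_at],
  "invoices" => %w[id user_id iban]
}

# Columns the anonymisation script currently has a decision for.
covered = {
  "users" => %w[id email created_at]
}

missing = uncovered_columns(schema, covered)
puts "Columns without an anonymisation decision: #{missing.join(', ')}"
```

A check like this only tells you that a decision is missing. Someone still has to decide how each new column should be treated.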

Data anonymisation and AI

In the modern era of agentic coding and MCP servers, a new question arises. What if you want to connect AI to your production database to get some insights or solve some problems? Since your trust is limited, you want to anonymise the data first. That's a fair point, but using AI on the anonymised database won't work as well as with real data. Why? Because truly anonymised data is no longer production data. Names, dates, null values, and long text notes are all different.

Want AI to help you fix tricky production bugs? Keep in mind that anonymisation greatly alters the dataset, so bugs may no longer be reproducible. The same problem applies to insights about your database - anonymised data can behave differently. But there is good news: you can find missing indexes, run performance tests, and inspect table structures using other tools. You don't need a real production database to do this.

Summary

Data anonymisation has always been important. However, in the modern era of data leaks and AI learning from datasets, it is even more significant. Explore other solutions first! Data anonymisation comes with high maintenance costs, a lot of effort, and uneven risks. Treat it as a last resort. And if you do choose to anonymise, be extra careful about security and maintenance.
