When possible, I try to use real data for both volumetry and heterogeneity testi...

jimbokun · 2025-07-14T14:18:18 1752502698

This is very important and requires some foresight when the real data is personally identifiable information, private health information, etc.

It's possible, but it requires designing a safe way to run pre-production code that touches production data. In practice, that means you'd better be sure you're only doing reads, not writes, and that your code runs in the production environment with all the same controls as your production code.
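To make the "reads, not writes" point concrete, here is a minimal sketch that enforces it at the connection level rather than trusting the code under test. It uses SQLite from the Python standard library purely as a stand-in; the helper name `open_read_only` is hypothetical, and with a real production database the equivalent would more likely be a read-only role or a replica.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite database so that any write attempt fails.

    `mode=ro` is a standard SQLite URI parameter; the database file
    must already exist.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, a stray `INSERT` or `UPDATE` in the pre-production code raises `sqlite3.OperationalError` instead of silently modifying data.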

hamdouni · 2025-07-14T14:35:30 1752503730

You are right. I have a pre-production environment with a copy of production data and a script that scrambles names and personal info.
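For illustration only, a scrambling step of this kind could look roughly like the sketch below. This is not the actual script. The key, the field names, and the helpers `pseudonymize` and `scramble_row` are all assumptions. It replaces each personal value with a keyed hash, so the same input always maps to the same token (joins across tables still line up), and it leaves missing or empty values alone so those edge cases survive into pre-production.

```python
import hashlib
import hmac

# Hypothetical key: in practice it would live outside version control.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, field: str, key: bytes = SECRET_KEY) -> str:
    """Map a personal value to a stable, non-reversible token."""
    message = f"{field}:{value}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return f"{field}_{digest[:12]}"


def scramble_row(row: dict, pii_fields: set) -> dict:
    """Return a copy of `row` with the listed personal fields scrambled.

    None and empty strings are kept as-is so that missing-data cases
    are preserved; other values are converted to str before hashing.
    """
    return {
        k: pseudonymize(str(v), k) if k in pii_fields and v not in (None, "") else v
        for k, v in row.items()
    }
```

One trade-off: the tokens lose the original format (length, character set), so this preserves missing-value cases but not format variety in the scrambled fields.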

dirkc · 2025-07-14T13:27:20 1752499640

I try to do UX design with real data too. Not sure if that is what you mean by heterogeneity?

hamdouni · 2025-07-14T13:49:12 1752500952

Not quite UX-focused, but related.

I meant data heterogeneity - the variety in formats, edge cases, and data quality you encounter in production. Real user data often has inconsistencies, missing fields, unexpected formats, etc. that synthetic test data tends to miss.

This helps surface integration issues and performance bottlenecks early.
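A rough way to see that heterogeneity is to profile the "shape" of each field across a sample of records. The sketch below is only an illustration: the helper names `shape` and `profile` are made up, and it assumes the records are plain dicts. Digits become `9`, letters become `a`, and missing or blank values are counted separately.

```python
from collections import Counter, defaultdict


def shape(value) -> str:
    """Reduce a value to its format: digits -> 9, letters -> a."""
    if value is None:
        return "<missing>"
    text = str(value)
    if text.strip() == "":
        return "<blank>"
    return "".join(
        "9" if c.isdigit() else "a" if c.isalpha() else c for c in text
    )


def profile(records: list) -> dict:
    """Count the distinct shapes seen for every field in `records`.

    A field that is absent from a record is counted as "<missing>".
    """
    fields = set().union(*(r.keys() for r in records))
    report = defaultdict(Counter)
    for record in records:
        for field in fields:
            report[field][shape(record.get(field))] += 1
    return dict(report)
```

For example, `profile([{"phone": "555-1234"}, {"phone": "5551234"}, {}])["phone"]` counts one `999-9999`, one `9999999`, and one `<missing>`, which is the kind of variety hand-written fixtures rarely include.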