2026
Fonet HBYS data anonymizer
A desktop application that turns raw emergency-department data from the hospital's HBYS system into de-identified datasets suitable for retrospective research.
The problem
Every retrospective study I run draws on the same raw data: the emergency-department record exports produced by the hospital’s HBYS system (Fonet). These exports can’t be used for research in their raw form — they contain national ID numbers, names, phone numbers, addresses. Research use requires stripping those fields, coarsening dates of birth to the year, and converting doctor names to pseudonyms. Doing this by hand for every study is both tedious and error-prone; forgetting one identifying field once means leaving real identifying information in a dataset everyone believes is de-identified.
On top of that, Fonet’s exports are technically awkward: they arrive in the legacy BIFF .xls format, Turkish character encoding can come through garbled, and when you request more than a single month of data the file exceeds its row limit and splits across multiple sheets — some of which don’t reprint the header row.
What I built — and why the first version failed
I wrote the first version as a Flask web tool. It worked — but my physician colleagues with no software background couldn’t use it, because running it required setting up a Python environment. pip install, virtual environments, the command line — these aren’t an invisible barrier to me, but they are to a specialist physician with no software experience. The tool worked technically and was useless in practice, because “in practice” includes the question of whether the intended user can open it.
So I rewrote the tool from scratch as a Windows desktop application, in C# and WPF. A single packaged executable — double-click, it opens. No Python, no installation, no command line. I didn’t write something easier in a more familiar language; I wrote something more complex in a less familiar one (I learned C# for this project) — because the language that was easy for me had produced a tool that was impossible for them.
The new version ingests the files, strips the identifying fields, coarsens dates of birth to the year, converts doctor names to pseudonyms like Dr_1, Dr_2, and produces a CSV ready for analysis in R or Python. It also has a summary dashboard — diagnosis distribution, age histogram, admission hours, lengths of stay — and clicking any element of those charts cross-navigates to a row view filtered by that element.
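The doctor-name pseudonymization amounts to a stable first-seen mapping: the first distinct name becomes Dr_1, the second Dr_2, and repeat occurrences reuse the same pseudonym. A minimal sketch of that idea in Python (the actual tool is C#; the function name and example names here are illustrative):

```python
def pseudonymize(names, prefix="Dr"):
    """Map each distinct name to a stable pseudonym (Dr_1, Dr_2, ...),
    assigned in order of first appearance."""
    mapping = {}
    out = []
    for name in names:
        if name not in mapping:
            mapping[name] = f"{prefix}_{len(mapping) + 1}"
        out.append(mapping[name])
    return out, mapping

# Repeat occurrences of the same name get the same pseudonym:
# pseudonymize(["Ayse Yilmaz", "Mehmet Kaya", "Ayse Yilmaz"])
# -> (["Dr_1", "Dr_2", "Dr_1"], {"Ayse Yilmaz": "Dr_1", "Mehmet Kaya": "Dr_2"})
```

Keeping the mapping in memory only for the duration of one run means the pseudonyms are consistent within a dataset but carry no link back to the real names afterwards.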
What was technically interesting
Two things.
The first is a data-integrity rule: you can’t de-identify before you derive. The dataset the tool produces includes, as one of its columns, how many times a patient was admitted that month. Computing that number requires grouping patients by national ID — but the national ID is the first thing the de-identification step removes. So any per-patient computation has to happen while the identifying information is still in memory; you can no longer group over the de-identified output, because the grouping key has been deleted. It’s a rule that forces an ordering: derive first, de-identify second. It sounds simple, but doing it the other way around is very easy, and when you do, the error is silent — the code runs, the output is wrong.
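The ordering can be made concrete with a small pandas sketch (the tool itself is C#; the column names and values below are invented for illustration). The per-patient admission count is computed while the national ID is still present, and only then is the ID deleted:

```python
import pandas as pd

# Illustrative raw export: real data would come from the HBYS .xls files.
raw = pd.DataFrame({
    "national_id": ["111", "222", "111", "333", "111"],
    "birth_date":  ["1980-05-01", "1975-02-11", "1980-05-01",
                    "1990-09-30", "1980-05-01"],
    "doctor":      ["A", "B", "A", "C", "B"],
})

# 1) Derive first: admissions per patient, grouped on the raw national ID.
raw["admission_count"] = (
    raw.groupby("national_id")["national_id"].transform("count")
)

# 2) De-identify second: coarsen birth date to year, then delete the ID.
raw["birth_year"] = pd.to_datetime(raw["birth_date"]).dt.year
deidentified = raw.drop(columns=["national_id", "birth_date"])
```

Running the two steps in the opposite order leaves no grouping key to count over, which is exactly the silent failure described above.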
The second is a silent data corruption in Fonet’s multi-sheet exports. Fonet splits exports that exceed the legacy Excel format’s 65,536-row sheet limit across multiple sheets — but it doesn’t reprint the header row on all of them. Under the Excel reader library’s default setting, rows on the header-less sheets were being read as records with every field null: data loss, but without throwing an error. The fix is to read the header from the first sheet only and apply it to every sheet, while also separately detecting the case where the header is reprinted and skipping that row.
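The stitching logic is library-independent once the sheets are read as raw rows with no header inference. A sketch of it in Python (the actual implementation is C#; `stitch_sheets` is a hypothetical name for illustration):

```python
def stitch_sheets(sheets):
    """Merge a multi-sheet export into one table.

    `sheets` is a list of sheets, each a list of rows, read with NO header
    inference. The header is taken from the first sheet only and applied to
    all sheets; a reprinted header row on a later sheet is detected by
    comparing it to that header, and skipped.
    """
    header = sheets[0][0]
    rows = []
    for i, sheet in enumerate(sheets):
        for j, row in enumerate(sheet):
            if i == 0 and j == 0:
                continue  # the one real header row
            if row == header:
                continue  # a reprinted header on a later sheet
            rows.append(row)
    return header, rows

# Sheet 2 has no header, sheet 3 reprints it; both cases stitch cleanly:
# stitch_sheets([[["id", "dx"], ["1", "a"]],
#                [["2", "b"]],
#                [["id", "dx"], ["3", "c"]]])
# -> (["id", "dx"], [["1", "a"], ["2", "b"], ["3", "c"]])
```

Reading with header inference off everywhere, then re-attaching one header, is what prevents the all-null rows: no sheet is ever asked to interpret its first data row as column names, and no header row is ever read as data.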
The first version’s code is still public: data_extractor_tool. It stands as a working Flask tool — but it’s no longer used, and why it’s no longer used is this project’s real lesson.
Outcome
The C# version is used as the data pipeline for ongoing retrospective studies. My physician colleagues can now extract their own datasets themselves — without having me set up a Python environment for them, just by double-clicking a file.
This project taught me that “works on my machine” and “works on my colleagues’ machines” are not the same statement. The single-file HTML tools in this portfolio — the consultation generator, the scheduling tools — were designed already knowing this lesson. This project is where I learned it.