CEDA Data Conventions
Standards for data handling, naming, formats, and privacy across all cedanl repositories.
Column Naming
Preserve source names
Keep the original column and variable names from data sources as much as possible. This ensures traceability and makes it easy for domain experts to recognize fields.
New variables
When creating derived or computed variables, follow the naming conventions of the source data. In addition:
- Names should be descriptive and immediately understandable; when in doubt, a longer name is better than a shorter one.
- Use underscores as separators
Suffix conventions for derived variables
| Suffix | Meaning | Example |
|---|---|---|
| _cat | Categorized/binned numeric | hoogste_vooropleiding_soort_cat |
| _code | Numeric code for a string value | vooropleiding_voor_ho_code |
| _datum | Date field | leeftijd_peildatum_1_oktober_datum |
| _norm | Standardized/normalized | INS_Score_norm |
Character rules
- No diacritics (no accents, umlauts, or special characters) in column names
- No spaces — use underscores
- Letters, digits, and underscores only
- descriptive case, object, statement style.
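The character rules above can be checked mechanically. The following is a minimal sketch, not an official CEDA tool: the function names is_valid_column_name and suggest_column_name are hypothetical, and the suggestion strategy (strip diacritics, replace everything else with underscores) is an assumption about how a non-compliant name might be repaired.

```python
import re
import unicodedata

# ASCII letters, digits, and underscores only (this excludes spaces and diacritics).
VALID_COLUMN_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_column_name(name: str) -> bool:
    """Return True if the name uses only ASCII letters, digits, and underscores."""
    return VALID_COLUMN_NAME.fullmatch(name) is not None


def suggest_column_name(name: str) -> str:
    """Suggest a compliant name (hypothetical helper, not part of the conventions)."""
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the combining mark.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of disallowed characters into a single underscore.
    return re.sub(r"[^A-Za-z0-9_]+", "_", ascii_name).strip("_")
```

A check like this could run in a test suite over the header of every output dataset, so that a violation is caught before the data is published.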
File Formats
Internal data (between pipeline steps)
Use Parquet for all intermediate data files within a repository. Parquet is:
- Fast to read/write
- Compact (compressed)
- Type-preserving (no string-to-factor issues)
- Language-agnostic (works in R and Python)
Output data (for consumers)
Provide both formats:
- Parquet: for programmatic consumption by other repos
- CSV: for users who need to open data in Excel or other tools
CSV conventions
- Delimiter: ; (semicolon), the standard for Dutch data, which avoids conflicts with the comma used in decimal numbers
- Encoding: UTF-8 for output files
- Quote all text fields that might contain delimiters
Always convert to UTF-8 during the ingestion step.
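As an illustration of the CSV conventions above, here is a minimal sketch using only the Python standard library. The function names convert_to_utf8 and write_output_csv are hypothetical, and the sketch assumes the source encoding of an input file is known to the caller. It quotes every non-numeric field, which is one cautious reading of the quoting rule.

```python
import csv
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode a text file to UTF-8; the caller must know the source encoding."""
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text, encoding="utf-8")


def write_output_csv(path: Path, header: list[str], rows: list[list]) -> None:
    """Write a semicolon-delimited UTF-8 CSV with all non-numeric fields quoted."""
    # newline="" lets the csv module control line endings itself.
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter=";", quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(header)
        writer.writerows(rows)
```

In this sketch convert_to_utf8 would be called from the ingestion step, writing into 02-prepared/ rather than back into 01-raw/, so the source file stays untouched.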
Data Directory Structure
data/
01-raw/ # Source files, never modified
02-prepared/ # Cleaned, decoded, intermediate Parquet files
03-output/ # Final output: Parquet + CSV
- 01-raw/: Place input files here. These are read-only; never overwrite source data.
- 02-prepared/: Intermediate results between pipeline steps. Parquet format.
- 03-output/: Final deliverables. Always both Parquet and CSV.
All data directories are in .gitignore. Where possible, include synthetic demo datasets in subfolders of data/ for exploratory use.
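The directory layout and the read-only rule for 01-raw/ could be enforced in code along the following lines. This is a sketch under assumptions: the helper names ensure_data_dirs and writable_path are hypothetical, and the guard simply refuses to hand out write paths inside the raw directory.

```python
from pathlib import Path

RAW, PREPARED, OUTPUT = "01-raw", "02-prepared", "03-output"


def ensure_data_dirs(repo_root: Path) -> dict[str, Path]:
    """Create data/01-raw, data/02-prepared, and data/03-output if they are missing."""
    dirs = {name: repo_root / "data" / name for name in (RAW, PREPARED, OUTPUT)}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def writable_path(repo_root: Path, stage: str, filename: str) -> Path:
    """Return a path for writing, refusing the read-only raw directory."""
    if stage not in (PREPARED, OUTPUT):
        raise ValueError(
            f"pipeline steps may only write to {PREPARED}/ or {OUTPUT}/, not {stage!r}"
        )
    return repo_root / "data" / stage / filename
```

Routing every write through a helper like writable_path makes an accidental overwrite of source data fail loudly instead of silently.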
Privacy and Sensitive Data
Rules
- Never commit personal data (BSN, names, addresses) to git
- Auto-anonymize BSN and other identifiers in the ingestion step
- Sensitive fields are removed or hashed before data leaves 02-prepared/
- Document which fields are considered sensitive in metadata/ or the README
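One possible way to implement the "removed or hashed" rule is sketched below. It is an assumption, not the prescribed CEDA method: the sketch uses a keyed hash (HMAC-SHA-256) rather than a plain hash, the function names pseudonymize and drop_or_hash are hypothetical, and so are the field names used in the usage example.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of an identifier as a hex string."""
    # A keyed hash is used because a BSN has only nine digits: an unkeyed hash
    # could be reversed by hashing every possible value.
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def drop_or_hash(record: dict, hash_fields: set[str], drop_fields: set[str], secret_key: bytes) -> dict:
    """Hash the fields in hash_fields, remove the fields in drop_fields, keep the rest."""
    cleaned = {}
    for field, value in record.items():
        if field in drop_fields:
            continue
        cleaned[field] = pseudonymize(str(value), secret_key) if field in hash_fields else value
    return cleaned
```

For example, drop_or_hash({"bsn": "123456782", "naam": "X", "opleiding": "HBO"}, {"bsn"}, {"naam"}, key) keeps opleiding unchanged, removes naam, and replaces bsn with a 64-character hex digest. The key itself must stay outside the repository, like any other secret.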
Data classification
| Level | Description | Handling |
|---|---|---|
| Public | Aggregated statistics, published data | Can be in repo (demo data) |
| Internal | Anonymized individual records | In .gitignore, share via secure channels |
| Sensitive | Contains BSN, names, or other PII | Must be anonymized before any further processing or output |
Documentation of Data Assumptions
Every repo that processes data should document:
- Expected input format — which fields, types, and sources
- Transformation logic — what changes and why (in code comments or metadata)
- Known data quality issues — missing values, known errors, workarounds
- Output schema — what the final output looks like
Place this documentation in:
- metadata/ directory (for lookup tables and variable definitions)
- Code comments (for transformation logic)
- README.md or CLAUDE.md (for high-level overview)
Data Dictionaries
Every repo that processes or produces data must include a data dictionary for its output datasets. Maintain two formats: machine-readable and human-readable.
Machine-readable: data_dictionary.csv
A ;-delimited CSV file (UTF-8) in the repo's metadata directory (see Project Structure for exact location per language).
| Column | Required | Description |
|---|---|---|
| dataset | yes | Name of the output dataset (e.g. instroom_2024) |
| column_name | yes | Column name as it appears in the data |
| description | yes | Short description of the variable |
| type | yes | Data type: character, integer, double, date, boolean |
| source | yes | Origin: the source system, or derived for computed variables |
| example | no | Example value |
| allowed_values | no | Allowed values or range (e.g. 1-5, MBO;HBO;WO) |
| sensitive | no | true if the field contains personal data |
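The column specification above lends itself to an automated check. The following is a minimal sketch with a hypothetical function name, validate_data_dictionary; it only verifies that the required columns exist and are filled in, and that the type value is one of the five listed data types.

```python
import csv

REQUIRED = ("dataset", "column_name", "description", "type", "source")
ALLOWED_TYPES = {"character", "integer", "double", "date", "boolean"}


def validate_data_dictionary(handle) -> list[str]:
    """Return a list of problems found in a ;-delimited data dictionary (empty if none)."""
    reader = csv.DictReader(handle, delimiter=";")
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # line of the source file this row ended on
        for col in REQUIRED:
            # Short rows yield None for absent fields, hence the "or".
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: '{col}' is empty")
        if row["type"] and row["type"] not in ALLOWED_TYPES:
            problems.append(f"line {line_no}: unknown type '{row['type']}'")
    return problems
```

Note that an allowed_values entry such as MBO;HBO;WO contains the delimiter itself, so it has to be quoted in the file ("MBO;HBO;WO") for any CSV reader to parse the row correctly.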
Human-readable: rendered documentation
Generate a readable version from data_dictionary.csv, for example:
- Quarto rendered to HTML (recommended; fits the existing CEDA workflow)
- R: DT::datatable() or gt::gt() in a Quarto vignette
- Python: table in a Streamlit app page
Location
The data dictionary lives alongside other metadata files, following existing package conventions:
- R repos: inst/metadata/data_dictionary.csv (installed with the package)
- Python repos: src/metadata/data_dictionary.csv (accessible via importlib.resources)
The human-readable version can be a vignette (R) or a page in the app (Shiny/Streamlit).
Maintenance
- Update the data dictionary when columns are added, removed, or renamed
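To notice when the dictionary has fallen behind the data, the columns of an output dataset can be compared with the column_name entries documented for it. This is a sketch with a hypothetical function name, dictionary_drift; how the two column lists are obtained is left to the repo.

```python
def dictionary_drift(data_columns: list[str], documented_columns: list[str]) -> dict[str, list[str]]:
    """Compare a dataset's columns with the column_name entries documented for it."""
    data, documented = set(data_columns), set(documented_columns)
    return {
        "undocumented": sorted(data - documented),  # in the data, missing from the dictionary
        "stale": sorted(documented - data),         # in the dictionary, no longer in the data
    }
```

A renamed column shows up twice in the result: the new name as undocumented and the old name as stale. Both lists being empty means the dictionary and the dataset agree on column names.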