CEDA Project Structure

How cedanl repositories are organized, by type and language.

Repository Types

Type 1: Ingestion Repository

Transforms raw data sources (DUO fixed-width files, CSVs, APIs) into clean, research-ready formats. Other repos depend on ingestion repo output.
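As a rough illustration of what decoding a fixed-width source can look like, the sketch below slices each line by start offset and width. The field names, positions, and sample records are hypothetical and do not reflect an actual DUO layout. A real ingestion repo would read its field definitions from its own metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One column of a fixed-width record (hypothetical layout)."""
    name: str
    start: int  # 0-based offset of the first character
    width: int  # number of characters


# Hypothetical field definitions, for illustration only.
FIELDS = [
    Field("student_id", 0, 6),
    Field("year", 6, 4),
    Field("programme", 10, 8),
]


def decode_line(line: str, fields: list[Field]) -> dict[str, str]:
    """Slice one fixed-width line into a dict of whitespace-stripped strings."""
    return {f.name: line[f.start:f.start + f.width].strip() for f in fields}


def decode_lines(lines: list[str], fields: list[Field]) -> list[dict[str, str]]:
    """Decode every non-blank line; blank lines are skipped."""
    return [decode_line(line, fields) for line in lines if line.strip()]


# Two made-up records matching the hypothetical layout above.
records = decode_lines(["0001232021BIO     ", "0004562022HIST    "], FIELDS)
```

All values stay strings here; type conversion and quality checks would belong to a later validation step.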

Type 2: Analysis / Use-Case Repository

Focuses on analysis, including the data preparation specific to that analysis. Produces data enriched with predictions, analytical results, reports, and visualizations.

Type 3: Template Repository

Starting points for new projects. Contains devcontainer configurations, boilerplate, and example code.

Type 4: Integration / Dashboard Repository (future)

Combines outputs from multiple repos into unified "management" interfaces or dashboards. Runs on SURF Developer Platform. This needs to be explored further.

Shared Standards (All Repos)

Required files

File	Purpose
`README.md`	What the repo does, how to install, how to run
`LICENSE`	MIT license
`.gitignore`	Language-appropriate ignores + data files
`CLAUDE.md`	AI assistant context, references org standards
`.devcontainer/`	Reproducible dev environment

README Guidelines

The README is the project's front page — focused on relevance and usage instructions, not on technical repo structure.

Language: Dutch. Code, variable names, and technical terms may be in English.

Structure:

Title and short description — one sentence explaining what the project does and for whom
Visual example — screenshot, GIF, or diagram showing the end result (dashboard, report, table). Not decorative but functional: it should make clear what the project delivers
Relevance / context — why this project exists, what problem it solves, who the target audience is
Quick start — steps to install and run the project (dependencies, data, commands)
Data — what input is needed, where output goes, reference to data dictionary
References — link to other relevant files with more background on e.g. the technical architecture or other documentation
Contact / contributing — who is responsible, how to contribute

What does not belong in the README:

- Detailed project structure (directory tree) → move to CLAUDE.md
- Technical architecture documentation → move to CLAUDE.md or vignettes/

Data directories

See Data Conventions for full details on data directory structure, file formats, and naming.

In short: data/01-raw/, data/02-prepared/, data/03-output/ with numbered prefixes per pipeline step. Data directories are in .gitignore, but each numbered folder has a /demo subfolder with synthetic data committed to git — making the repo a standalone working product.
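The layout above could be resolved in code along these lines. This is a minimal sketch: the helper names `data_dir` and `ensure_layout` are hypothetical, and only the folder names come from the text.

```python
from pathlib import Path

# Pipeline step folders as named in the text.
STEPS = ("01-raw", "02-prepared", "03-output")


def data_dir(root: Path, step: str, demo: bool = False) -> Path:
    """Return the folder for one pipeline step, or its demo subfolder."""
    if step not in STEPS:
        raise ValueError(f"unknown pipeline step: {step!r}")
    base = root / "data" / step
    return base / "demo" if demo else base


def ensure_layout(root: Path) -> list[Path]:
    """Create each step folder with its demo subfolder; return the demo paths."""
    demo_dirs = [data_dir(root, step, demo=True) for step in STEPS]
    for path in demo_dirs:
        # parents=True also creates data/ and the step folder itself.
        path.mkdir(parents=True, exist_ok=True)
    return demo_dirs
```

Switching between real and synthetic data then comes down to the `demo` flag, which fits the idea that the repo works standalone on its committed demo data.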

CLAUDE.md template

Every repo should have a CLAUDE.md that includes:

# Project Name

## Overview
Brief description of what this repo does and which pipeline stage it covers.

## Standards
Follow CEDA technical standards: https://github.com/cedanl/.github/tree/main/standards/README.md

## Tech Stack
Language, key packages, tooling.

## Project Structure
Directory layout with brief descriptions.

## How to Run
Commands to install dependencies and run the pipeline / app.

## Data
Input/output formats, where data comes from, privacy notes.

Type 1: Ingestion Repository

R Variant

project-name/
├── .devcontainer/
│   └── devcontainer.json
├── .github/
│   └── workflows/
├── R/                          # Package functions
│   ├── ingest_source.R         # Read raw files
│   ├── decode_fields.R         # Parse fixed-width, apply metadata
│   ├── validate_data.R         # Quality checks
│   └── export_data.R           # Write Parquet + CSV output
├── inst/
│   ├── app/                    # Shiny app for interactive mode
│   │   ├── app.R
│   │   └── config.yml          # Fixed config for local data paths
│   └── metadata/               # Mapping tables, reference data, data dictionary
├── data/
│   ├── 01-raw/
│   ├── 02-prepared/
│   └── 03-output/
├── man/                        # Auto-generated roxygen2 docs
├── tests/
│   └── testthat/
├── main.R                      # Pipeline orchestration script
├── DESCRIPTION
├── NAMESPACE
├── CLAUDE.md
├── README.md
├── LICENSE
├── .gitignore
└── renv.lock

Key conventions:

- Functions in R/ are the package, named as verb_object.R
- main.R sources the package and runs the pipeline end-to-end
- Shiny app in inst/app/ wraps package functions with a UI
- config.yml in the app defines local data paths, so users only need to set paths once
- renv manages dependencies, air for code formatting

Python Variant

project-name/
├── .devcontainer/
│   └── devcontainer.json
├── .github/
│   └── workflows/
├── src/
│   └── project_name/           # Package (pip-installable)
│       ├── __init__.py
│       ├── ingest.py           # Read raw files
│       ├── decode.py           # Parse and transform
│       ├── validate.py         # Quality checks
│       ├── export.py           # Write Parquet + CSV output
│       └── metadata/           # Mapping tables, reference data, data dictionary
│           ├── __init__.py
│           └── field_definitions.csv
├── app/                        # Streamlit app for interactive mode
│   ├── main.py                 # Streamlit entrypoint
│   ├── pages/                  # Multi-page app
│   └── config.toml             # Fixed config for local data paths
├── .streamlit/
│   └── config.toml             # Streamlit settings
├── data/
│   ├── 01-raw/
│   ├── 02-prepared/
│   └── 03-output/
├── tests/
├── pyproject.toml              # Package definition + uv config
├── CLAUDE.md
├── README.md
├── LICENSE
├── .gitignore
├── .python-version
└── uv.lock

Key conventions:

- Package code lives in src/project_name/ and is installable via uv pip install -e .
- Metadata (mapping tables, field definitions) lives inside the package at src/project_name/metadata/ and is accessible via importlib.resources
- app/main.py is the Streamlit entrypoint and wraps package functions
- uv for dependency management, ruff for linting/formatting
- config.toml in the app defines local data paths
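Reading packaged metadata through importlib.resources might look roughly like the sketch below. The helper name `load_metadata_table` is hypothetical, and the file is assumed to be a UTF-8 CSV with a header row.

```python
import csv
import io
from importlib.resources import files


def load_metadata_table(package: str, filename: str) -> list[dict[str, str]]:
    """Read a CSV shipped inside `package` and return its rows as dicts.

    Assumes a UTF-8 file with a header row (an assumption of this sketch).
    """
    # files() locates the installed package, so no path relative to the
    # current working directory is needed.
    text = files(package).joinpath(filename).read_text(encoding="utf-8")
    return list(csv.DictReader(io.StringIO(text)))
```

With the layout shown above, a call could be `load_metadata_table("project_name.metadata", "field_definitions.csv")`. This resolves because metadata/ contains an `__init__.py` and is therefore itself a package.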

Type 2: Analysis / Use-Case Repository

R Variant

project-name/
├── .devcontainer/
│   └── devcontainer.json
├── R/                          # Package functions
│   ├── prepare_data.R          # Load and merge input data
│   ├── transform_data.R        # Feature engineering, enrichment
│   ├── run_analysis.R          # Core analysis / modeling
│   ├── create_plots.R          # Visualization functions
│   └── render_report.R         # Report generation
├── inst/
│   ├── app/                    # Shiny app for interactive mode
│   │   ├── app.R
│   │   └── config.yml
│   ├── metadata/               # Mapping tables, reference data, data dictionary
│   └── templates/              # Report templates (Quarto/Rmd)
├── metadata/                   # Lookup tables, variable definitions
├── data/
│   ├── 01-raw/                 # Input from ingestion repos
│   ├── 02-prepared/
│   └── 03-output/
├── man/
├── tests/
│   └── testthat/
├── vignettes/                  # Usage documentation
├── main.R                      # Pipeline orchestration
├── DESCRIPTION
├── NAMESPACE
├── CLAUDE.md
├── README.md
├── LICENSE
├── .gitignore
└── renv.lock

Key conventions:

- Input data comes from ingestion repos (placed in data/01-raw/)
- metadata/ holds lookup tables, variable definitions, configuration
- Report templates (Quarto .qmd or R Markdown) live in inst/templates/
- Shiny app lets users select parameters and run the analysis interactively

Python Variant

project-name/
├── .devcontainer/
│   └── devcontainer.json
├── src/
│   └── project_name/
│       ├── __init__.py
│       ├── prepare.py          # Load and merge input data
│       ├── transform.py        # Feature engineering
│       ├── analyze.py          # Core analysis / modeling
│       ├── visualize.py        # Plotting functions
│       ├── export.py           # Output generation
│       └── metadata/           # Lookup tables, variable definitions, data dictionary
│           ├── __init__.py
│           ├── variabelen.csv
│           └── levels.csv
├── app/                        # Streamlit app
│   ├── main.py
│   ├── pages/
│   └── config.toml
├── .streamlit/
│   └── config.toml
├── data/
│   ├── 01-raw/
│   ├── 02-prepared/
│   └── 03-output/
├── tests/
├── pyproject.toml
├── CLAUDE.md
├── README.md
├── LICENSE
├── .gitignore
├── .python-version
└── uv.lock

Key conventions:

- Same package + app pattern as ingestion repos
- Metadata (lookup tables, variable definitions) lives inside the package at src/project_name/metadata/ and is accessible via importlib.resources
- Streamlit app lets users configure analysis parameters and view results

Type 3: Template Repository

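Templates provide starting points for new projects. They contain the directories described above for the relevant repository type, along with devcontainer configurations, boilerplate, and example code.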