CEDA Python Style Guide

Python coding standards for cedanl repositories. Based on PEP 8 with CEDA-specific conventions for educational data analytics.

Ecosystem

Tooling

Tool	Purpose
`uv`	Package and dependency management (replaces pip, venv, pyenv)
`ruff`	Linting AND formatting (replaces flake8, black, isort)
`pytest`	Testing
`quarto`	Reports and documentation

Preferred packages

Domain	Package	Notes
DataFrames (large)	polars	Fast, memory-efficient, consistent API
DataFrames (interop)	pandas	When libraries require pandas input
Data cleaning	pyjanitor	`clean_names()`, chaining methods
Visualization	plotly	Interactive plots for dashboards
Visualization (static)	matplotlib, seaborn	For reports
Modeling	scikit-learn	Classification, regression, evaluation
Web interface	streamlit	Interactive app mode
CLI	typer	When command-line interface is needed
Console output	rich	Progress bars, formatted output
File formats	pyarrow, fastexcel	Parquet and Excel I/O
Configuration	pyyaml, tomllib (standard library)	YAML/TOML config parsing; tomli is only needed below Python 3.11

See Principles §11 for general dependency selection criteria.

Package Structure

Every Python repo is an installable package. See Project Structure for full directory layouts per repo type.

pyproject.toml

Single source of truth for package metadata, dependencies, and tool configuration:

[project]
name = "project-name"
version = "0.1.0"
description = "Clear one-line description"
requires-python = ">=3.13"
dependencies = [
    "polars>=1.0",
    "streamlit>=1.40",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "ruff>=0.8",
]

[tool.uv]
cache-dir = "./.uv_cache"

[tool.ruff]
line-length = 100
target-version = "py313"

[tool.ruff.lint]
select = ["E", "F", "I", "N", "W", "UP"]

[tool.pytest.ini_options]
testpaths = ["tests"]

Accessing package metadata

Metadata files (CSVs, Excel lookups) live inside the package so they travel with it:

from importlib.resources import files

import polars as pl


def load_variable_definitions() -> pl.DataFrame:
    """Load variable definitions from package metadata."""
    # str(path) assumes the package is installed as regular files on disk.
    path = files("project_name.metadata") / "variabelen.csv"
    return pl.read_csv(str(path), separator=";")

For this to work, include metadata/ inside src/project_name/ with an __init__.py.
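
If the package might be installed in a form where resources are not plain files (for example a zipped distribution), `importlib.resources.as_file` gives a real filesystem path for the duration of a `with` block. The following is a minimal sketch under that assumption; the helper name `read_packaged_lines` is hypothetical, and it reads plain text with the standard library instead of calling Polars.

```python
from importlib.resources import as_file, files


def read_packaged_lines(package: str, filename: str) -> list[str]:
    """Return the lines of a text file shipped inside a package."""
    resource = files(package) / filename
    # as_file() yields a concrete path; if the resource is not a regular
    # file on disk, it is extracted to a temporary file for the block.
    with as_file(resource) as path:
        return path.read_text(encoding="utf-8").splitlines()
```

In a CEDA package the call might look like `read_packaged_lines("project_name.metadata", "variabelen.csv")`. The same `with as_file(...)` block could wrap a `pl.read_csv(path, separator=";")` call instead.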

main.py (CLI entrypoint)

"""Pipeline entrypoint for project-name."""
from project_name.prepare import prepare_data
from project_name.transform import transform_data
from project_name.analyze import run_analysis
from project_name.export import export_results


def main() -> None:
    df = prepare_data("data/01-raw/")
    df = transform_data(df, target_year=2024)  # example value; read from config in practice
    results = run_analysis(df)
    export_results(results, "data/03-output/")


if __name__ == "__main__":
    main()

Syntax

Type hints

Use type hints for function signatures. They serve as documentation and enable static analysis:

# Good
def transform_data(
    df: pl.DataFrame,
    target_year: int,
    min_credits: int = 0,
    include_masters: bool = False,
) -> pl.DataFrame:
    ...

# Bad
def transform_data(df, target_year, min_credits=0, include_masters=False):
    ...

Annotate all function parameters and return types
Use str | None (union syntax, Python 3.10+) over Optional[str]
Use list[str] over List[str] (lowercase generics, Python 3.9+)
Don't annotate every local variable — only where it adds clarity

Docstrings

Use Google-style docstrings:

def process_chunk(
    positions: list[tuple[int, int]],
    chunk: list[str],
) -> list[str]:
    """Process a chunk of fixed-width lines into CSV format.

    Converts fixed-width lines into semicolon-delimited strings
    based on field position definitions.

    Args:
        positions: List of (start, end) tuples defining field boundaries.
        chunk: Lines to process, as strings.

    Returns:
        List of semicolon-delimited strings, one per input line.

    Example:
        >>> process_chunk([(0, 5), (5, 10)], ["abc  def  "])
        ['abc;def']
    """

First line: short summary (one sentence, imperative mood)
Body: additional context when needed
Args: every parameter documented
Returns: what comes back
Example: when the behavior isn't obvious
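
The docstring above has no function body. One possible implementation consistent with it is sketched below. The slicing and whitespace stripping are assumptions inferred from the docstring's example, not the actual CEDA implementation.

```python
def process_chunk(
    positions: list[tuple[int, int]],
    chunk: list[str],
) -> list[str]:
    """Process a chunk of fixed-width lines into CSV format.

    Args:
        positions: List of (start, end) tuples defining field boundaries.
        chunk: Lines to process, as strings.

    Returns:
        List of semicolon-delimited strings, one per input line.

    Example:
        >>> process_chunk([(0, 5), (5, 10)], ["abc  def  "])
        ['abc;def']
    """
    converted = []
    for line in chunk:
        # Slice each field by its (start, end) boundaries and trim the padding.
        fields = [line[start:end].strip() for start, end in positions]
        converted.append(";".join(fields))
    return converted
```

Because the `Example:` section is written in doctest format, it can also be checked with `python -m doctest` on the module.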

Data manipulation with Polars

Prefer Polars for data operations. Its API is consistent and explicit:

# Good: Polars method chaining
students = (
    df.filter(pl.col("INS_Studiejaar") >= 2020)
    .select(["INS_Studentnummer", "INS_Opleidingsnaam", "retentie"])
    .with_columns(
        pl.col("retentie").cast(pl.Int8).alias("retentie_int"),
    )
)

# When pandas is required (e.g., for scikit-learn input)
students_pd = students.to_pandas()

Use method chaining with parenthesized expressions
Explicit column selection with pl.col()
Use .alias() for derived columns
Convert to pandas only at the boundary where a library requires it

When to use Pandas

Libraries that require pandas input (scikit-learn, some plotting libraries)
Small, simple operations where Polars overhead isn't justified
When reading Excel files that pl.read_excel (which uses fastexcel) cannot handle: read with pandas via openpyxl, then convert to Polars

Naming

Follow PEP 8:

# Good
def calculate_retention_rate(df: pl.DataFrame, year: int) -> float:
    enrollment_count = df.filter(pl.col("year") == year).height
    ...

class StudentDataPipeline:
    MAX_RETRY_COUNT = 3
    ...

# Bad
def calcRetRate(df, y):
    ec = df.filter(pl.col("year") == y).height
    ...

Functions and variables: snake_case
Classes: PascalCase
Constants: UPPER_SNAKE_CASE
Function names start with a verb: prepare_data(), validate_schema(), export_results()
Descriptive names: when in doubt, prefer a longer name over a shorter one

Imports

# Good: organized imports
from __future__ import annotations

import json
from pathlib import Path

import polars as pl
from loguru import logger

from project_name.prepare import prepare_data
from project_name.metadata import load_variable_definitions

Standard library first, then third-party, then local (ruff handles this via isort rules)
Use from pathlib import Path — prefer Path over os.path
Import specific names, not entire modules (except for polars as pl, pandas as pd, numpy as np)

Error handling

# Good: specific errors, early validation
def load_input_data(path: Path) -> pl.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {path}")
    if path.suffix not in (".csv", ".parquet"):
        raise ValueError(f"Unsupported format: {path.suffix}")

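    # Dispatch on the validated suffix so Parquet files are not parsed as CSV.
    if path.suffix == ".parquet":
        return pl.read_parquet(path)
    return pl.read_csv(path, separator=";")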

# Bad: catch-all, no validation
def load_input_data(path):
    try:
        return pl.read_csv(str(path), separator=";")
    except Exception:
        return None

Validate inputs at function boundaries (guard clauses)
Raise specific exceptions with descriptive messages
Never catch a broad Exception (or use a bare except:) unless you re-raise it
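
The rules above can be shown without any file I/O. The sketch below uses a hypothetical helper `parse_study_year`. The accepted range of 1900 to 2100 is an assumption made for illustration. It combines a guard clause, a narrowly scoped `except`, and re-raising with `from` so the original cause is preserved.

```python
def parse_study_year(raw_value: str) -> int:
    """Parse a study year such as "2024" from raw text input."""
    # Guard clause: reject empty input before attempting any conversion.
    if not raw_value.strip():
        raise ValueError("Study year is empty")

    try:
        year = int(raw_value)
    except ValueError as exc:
        # Catch only the expected error and re-raise with a clearer message.
        raise ValueError(f"Study year is not a number: {raw_value!r}") from exc

    # Assumed plausibility bounds; adjust to the dataset at hand.
    if not 1900 <= year <= 2100:
        raise ValueError(f"Study year out of range: {year}")
    return year
```

Callers either get a valid integer or an exception that names the offending value. The function never returns `None` silently.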

Configuration

# Good: config separate from logic
import tomllib
from pathlib import Path

def load_config(config_path: Path = Path("app/config.toml")) -> dict:
    with open(config_path, "rb") as f:
        return tomllib.load(f)

config = load_config()
input_path = Path(config["paths"]["input"])

Use TOML for configuration (tomllib is in the standard library from Python 3.11)
Keep configuration files in app/ (for Streamlit) or project root
Never hardcode file paths — use config or arguments
Use pathlib.Path for all path handling

Testing

# tests/test_transform.py
import polars as pl
import pytest
from project_name.transform import transform_data


@pytest.fixture
def sample_data():
    return pl.DataFrame({
        "INS_Studiejaar": [2023, 2024, 2024],
        "INS_Studentnummer": ["001", "002", "003"],
        "retentie": [1, 0, 1],
    })


def test_transform_filters_by_year(sample_data):
    result = transform_data(sample_data, target_year=2024)
    assert result.height == 2
    assert result["INS_Studiejaar"].to_list() == [2024, 2024]


def test_transform_raises_on_empty():
    empty = pl.DataFrame({"INS_Studiejaar": [], "retentie": []})
    with pytest.raises(ValueError, match="No data"):
        transform_data(empty, target_year=2024)

Test file mirrors source: src/project_name/transform.py → tests/test_transform.py
Use fixtures for test data
Test expected outputs, edge cases, and error conditions
Run with uv run pytest

Interactive App (Streamlit or Shiny for Python)

The app lives outside the package in app/. This is different from R, where the app lives inside the package (inst/app/). The reason: both streamlit run and shiny run require a file path, not a Python module.

Key conventions (both frameworks)

App in app/ directory — NOT inside src/project_name/
App contains NO business logic — only UI and calls to package functions
config.toml in app/ for local data paths
Keep the app thin — orchestration and presentation only

Dependency Management

pyproject.toml defines dependencies
uv.lock locks exact versions (commit to git)
.python-version pins the Python version
Don't commit .uv_cache/