CEDA Python Style Guide
Python coding standards for cedanl repositories. Based on PEP 8 with CEDA-specific conventions for educational data analytics.
Ecosystem
Tooling
| Tool | Purpose |
|---|---|
| uv | Package and dependency management (replaces pip, venv, pyenv) |
| ruff | Linting AND formatting (replaces flake8, black, isort) |
| pytest | Testing |
| quarto | Reports and documentation |
Preferred packages
| Domain | Package | Notes |
|---|---|---|
| DataFrames (large) | polars | Fast, memory-efficient, consistent API |
| DataFrames (interop) | pandas | When libraries require pandas input |
| Data cleaning | pyjanitor | clean_names(), chaining methods |
| Visualization | plotly | Interactive plots for dashboards |
| Visualization (static) | matplotlib, seaborn | For reports |
| Modeling | scikit-learn | Classification, regression, evaluation |
| Web interface | streamlit | Interactive app mode |
| CLI | typer | When command-line interface is needed |
| Console output | rich | Progress bars, formatted output |
| File formats | pyarrow, fastexcel | Parquet and Excel I/O |
| Configuration | pyyaml, tomllib (standard library) | YAML/TOML config parsing |
See Principles §11 for general dependency selection criteria.
Package Structure
Every Python repo is an installable package. See Project Structure for full directory layouts per repo type.
pyproject.toml
Single source of truth for package metadata, dependencies, and tool configuration:
[project]
name = "project-name"
version = "0.1.0"
description = "Clear one-line description"
requires-python = ">=3.13"
dependencies = [
"polars>=1.0",
"streamlit>=1.40",
]
[project.optional-dependencies]
dev = [
"pytest>=8.0",
"ruff>=0.8",
]
[tool.uv]
cache-dir = "./.uv_cache"
[tool.ruff]
line-length = 100
target-version = "py313"
[tool.ruff.lint]
select = ["E", "F", "I", "N", "W", "UP"]
[tool.pytest.ini_options]
testpaths = ["tests"]
Accessing package metadata
Metadata files (CSVs, Excel lookups) live inside the package so they travel with it:
from importlib.resources import files

import polars as pl


def load_variable_definitions() -> pl.DataFrame:
    """Load variable definitions from package metadata."""
    path = files("project_name.metadata") / "variabelen.csv"
    # str(path) assumes the package is installed as regular files, not zipped
    return pl.read_csv(str(path), separator=";")
For this to work, include metadata/ inside src/project_name/ with an __init__.py.
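If the package might be installed from a zip or wheel, `importlib.resources.as_file` is one way to get a real filesystem path for the resource. The sketch below is stdlib-only so it runs anywhere: it builds a throwaway package named `demo_pkg` (a hypothetical name, not part of any CEDA repo) and reads the CSV with the `csv` module instead of Polars.

```python
import csv
import sys
import tempfile
from importlib.resources import as_file, files
from pathlib import Path


def read_packaged_csv(package: str, filename: str) -> list[dict[str, str]]:
    """Read a semicolon-delimited CSV that ships inside a package."""
    resource = files(package) / filename
    # as_file() yields a real filesystem path, extracting the file if needed.
    with as_file(resource) as path:
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle, delimiter=";"))


# Build a throwaway package so the sketch is runnable (hypothetical names).
tmp_root = Path(tempfile.mkdtemp())
metadata_dir = tmp_root / "demo_pkg" / "metadata"
metadata_dir.mkdir(parents=True)
(tmp_root / "demo_pkg" / "__init__.py").write_text("", encoding="utf-8")
(metadata_dir / "__init__.py").write_text("", encoding="utf-8")
(metadata_dir / "variabelen.csv").write_text(
    "naam;type\nretentie;int\n", encoding="utf-8"
)
sys.path.insert(0, str(tmp_root))

rows = read_packaged_csv("demo_pkg.metadata", "variabelen.csv")
```

In a real repo the same pattern would pass the path from `as_file` to `pl.read_csv` inside the `with` block.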
main.py (CLI entrypoint)
"""Pipeline entrypoint for project-name."""
from project_name.prepare import prepare_data
from project_name.transform import transform_data
from project_name.analyze import run_analysis
from project_name.export import export_results


def main() -> None:
    df = prepare_data("data/01-raw/")
    df = transform_data(df)
    results = run_analysis(df)
    export_results(results, "data/03-output/")


if __name__ == "__main__":
    main()
Syntax
Type hints
Use type hints for function signatures. They serve as documentation and enable static analysis:
# Good
def transform_data(
    df: pl.DataFrame,
    target_year: int,
    min_credits: int = 0,
    include_masters: bool = False,
) -> pl.DataFrame:
    ...

# Bad
def transform_data(df, target_year, min_credits=0, include_masters=False):
    ...
- All function parameters and return types annotated
- Use str | None (union syntax, Python 3.10+) over Optional[str]
- Use list[str] over List[str] (built-in generics, Python 3.9+)
- Don't annotate every local variable; only where it adds clarity
Docstrings
Use Google-style docstrings:
def process_chunk(
    positions: list[tuple[int, int]],
    chunk: list[str],
) -> list[str]:
    """Process a chunk of fixed-width lines into CSV format.

    Converts fixed-width lines into semicolon-delimited strings
    based on field position definitions.

    Args:
        positions: List of (start, end) tuples defining field boundaries.
        chunk: Lines to process, as strings.

    Returns:
        List of semicolon-delimited strings, one per input line.

    Example:
        >>> process_chunk([(0, 5), (5, 10)], ["abc  def  "])
        ['abc;def']
    """
- First line: short summary (one sentence, imperative mood)
- Body: additional context when needed
- Args: every parameter documented
- Returns: what comes back
- Example: when the behavior isn't obvious
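The guide shows only the docstring of `process_chunk`, not its body. The following is one possible implementation, assuming each field is padded with spaces and should be trimmed before joining. Treat it as a sketch, not the repository's actual code.

```python
def process_chunk(
    positions: list[tuple[int, int]],
    chunk: list[str],
) -> list[str]:
    """Convert fixed-width lines into semicolon-delimited strings."""
    rows = []
    for line in chunk:
        # Slice each field by its (start, end) bounds and trim the padding
        # (trimming is an assumption; the source does not state it).
        fields = [line[start:end].strip() for start, end in positions]
        rows.append(";".join(fields))
    return rows


result = process_chunk([(0, 5), (5, 10)], ["abc  def  ", "12345678  "])
```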
Data manipulation with Polars
Prefer Polars for data operations. Its API is consistent and explicit:
# Good: Polars method chaining
students = (
    df.filter(pl.col("INS_Studiejaar") >= 2020)
    .select(["INS_Studentnummer", "INS_Opleidingsnaam", "retentie"])
    .with_columns(
        pl.col("retentie").cast(pl.Int8).alias("retentie_int"),
    )
)

# When pandas is required (e.g., for scikit-learn input)
students_pd = students.to_pandas()
- Use method chaining with parenthesized expressions
- Explicit column selection with pl.col()
- Use .alias() for derived columns
- Convert to pandas only at the boundary where a library requires it
When to use Pandas
- Libraries that require pandas input (scikit-learn, some plotting libraries)
- Small, simple operations where Polars overhead isn't justified
- When reading Excel files (via openpyxl/fastexcel, then convert to Polars)
Naming
Follow PEP 8:
# Good
def calculate_retention_rate(df: pl.DataFrame, year: int) -> float:
    enrollment_count = df.filter(pl.col("year") == year).height
    ...

class StudentDataPipeline:
    MAX_RETRY_COUNT = 3
    ...

# Bad
def calcRetRate(df, y):
    ec = df.filter(pl.col("year") == y).height
    ...
- Functions and variables: snake_case
- Classes: PascalCase
- Constants: UPPER_SNAKE_CASE
- Function names start with a verb: prepare_data(), validate_schema(), export_results()
- Descriptive names: when in doubt, longer is better than short
Imports
# Good: organized imports
from __future__ import annotations

import json
from pathlib import Path

import polars as pl
from loguru import logger

from project_name.metadata import load_variable_definitions
from project_name.prepare import prepare_data
- Standard library first, then third-party, then local (ruff handles this via isort rules)
- Use from pathlib import Path; prefer Path over os.path
- Import specific names, not entire modules (except for polars as pl, pandas as pd, numpy as np)
Error handling
# Good: specific errors, early validation
def load_input_data(path: Path) -> pl.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {path}")
    if path.suffix not in (".csv", ".parquet"):
        raise ValueError(f"Unsupported format: {path.suffix}")
    # Dispatch on the validated suffix so Parquet files are not parsed as CSV
    if path.suffix == ".parquet":
        return pl.read_parquet(path)
    return pl.read_csv(path, separator=";")

# Bad: catch-all, no validation
def load_input_data(path):
    try:
        return pl.read_csv(str(path), separator=";")
    except Exception:
        return None
- Validate inputs at function boundaries (guard clauses)
- Raise specific exceptions with descriptive messages
- Never catch bare Exception unless re-raising
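The "unless re-raising" exception can be sketched as follows, using only the standard library. `InputDataError`, `read_raw_text`, and `run_step` are hypothetical names introduced for illustration. The first function catches a narrow exception type and chains it; the second catches `Exception` only to log and re-raise.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class InputDataError(RuntimeError):
    """Raised when an input file cannot be loaded (hypothetical name)."""


def read_raw_text(path: Path) -> str:
    """Read a CSV file as text, translating I/O errors into a domain error."""
    if path.suffix != ".csv":
        raise ValueError(f"Unsupported format: {path.suffix}")
    try:
        return path.read_text(encoding="utf-8")
    except OSError as exc:
        # Narrow except clause; "from exc" preserves the original cause.
        raise InputDataError(f"Could not read {path}") from exc


def run_step(path: Path) -> str:
    """Run one pipeline step, logging any failure before propagating it."""
    try:
        return read_raw_text(path)
    except Exception:
        # Catching Exception is acceptable here only because we re-raise.
        logger.exception("Step failed for %s", path)
        raise
```

Callers still see the specific exception type, so nothing is silently swallowed the way it is with `return None`.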
Configuration
# Good: config separate from logic
import tomllib
from pathlib import Path


def load_config(config_path: Path = Path("app/config.toml")) -> dict:
    with open(config_path, "rb") as f:
        return tomllib.load(f)


config = load_config()
input_path = Path(config["paths"]["input"])
- Use TOML for configuration (tomllib is in the standard library from Python 3.11)
- Keep configuration files in app/ (for Streamlit) or project root
- Never hardcode file paths; use config or arguments
- Use pathlib.Path for all path handling
Testing
# tests/test_transform.py
import polars as pl
import pytest

from project_name.transform import transform_data


@pytest.fixture
def sample_data():
    return pl.DataFrame({
        "INS_Studiejaar": [2023, 2024, 2024],
        "INS_Studentnummer": ["001", "002", "003"],
        "retentie": [1, 0, 1],
    })


def test_transform_filters_by_year(sample_data):
    result = transform_data(sample_data, target_year=2024)
    assert result.height == 2
    # Compare the lists directly; all(list == list) would call all() on a bool
    assert result["INS_Studiejaar"].to_list() == [2024, 2024]


def test_transform_raises_on_empty():
    empty = pl.DataFrame({"INS_Studiejaar": [], "retentie": []})
    with pytest.raises(ValueError, match="No data"):
        transform_data(empty, target_year=2024)
- Test file mirrors source: src/project_name/transform.py → tests/test_transform.py
- Use fixtures for test data
- Test expected outputs, edge cases, and error conditions
- Run with uv run pytest
Interactive App (Streamlit or Shiny for Python)
The app lives outside the package in app/. This is different from R, where the app lives inside the package (inst/app/). The reason: both streamlit run and shiny run require a file path, not a Python module.
Key conventions (both frameworks)
- App in app/ directory, NOT inside src/project_name/
- App contains NO business logic, only UI and calls to package functions
- config.toml in app/ for local data paths
- Keep the app thin: orchestration and presentation only
Dependency Management
- pyproject.toml defines dependencies
- uv.lock locks exact versions (commit to git)
- .python-version pins the Python version
- Don't commit .uv_cache/