API reference

The public API of eencijferho is available via direct imports:

from eencijferho import run_turbo_convert_pipeline
from eencijferho import process_txt_folder, write_variable_metadata

All public functions are listed in __all__.


Pipeline

eencijferho.core.pipeline.run_turbo_convert_pipeline(storage, input_dir='data/01-input', dec_metadata_json=None, output_dir='data/02-output', metadata_dir=None, progress_callback=None, status_callback=None, output_config=None)

Run the full turbo-convert pipeline.

Parameters:

  input_dir (str, default 'data/01-input'):
      Folder containing input fixed-width / CSV files.
  dec_metadata_json (str | None, default None):
      Path to the DEC bestandsbeschrijving JSON. When None it is derived
      automatically from metadata_dir.
  output_dir (str, default 'data/02-output'):
      Folder where converted files are written.
  metadata_dir (str | None, default None):
      Folder used for metadata (JSON, Excel, logs). When None it defaults
      to data/00-metadata (legacy behaviour).
  progress_callback (Callable[[int], None] | None, default None):
      Optional callable(int) for progress 0-100.
  status_callback (Callable[[str], None] | None, default None):
      Optional callable(str) for status messages.
  output_config (OutputConfig | None, default None):
      Controls which output variants are produced. When None the defaults
      from OutputConfig are used (decoded + enriched + parquet + encrypt +
      snake_case).
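The two callbacks are plain callables; the sketch below shows their expected shape. The pipeline call itself is commented out because it needs a configured storage backend and input data:

```python
# Sketch of progress/status callbacks the pipeline can invoke.
progress_log: list[int] = []
status_log: list[str] = []

def on_progress(pct: int) -> None:
    progress_log.append(pct)  # pipeline reports values in the range 0-100

def on_status(msg: str) -> None:
    status_log.append(msg)    # human-readable stage messages

# run_turbo_convert_pipeline(storage,
#                            progress_callback=on_progress,
#                            status_callback=on_status)

# Simulate what the pipeline would do at the end of a run:
on_progress(100)
on_status("done")
```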

Extractor

eencijferho.core.extractor.process_txt_folder(storage, input_folder, json_output_folder='data/00-metadata/json')

Finds all .txt files containing 'Bestandsbeschrijving' in the root directory and extracts tables from them. Also processes all .asc files in the root directory.

Parameters:

  input_folder (str, required):
      Folder to search for .txt and .asc files.
  json_output_folder (str, default 'data/00-metadata/json'):
      Output folder for JSON files.

Returns:

  list[str]: Paths to all extracted JSON files.

Edge Cases
  • Removes any existing JSON files in the output folder.
  • Handles missing input folder gracefully.
Example

process_txt_folder(storage, 'data/01-input')
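The discovery step described above (root directory only, graceful on a missing folder) can be sketched with pathlib. find_inputs is a hypothetical helper for illustration, not part of the package:

```python
from pathlib import Path

def find_inputs(folder: str) -> list[Path]:
    # Hypothetical sketch: Bestandsbeschrijving .txt files plus all .asc
    # files, non-recursive; a missing input folder yields an empty list.
    root = Path(folder)
    if not root.is_dir():
        return []
    txts = [p for p in root.glob("*.txt") if "Bestandsbeschrijving" in p.name]
    return sorted(txts) + sorted(root.glob("*.asc"))
```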

eencijferho.core.extractor.write_variable_metadata(storage, input_dir='data/01-input', json_folder='data/00-metadata/json', output_filename='variable_metadata.json')

Scans all Bestandsbeschrijving*.txt files (recursively) from input_dir and writes a consolidated variable metadata JSON.

Parameters:

  input_dir (str, default 'data/01-input'):
      Folder to search for input text files.
  json_folder (str, default 'data/00-metadata/json'):
      Output folder for the JSON file.
  output_filename (str, default 'variable_metadata.json'):
      Name of the output file.

Returns:

  None

Edge Cases
  • Handles missing input files or parser errors gracefully.
  • Avoids duplicate variables by name across all files.
Example

write_variable_metadata(storage)
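The by-name deduplication from the edge cases can be sketched as follows. Both the record shape and the first-occurrence-wins rule are assumptions for illustration:

```python
# Assumed record shape; the real parser output may differ.
parsed = [
    {"name": "BRIN-nummer", "source": "file_a.txt"},
    {"name": "BRIN-nummer", "source": "file_b.txt"},  # duplicate name, skipped
    {"name": "Opleidingscode (CROHO)", "source": "file_b.txt"},
]

seen: set[str] = set()
variables = []
for var in parsed:
    if var["name"] not in seen:  # keep only the first occurrence per name
        seen.add(var["name"])
        variables.append(var)

print([v["source"] for v in variables])  # ['file_a.txt', 'file_b.txt']
```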

eencijferho.core.extractor.process_json_folder(storage, json_input_folder='data/00-metadata/json', excel_output_folder='data/00-metadata')

Processes all JSON files in a folder, converting tables to Excel files.

Parameters:

  json_input_folder (str, default 'data/00-metadata/json'):
      Folder containing JSON files.
  excel_output_folder (str, default 'data/00-metadata'):
      Output folder for Excel files.

Returns:

  None

Edge Cases
  • Removes any existing Excel files in the output folder.
  • Handles missing or empty input folder gracefully.
Example

process_json_folder(storage, 'data/00-metadata/json', 'data/00-metadata')

eencijferho.core.extractor.get_fwf_params(txt_file, table_index=0)

Extract field names and column specs from a DUO bestandsbeschrijving .txt file.

Returns a dict that can be passed directly to pandas.read_fwf() as keyword arguments, so no manual parsing of the fixed-width metadata is needed:

import pandas as pd
from eencijferho import get_fwf_params

params = get_fwf_params("Bestandsbeschrijving_1cyferho_2023_v1.1.txt")
df = pd.read_fwf("1cyferho_2023.asc", encoding="latin-1", header=None, **params)

The colspecs values follow the pandas convention: 0-based, half-open intervals [start, end). Startpositie from DUO is 1-based, so start = Startpositie - 1 and end = Startpositie - 1 + Aantal posities.
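The 1-based-to-half-open conversion can be checked by hand. The field layout below is illustrative (widths chosen to reproduce the example output shown further down):

```python
# Illustrative field layout: (name, Startpositie, Aantal posities).
fields = [
    ("Onderwijstype HO", 1, 3),
    ("BRIN-nummer", 4, 4),
    ("Opleidingscode (CROHO)", 8, 5),
]

names = [name for name, _, _ in fields]
# pandas convention: 0-based, half-open intervals [start, end)
colspecs = [(start - 1, start - 1 + width) for _, start, width in fields]
print(colspecs)  # [(0, 3), (3, 7), (7, 12)]
```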

Parameters:

Name Type Description Default
txt_file str

Path to the DUO bestandsbeschrijving .txt (or .asc) file.

required
table_index int

Which table to use when the file contains multiple tables. Defaults to 0 (the first / main table).

0

Returns:

  dict: {"names": list[str], "colspecs": list[tuple[int, int]]}

Raises:

  ValueError: When the file cannot be parsed, no tables are found, or
  table_index is out of range.

Example:

>>> params = get_fwf_params("Bestandsbeschrijving_1cyferho_2023_v1.1.txt")
>>> params["names"][:3]
['Onderwijstype HO', 'BRIN-nummer', 'Opleidingscode (CROHO)']
>>> params["colspecs"][:3]
[(0, 3), (3, 7), (7, 12)]

eencijferho.core.extractor.list_fwf_tables(txt_file)

List the table names found in a DUO bestandsbeschrijving .txt file.

Useful for discovering which tables are available before calling get_fwf_params with a specific table_index:

>>> list_fwf_tables("Bestandsbeschrijving_1cyferho_2023_v1.1.txt")
['1CijferHO 2023']

Parameters:

Name Type Description Default
txt_file str

Path to the DUO bestandsbeschrijving .txt (or .asc) file.

required

Returns:

  list[str]: List of table titles, in order. Returns an empty list when
  the file cannot be read or contains no tables.


Validation

eencijferho.utils.extractor_validation.validate_metadata_folder(storage, metadata_folder='data/00-metadata', return_dict=False)

Validates all Excel files in a metadata folder and returns a summary of results.

Parameters:

  metadata_folder (str, default 'data/00-metadata'):
      Path to the folder containing Excel metadata files.
  return_dict (bool, default False):
      If True, returns a dictionary with validation results for each
      file. If False, returns None.

Returns:

  dict[str, dict[str, Any]] | None: Dictionary mapping file names to
  their validation results if return_dict is True, otherwise None.

Edge Cases
  • If no Excel files are found, prints a warning and returns empty dict or None.
  • Handles and logs errors for individual files without stopping the batch process.
  • Saves both timestamped and latest logs in the log folder.
Example

results = validate_metadata_folder(storage, 'data/00-metadata', return_dict=True)
for fname, res in results.items():
    print(fname, res)

eencijferho.utils.converter_validation.converter_validation(storage, conversion_log_path='data/00-metadata/logs/(5)_conversion_log_latest.json', matching_log_path='data/00-metadata/logs/(4)_file_matching_log_latest.json', output_log_path='data/00-metadata/logs/(6)_conversion_validation_log_latest.json')

Validates that row counts in the matching log match the total lines in the conversion log for each processed file.

Parameters:

  conversion_log_path (str, default 'data/00-metadata/logs/(5)_conversion_log_latest.json'):
      Path to the conversion log JSON file.
  matching_log_path (str, default 'data/00-metadata/logs/(4)_file_matching_log_latest.json'):
      Path to the file matching log JSON file.
  output_log_path (str, default 'data/00-metadata/logs/(6)_conversion_validation_log_latest.json'):
      Path to save the output validation log.

Returns:

  dict[str, Any]: Dictionary containing the validation summary, including
  total files, successful and failed conversions, and per-file details.

Edge Cases
  • If a file is present in the matching log but not in the conversion log, it is ignored.
  • Only files with status 'success' in the conversion log are validated.
  • If row counts do not match, the file is marked as failed with an error message.
  • Handles missing or empty details gracefully.
Example

results = converter_validation(storage)
print(results["successful_conversions"])
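The row-count comparison can be sketched with in-memory stand-ins for the two logs. The log schema shown here is assumed for illustration, not the library's actual format:

```python
# Assumed log schema, for illustration only.
matching_log = {"EV2023.asc": {"row_count": 1000},
                "VAKHAVW.asc": {"row_count": 50}}
conversion_log = {"EV2023.asc": {"status": "success", "total_lines": 1000},
                  "VAKHAVW.asc": {"status": "success", "total_lines": 49}}

details = {}
for fname, match in matching_log.items():
    conv = conversion_log.get(fname)
    if conv is None or conv["status"] != "success":
        continue  # ignored / not validated, per the edge cases above
    if conv["total_lines"] == match["row_count"]:
        details[fname] = "success"
    else:
        details[fname] = "failed: row count mismatch"

print(details["VAKHAVW.asc"])  # failed: row count mismatch
```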

eencijferho.utils.converter_match.match_files(storage, input_folder, log_path='data/00-metadata/logs/(3)_xlsx_validation_log_latest.json')

Match input files with metadata files and log the results.

Special matching rules:
  • Files starting with "EV" match metadata files containing "1cyferho".
  • Files containing "VAKHAVW" match metadata files containing "Vakgegevens".
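The two special rules can be expressed as a small predicate. This is an illustrative version only; the real matcher in converter_match may apply further normalisation:

```python
def matches(data_file: str, metadata_file: str) -> bool:
    # "EV" data files pair with "1cyferho" metadata,
    # "VAKHAVW" data files pair with "Vakgegevens" metadata.
    if data_file.startswith("EV"):
        return "1cyferho" in metadata_file
    if "VAKHAVW" in data_file:
        return "Vakgegevens" in metadata_file
    return False

print(matches("EV2023.asc", "Bestandsbeschrijving_1cyferho_2023.txt"))  # True
```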


Output helpers

eencijferho.utils.compressor.convert_csv_to_parquet(storage, input_dir=None)

Compresses each CSV file in the input directory to a Parquet file.

eencijferho.utils.encryptor.encryptor(storage, input_dir=None, output_dir=None)

Encrypts sensitive columns (e.g. BSN) in the output files.

eencijferho.utils.converter_headers.convert_csv_headers_to_snake_case(storage, input_dir=None, delimiter=';', encoding='utf-8', quote_char='"', infer_schema_length=0)

Convert all CSV file headers in the input directory to snake_case.

Parameters:

  input_dir (str | None, default None):
      Path to the directory containing CSV files.
  delimiter (str, default ';'):
      CSV delimiter.
  encoding (str, default 'utf-8'):
      File encoding.
  quote_char (str, default '"'):
      Quote character to use; an empty string ("") disables quoting.
  infer_schema_length (int | None, default 0):
      Number of rows to scan for schema inference.
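A plausible header normalisation, sketched with the standard re module. The library's exact conversion rules are not documented here and may differ:

```python
import re

def to_snake_case(header: str) -> str:
    # Hypothetical sketch: replace non-alphanumeric runs with "_",
    # split camelCase boundaries, collapse underscores, lowercase.
    s = re.sub(r"[^0-9A-Za-z]+", "_", header.strip())
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", s)
    return re.sub(r"_+", "_", s).strip("_").lower()

print(to_snake_case("Opleidingscode (CROHO)"))  # opleidingscode_croho
```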

Configuration

eencijferho.config.OutputConfig dataclass

Controls which output variants the pipeline produces.

Attributes:

  variants (list[str]):
      Which decoded variants to create. Supported values: "decoded"
      (Dec-only substitution) and "enriched" (Dec + variable_metadata
      label substitution).
  formats (list[str]):
      Extra output formats. "parquet" compresses each CSV to a Parquet
      file.
  encrypt (bool):
      When True, sensitive columns (e.g. BSN) are encrypted.
  column_casing (str):
      Header style applied to all output CSV/Parquet files. "snake_case"
      converts headers to snake_case; "none" leaves headers unchanged.
  convert_ev (bool):
      When True (default), EV main data files are converted from
      fixed-width to CSV.
  convert_vakhavw (bool):
      When True (default), VAKHAVW main data files are converted from
      fixed-width to CSV.
  decode_columns (list[str] | None):
      Column names to decode via Dec_* lookup tables. None decodes all
      available columns.
  enrich_variables (list[str] | None):
      Variable names to enrich via variable_metadata labels. None
      enriches all available variables.

Example (CSV-only, no encryption, no header rename):

OutputConfig(variants=["decoded"], formats=[], encrypt=False, column_casing="none")
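
The documented defaults can be illustrated with a minimal stand-in dataclass, so the example runs without the package installed. Field types are assumed; the real class is eencijferho.config.OutputConfig:

```python
from dataclasses import dataclass, field

# Minimal stand-in mirroring the documented fields and defaults
# (decoded + enriched + parquet + encrypt + snake_case).
@dataclass
class OutputConfig:
    variants: list = field(default_factory=lambda: ["decoded", "enriched"])
    formats: list = field(default_factory=lambda: ["parquet"])
    encrypt: bool = True
    column_casing: str = "snake_case"

# CSV-only output, no encryption, headers left unchanged:
cfg = OutputConfig(variants=["decoded"], formats=[], encrypt=False,
                   column_casing="none")
print(cfg)
```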