# API reference

The public API of eencijferho is available via direct import:

    from eencijferho import run_turbo_convert_pipeline
    from eencijferho import process_txt_folder, write_variable_metadata

All public functions are listed in `__all__`.
## Pipeline
### `eencijferho.core.pipeline.run_turbo_convert_pipeline(storage, input_dir='data/01-input', dec_metadata_json=None, output_dir='data/02-output', metadata_dir=None, progress_callback=None, status_callback=None, output_config=None)`
Run the full turbo-convert pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_dir` | `str` | Folder containing input fixed-width / CSV files. | `'data/01-input'` |
| `dec_metadata_json` | `str \| None` | Path to the DEC bestandsbeschrijving JSON. When `None` it is derived automatically from `metadata_dir`. | `None` |
| `output_dir` | `str` | Folder where converted files are written. | `'data/02-output'` |
| `metadata_dir` | `str \| None` | Folder used for metadata (JSON, Excel, logs). When `None` it defaults to … | `None` |
| `progress_callback` | `Callable[[int], None] \| None` | Optional callable receiving progress as an `int` from 0 to 100. | `None` |
| `status_callback` | `Callable[[str], None] \| None` | Optional callable receiving status messages as `str`. | `None` |
| `output_config` | `OutputConfig \| None` | Controls which output variants are produced. When `None`, the defaults from `OutputConfig` are used. | `None` |
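The two callback parameters follow a simple contract: `progress_callback` receives an integer percentage and `status_callback` a message string. A minimal sketch of wiring them up; the pipeline call itself is commented out, since it needs an installed eencijferho and a `storage` object:

```python
progress_log: list[int] = []
status_log: list[str] = []

def on_progress(pct: int) -> None:
    # Called by the pipeline with values from 0 to 100.
    progress_log.append(pct)

def on_status(msg: str) -> None:
    # Called by the pipeline with human-readable status messages.
    status_log.append(msg)

# With the library installed, the callbacks would be passed like this:
# from eencijferho import run_turbo_convert_pipeline
# run_turbo_convert_pipeline(storage, progress_callback=on_progress, status_callback=on_status)

# Simulated invocation, as the pipeline might do at start and finish:
on_status("converting")
on_progress(0)
on_progress(100)
```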
## Extractor
### `eencijferho.core.extractor.process_txt_folder(storage, input_folder, json_output_folder='data/00-metadata/json')`
Finds all .txt files containing 'Bestandsbeschrijving' in the root directory and extracts tables from them. Also processes all .asc files in the root directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_folder` | `str` | Folder to search for `.txt` and `.asc` files. | required |
| `json_output_folder` | `str` | Output folder for JSON files. | `'data/00-metadata/json'` |
Returns:

| Type | Description |
|---|---|
| `list[str]` | Paths to all extracted JSON files. |
Edge Cases
- Removes any existing JSON files in the output folder.
- Handles missing input folder gracefully.
Example:

    process_txt_folder('data/01-input')
### `eencijferho.core.extractor.write_variable_metadata(storage, input_dir='data/01-input', json_folder='data/00-metadata/json', output_filename='variable_metadata.json')`
Scans all Bestandsbeschrijving*.txt files (recursively) from input_dir and writes a consolidated variable metadata JSON.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_dir` | `str` | Folder to search for input text files. | `'data/01-input'` |
| `json_folder` | `str` | Output folder for the JSON file. | `'data/00-metadata/json'` |
| `output_filename` | `str` | Name for the output file. | `'variable_metadata.json'` |
Returns:

| Type | Description |
|---|---|
| `None` | The consolidated metadata JSON is written to `json_folder`. |
Edge Cases
- Handles missing input files or parser errors gracefully.
- Avoids duplicate variables by name across all files.
Example:

    write_variable_metadata()
### `eencijferho.core.extractor.process_json_folder(storage, json_input_folder='data/00-metadata/json', excel_output_folder='data/00-metadata')`
Processes all JSON files in a folder, converting tables to Excel files.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `json_input_folder` | `str` | Folder containing JSON files. | `'data/00-metadata/json'` |
| `excel_output_folder` | `str` | Output folder for Excel files. | `'data/00-metadata'` |
Returns:

| Type | Description |
|---|---|
| `None` | The Excel files are written to `excel_output_folder`. |
Edge Cases
- Removes any existing Excel files in the output folder.
- Handles missing or empty input folder gracefully.
Example:

    process_json_folder('data/00-metadata/json', 'data/00-metadata')
### `eencijferho.core.extractor.get_fwf_params(txt_file, table_index=0)`
Extract field names and column specs from a DUO bestandsbeschrijving .txt file.
Returns a dict that can be passed directly to `pandas.read_fwf()` as keyword arguments, so no manual parsing of the fixed-width metadata is needed:

    import pandas as pd
    from eencijferho import get_fwf_params

    params = get_fwf_params("Bestandsbeschrijving_1cyferho_2023_v1.1.txt")
    df = pd.read_fwf("1cyferho_2023.asc", encoding="latin-1", header=None, **params)
The `colspecs` values follow the pandas convention: 0-based, half-open intervals `[start, end)`. `Startpositie` from DUO is 1-based, so `start = Startpositie - 1` and `end = Startpositie - 1 + Aantal posities`.
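This 1-based to 0-based conversion can be verified with a few lines; `to_colspec` is a hypothetical helper that just restates the formula from the text:

```python
def to_colspec(startpositie: int, aantal_posities: int) -> tuple[int, int]:
    # pandas colspecs are 0-based, half-open intervals [start, end),
    # while Startpositie from DUO counts from 1.
    start = startpositie - 1
    return (start, start + aantal_posities)

# Three consecutive fields of widths 3, 4, and 5 starting at
# Startpositie 1, 4, and 8 yield adjacent, non-overlapping specs:
specs = [to_colspec(1, 3), to_colspec(4, 4), to_colspec(8, 5)]
# → [(0, 3), (3, 7), (7, 12)]
```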
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `txt_file` | `str` | Path to the DUO bestandsbeschrijving `.txt` file. | required |
| `table_index` | `int` | Which table to use when the file contains multiple tables. `0` selects the first / main table. | `0` |
Returns:

| Type | Description |
|---|---|
| `dict` | Keyword arguments (`names` and `colspecs`) for `pandas.read_fwf()`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | When the file cannot be parsed, no tables are found, or … |
Example:

    >>> params = get_fwf_params("Bestandsbeschrijving_1cyferho_2023_v1.1.txt")
    >>> params["names"][:3]
    ['Onderwijstype HO', 'BRIN-nummer', 'Opleidingscode (CROHO)']
    >>> params["colspecs"][:3]
    [(0, 3), (3, 7), (7, 12)]
### `eencijferho.core.extractor.list_fwf_tables(txt_file)`
List the table names found in a DUO bestandsbeschrijving `.txt` file. Useful for discovering which tables are available before calling `get_fwf_params` with a specific `table_index`:

    >>> list_fwf_tables("Bestandsbeschrijving_1cyferho_2023_v1.1.txt")
    ['1CijferHO 2023']
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `txt_file` | `str` | Path to the DUO bestandsbeschrijving `.txt` file. | required |
Returns:

| Type | Description |
|---|---|
| `list[str]` | List of table titles, in order. Returns an empty list when the file cannot be read or contains no tables. |
## Validation
### `eencijferho.utils.extractor_validation.validate_metadata_folder(storage, metadata_folder='data/00-metadata', return_dict=False)`
Validates all Excel files in a metadata folder and returns a summary of results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata_folder` | `str` | Path to the folder containing Excel metadata files. | `'data/00-metadata'` |
| `return_dict` | `bool` | If `True`, returns a dictionary with validation results for each file. If `False`, returns `None`. | `False` |
Returns:

| Type | Description |
|---|---|
| `dict[str, dict[str, Any]] \| None` | Dictionary mapping file names to their validation results if `return_dict` is `True`, otherwise `None`. |
Edge Cases
- If no Excel files are found, prints a warning and returns empty dict or None.
- Handles and logs errors for individual files without stopping the batch process.
- Saves both timestamped and latest logs in the log folder.
Example:

    results = validate_metadata_folder('data/00-metadata', return_dict=True)
    for fname, res in results.items():
        print(fname, res)
### `eencijferho.utils.converter_validation.converter_validation(storage, conversion_log_path='data/00-metadata/logs/(5)_conversion_log_latest.json', matching_log_path='data/00-metadata/logs/(4)_file_matching_log_latest.json', output_log_path='data/00-metadata/logs/(6)_conversion_validation_log_latest.json')`
Validates that row counts in the matching log match the total lines in the conversion log for each processed file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `conversion_log_path` | `str` | Path to the conversion log JSON file. | `'data/00-metadata/logs/(5)_conversion_log_latest.json'` |
| `matching_log_path` | `str` | Path to the file matching log JSON file. | `'data/00-metadata/logs/(4)_file_matching_log_latest.json'` |
| `output_log_path` | `str` | Path to save the output validation log. | `'data/00-metadata/logs/(6)_conversion_validation_log_latest.json'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary containing the validation summary: total files, successful and failed conversions, and per-file details. |
Edge Cases
- If a file is present in the matching log but not in the conversion log, it is ignored.
- Only files with status 'success' in the conversion log are validated.
- If row counts do not match, the file is marked as failed with an error message.
- Handles missing or empty details gracefully.
Example:

    results = converter_validation()
    print(results["successful_conversions"])
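The row-count comparison and the edge cases listed above can be sketched as a small stand-alone function. The log field names used here (`status`, `total_lines`) and the log shapes are assumptions for illustration; the real log schema may differ:

```python
from typing import Any

def check_row_counts(matching_log: dict[str, int],
                     conversion_log: dict[str, dict[str, Any]]) -> dict[str, str]:
    # Compare expected row counts from the matching log against the
    # line totals recorded in the conversion log.
    results: dict[str, str] = {}
    for fname, expected_rows in matching_log.items():
        entry = conversion_log.get(fname)
        if entry is None:
            continue  # present in the matching log only: ignored
        if entry.get("status") != "success":
            continue  # only successful conversions are validated
        ok = entry.get("total_lines") == expected_rows
        results[fname] = "success" if ok else "failed"
    return results

summary = check_row_counts(
    {"EV2023.asc": 100, "extra.asc": 5},
    {"EV2023.asc": {"status": "success", "total_lines": 100}},
)
```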
### `eencijferho.utils.converter_match.match_files(storage, input_folder, log_path='data/00-metadata/logs/(3)_xlsx_validation_log_latest.json')`
Match input files with metadata files and log the results.
Special matching rules:

- Files starting with "EV" match files containing "1cyferho".
- Files containing "VAKHAVW" match files containing "Vakgegevens".
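The two special rules can be restated as a predicate. This is a hypothetical helper, not the library's actual implementation, and the general (non-special) matching logic is omitted:

```python
def special_match(input_name: str, metadata_name: str) -> bool:
    # Rule 1: "EV" input files pair with "1cyferho" metadata files.
    if input_name.startswith("EV"):
        return "1cyferho" in metadata_name
    # Rule 2: VAKHAVW input files pair with "Vakgegevens" metadata files.
    if "VAKHAVW" in input_name:
        return "Vakgegevens" in metadata_name
    return False
```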
## Output helpers
### `eencijferho.utils.compressor.convert_csv_to_parquet(storage, input_dir=None)`

### `eencijferho.utils.encryptor.encryptor(storage, input_dir=None, output_dir=None)`

### `eencijferho.utils.converter_headers.convert_csv_headers_to_snake_case(storage, input_dir=None, delimiter=';', encoding='utf-8', quote_char='"', infer_schema_length=0)`
Convert all CSV file headers in the input directory to snake_case.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_dir` | `str \| None` | Path to the directory containing CSV files. | `None` |
| `delimiter` | `str` | CSV delimiter. | `';'` |
| `encoding` | `str` | File encoding. | `'utf-8'` |
| `quote_char` | `str` | Quote character to use. Use `""` to disable quoting. | `'"'` |
| `infer_schema_length` | `int \| None` | Number of rows to scan for schema inference; `None` scans all rows. | `0` |
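The header transformation itself can be illustrated with a small stand-alone function. The exact casing rules the library applies are not documented here, so this is a sketch under assumptions, not its actual implementation:

```python
import re

def to_snake_case(header: str) -> str:
    # Replace runs of non-alphanumeric characters with underscores,
    # insert an underscore at lowercase-to-uppercase boundaries,
    # then lowercase the result.
    s = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip())
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", s)
    return s.strip("_").lower()
```

For example, `to_snake_case("Opleidingscode (CROHO)")` yields `"opleidingscode_croho"` and `to_snake_case("BRIN-nummer")` yields `"brin_nummer"`.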
## Configuration

### `eencijferho.config.OutputConfig` (dataclass)
Controls which output variants the pipeline produces.
Attributes:

| Name | Type | Description |
|---|---|---|
| `variants` | `list[str]` | Which decoded variants to create. Supported values: … |
| `formats` | `list[str]` | Extra output formats. |
| `encrypt` | `bool` | When `True`, sensitive columns (e.g. BSN) are encrypted. |
| `column_casing` | `str` | Header style applied to all output CSV/Parquet files. |
| `convert_ev` | `bool` | When `True` (default), EV main data files are converted from fixed-width to CSV. |
| `convert_vakhavw` | `bool` | When `True` (default), VAKHAVW main data files are converted from fixed-width to CSV. |
| `decode_columns` | `list[str] \| None` | Column names to decode via `Dec_*` lookup tables. |
| `enrich_variables` | `list[str] \| None` | Variable names to enrich via `variable_metadata` labels. |
Example (CSV-only, no encryption, no header rename):

    OutputConfig(variants=["decoded"], formats=[], encrypt=False, column_casing="none")
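Putting the attribute table together, a minimal stand-in for the dataclass might look as follows. Only `convert_ev` and `convert_vakhavw` defaulting to `True` is stated in the table; the other defaults here are assumptions taken from the example above, and the class name is hypothetical:

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class OutputConfigSketch:
    # Hypothetical stand-in mirroring the documented attributes of
    # eencijferho.config.OutputConfig; not the real class.
    variants: list[str] = field(default_factory=lambda: ["decoded"])  # assumed default
    formats: list[str] = field(default_factory=list)                  # assumed default
    encrypt: bool = False                                             # assumed default
    column_casing: str = "none"                                       # assumed default
    convert_ev: bool = True        # documented default: True
    convert_vakhavw: bool = True   # documented default: True
    decode_columns: list[str] | None = None
    enrich_variables: list[str] | None = None

cfg = OutputConfigSketch(variants=["decoded"], formats=[],
                         encrypt=False, column_casing="none")
```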