Ingestion Options
Fairway provides flexible options for handling messy or non-standard data sources.
Read Options (read_options)
You can pass engine-specific options directly to the underlying reader (DuckDB or Spark) using read_options.
Common Use Cases
Headerless CSVs
If your file has no header, you must provide a schema and set header: false. Fairway will map the schema column names to the file.
sources:
- name: "no_header_data"
path: "data/raw_data.csv"
format: "csv"
schema:
id: "INTEGER"
value: "DOUBLE"
read_options:
header: false
Custom Delimiters
read_options:
delim: "|"
quote: "'"
Preprocessing (preprocess)
Fairway can handle compressed or complex files before ingestion using the preprocess hook.
Actions
- Built-in:
unzip(Extracts zip files). - Custom Script: Provide a path to a python file (e.g.,
scripts/parser.py).- The script must define a function
process(input_path: str) -> str. - It should return the path to the processed file/directory.
- The script must define a function
Global vs Per-File
scope: global: Runs ONCE for the source entry.scope: per_file: Runs for EACH file matching the path pattern.
Execution Modes
execution_mode: driver(Default): Runs locally.execution_mode: cluster: Distributes tasks to Spark Executors.- Requirement:
engine: pyspark. - Benefit: Massively parallel.
- Requirement:
Configuration Example
sources:
- name: "custom_etl"
path: "data/incoming/*.dat"
preprocess:
action: "scripts/decrypt.py"
scope: "per_file"
execution_mode: "cluster"
write_mode: "append"