Configuration Guide
fairway pipelines are driven by YAML configuration files. This allows you to define data sources, metadata extraction, validations, and enrichments without writing pipeline code.
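A minimal configuration might look like the following. This is an illustrative sketch assembled from the fields documented in the sections below; the dataset and table names are placeholders.

```yaml
# config/fairway.yaml -- illustrative minimal pipeline config
dataset_name: "my_dataset"
engine: "duckdb"

storage:
  raw_dir: "data/raw"
  intermediate_dir: "data/intermediate"
  final_dir: "data/final"

tables:
  - name: "events"
    path: "data/raw/events_*.csv"
    format: "csv"
```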
Root Options
| Field | Description | Default |
|---|---|---|
| `dataset_name` | A unique identifier for the dataset. | Required |
| `engine` | Data processing engine (`duckdb` or `pyspark`). | `duckdb` |
| `partition_by` | List of columns to partition the output Parquet files by. | `[]` |
| `temp_location` | Global temporary location for file writes. | None |
Storage
The storage section defines where processed data is written.
```yaml
storage:
  raw_dir: "data/raw"
  intermediate_dir: "data/intermediate"
  final_dir: "data/final"
  # Optional: fast scratch storage for intermediate writes (HPC clusters)
  scratch_dir: "/scratch/$USER/fairway"
```
| Field | Description | Default |
|---|---|---|
| `raw_dir` | Directory for raw input files. | Required |
| `intermediate_dir` | Directory for intermediate processed files. | Required |
| `final_dir` | Directory for final output files. | Required |
| `scratch_dir` | Fast scratch storage for intermediate writes. Supports environment variables like `$USER`. Useful on HPC clusters where `intermediate_dir` may be on slow or quota-limited storage. | None |
Performance
The performance section controls optimization settings for the pipeline.
```yaml
performance:
  target_rows: 500000            # Rows per partition (for salting calculation)
  target_file_size_mb: 128       # Target Parquet file size in MB
  max_records_per_file: 1000000  # Direct control (overrides the heuristic)
  salting: false                 # Enable partition salting to prevent data skew
  compression: snappy            # Parquet compression codec
```
| Field | Description | Default |
|---|---|---|
| `target_rows` | Target number of rows per partition. Used when salting is enabled to calculate the number of salt buckets. | `500000` |
| `target_file_size_mb` | Target size for output Parquet files in megabytes. Uses a heuristic (~8,000 rows/MB) to estimate `maxRecordsPerFile`. Actual file size varies by data characteristics. | `128` |
| `max_records_per_file` | Direct control over Spark's `maxRecordsPerFile` option. Overrides the `target_file_size_mb` heuristic. Use when files are consistently too small or too large. | None |
| `salting` | Enable partition salting to prevent data skew. When enabled, adds a salt column to distribute data evenly across partitions. Only applies when `partition_by` is set. | `false` |
| `compression` | Parquet compression codec. Options: `snappy`, `gzip`, `zstd`. | `snappy` |
File Size Tuning
The target_file_size_mb option uses a heuristic to estimate how many rows fit in a given file size. This varies significantly based on:
- Column count: wide tables (100+ columns) → ~500-2,000 rows/MB; narrow tables (~10 columns) → ~5,000-20,000 rows/MB
- Data types: strings compress less than integers
If your output files are consistently the wrong size, use max_records_per_file for direct control:
```yaml
performance:
  # If files are too small (e.g., 20 MB instead of 128 MB), increase this:
  max_records_per_file: 2000000
```
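The heuristic amounts to a simple multiplication, illustrated below. This is a sketch, not fairway's actual implementation; the ~8,000 rows/MB figure is the documented default for typical tables.

```python
def estimate_max_records_per_file(target_file_size_mb, rows_per_mb=8000):
    """Estimate Spark's maxRecordsPerFile from a target file size.

    rows_per_mb is a rough default; wide tables may be closer to
    500-2,000 rows/MB and narrow tables 5,000-20,000 rows/MB.
    """
    return target_file_size_mb * rows_per_mb

# With the default 128 MB target, the heuristic yields 1,024,000 rows per file.
print(estimate_max_records_per_file(128))  # 1024000
```

If your data is far from the assumed density, the estimate will be off by the same factor, which is exactly when `max_records_per_file` is worth setting directly.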
When to Enable Salting
Salting is useful when:

- Your data has highly skewed partition keys (e.g., 90% of data in one partition)
- You're experiencing slow writes due to uneven data distribution
- You need to balance load across Spark executors
Note: Salting adds a salt column to your output data and requires a full data count operation, which can be expensive for very large datasets.
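Conceptually, salting looks like the sketch below: a bucket count derived from `target_rows` and a deterministic salt per partition key. This is illustrative only; fairway's actual salt derivation may differ.

```python
import math

def salt_bucket_count(total_rows, target_rows=500_000):
    # Enough salt buckets that each holds roughly target_rows rows.
    # This is why salting needs a full count of the dataset first.
    return max(1, math.ceil(total_rows / target_rows))

def salt_value(key, num_buckets):
    # Deterministic salt derived from the partition key; appending this
    # as a column spreads a skewed key across num_buckets partitions.
    return hash(key) % num_buckets

print(salt_bucket_count(2_000_000))  # 4 buckets at the default 500k target
```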
Tables
The tables section defines your data sources and how to process them.
```yaml
tables:
  - name: "provider_extract"
    path: "data/raw/provider_*.csv"
    root: "/data/shared"  # Optional: root directory for path resolution
    naming_pattern: "provider_(?P<state>[A-Z]{2})_(?P<date>\\d{8})\\.csv"
    format: "csv"
```
File Discovery
fairway uses glob to discover files matching path. If root is specified, the path is resolved relative to it.
Metadata Extraction
If a naming_pattern (Python regex) is provided, fairway extracts named groups from the filename and injects them as columns into the data. In the example above, a file named provider_CT_20230101.csv will have state='CT' and date='20230101' added to every row.
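The extraction step can be reproduced with Python's `re` module, using the pattern from the config above:

```python
import re

pattern = r"provider_(?P<state>[A-Z]{2})_(?P<date>\d{8})\.csv"
match = re.match(pattern, "provider_CT_20230101.csv")

# Named groups become the columns injected into every row of the file.
metadata = match.groupdict() if match else {}
print(metadata)  # {'state': 'CT', 'date': '20230101'}
```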
Fixed-Width Options
For format: "fixed_width", additional options are available:
| Field | Description | Default |
|---|---|---|
| `fixed_width_spec` | Path to a YAML spec defining column positions. | Required |
| `min_line_length` | Skip lines shorter than this (corrupted data). | None |
| `type_enforcement.on_fail` | `null` (TRY_CAST, default) or `strict` (CAST). | `null` |
See Fixed-Width Format for full documentation.
Preprocessing
Tables can define preprocessing steps to run before ingestion (e.g., extracting zip files, converting codebooks):
```yaml
tables:
  - name: "census_data"
    path: "data/raw/*.zip"
    format: "fixed_width"
    preprocess:
      action: "scripts/preprocess_ipums.py"  # Custom script or "unzip"
      scope: "per_file"                      # Process each matched file independently
      password_file: "/path/to/password.txt" # Optional: for encrypted zips
```
The preprocessing script must define a process_file(file_path, output_dir, **kwargs) function that extracts/transforms data and returns the path to the processed output.
Validations
fairway provides built-in data quality checks. Validations can be set globally and overridden per-table.
```yaml
# Global validations (inherited by all tables)
validations:
  min_rows: 100
  check_nulls: ["person_id"]

tables:
  - name: demographics
    path: "data/raw/demographics.csv"
    format: csv
    # Per-table override (shallow merge with global)
    validations:
      min_rows: 500
      check_range:
        year: { min: 1900, max: 2030 }
      check_pattern:
        fips_code: "^\\d{5}$"
      check_values:
        state: ["CT", "MA", "NY"]
```
Available checks: min_rows, max_rows, check_nulls, expected_columns, check_range, check_values, check_pattern.
See Validations for full documentation of each check type, severity/threshold support, and cross-engine behavior.
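As an illustration of what two of these checks mean, `check_pattern` and `check_range` amount to something like the following (a hedged sketch, not fairway's implementation; parameter names mirror the config keys):

```python
import re

def check_pattern(values, pattern):
    # True if every non-null value matches the regex exactly.
    return all(re.fullmatch(pattern, v) for v in values if v is not None)

def check_range(values, min=None, max=None):
    # True if every non-null value falls within the inclusive bounds.
    return all(
        (min is None or v >= min) and (max is None or v <= max)
        for v in values
        if v is not None
    )

print(check_pattern(["06001", "09003"], r"\d{5}"))       # True
print(check_range([1900, 1999, 2025], min=1900, max=2030))  # True
```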
output_layer
Controls where a table's pipeline stops:
```yaml
tables:
  - name: demographics
    output_layer: curated    # default: full pipeline (validate → processed → transform → curated)
  - name: reference_lookup
    output_layer: processed  # validate → processed, then stop; no transforms or curated write
```
| Value | Behavior |
|---|---|
| `curated` (default) | Full pipeline: validate → processed → transform → type-enforce → curated |
| `processed` | Validate → processed only; no transform or curated steps |
Note: output_layer: processed with a transformation specified is a config error.
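A config loader would typically reject that combination up front, along these lines (an illustrative sketch, not fairway's code):

```python
def validate_table_config(table):
    # output_layer: processed stops before the transform step, so a
    # transformation on the same table could never run -- treat as an error.
    if table.get("output_layer") == "processed" and table.get("transformation"):
        raise ValueError(
            f"table {table['name']!r}: output_layer 'processed' is "
            "incompatible with a transformation"
        )

validate_table_config({"name": "reference_lookup", "output_layer": "processed"})  # ok
```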
Enrichment
Enable built-in enrichments like geospatial processing:
```yaml
enrichment:
  geocode: true
```
Custom Transformations
If your data requires complex reshaping, you can point to a custom transformation script for each source.
```yaml
tables:
  - name: "sales"
    path: "data/raw/sales.csv"
    format: "csv"
    transformation: "src/transformations/sales_cleaner.py"
```
The pipeline will look for a class in the specified script that implements the transformation logic. Global transformations (under data.transformation) are deprecated.
Config Auto-Discovery
When running fairway run without specifying --config, fairway will automatically discover the config file:
- Scans the `config/` directory for `.yaml` or `.yml` files
- Excludes `*_schema.yaml` files and `spark.yaml`
- If exactly one config is found, uses it automatically
- If multiple configs exist, shows an error listing them; use `--config` to specify one
```bash
# Auto-discovers config/fairway.yaml (if it's the only config file)
fairway run

# Explicit config selection (required when multiple configs exist)
fairway run --config config/my_pipeline.yaml
```
Spark Cluster Config (spark.yaml)
For Slurm/PySpark execution, resource settings are configured in config/spark.yaml:
```yaml
# config/spark.yaml
nodes: 2
cpus_per_node: 32
mem_per_node: "200G"
account: "my_account"
partition: "day"
time: "24:00:00"

dynamic_allocation:
  enabled: true
  min_executors: 5
  max_executors: 150
  initial_executors: 15
```
CLI options (e.g., --mem 64G) override spark.yaml values.
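That precedence is a simple shallow merge, sketched below (illustrative; the key names mirror `spark.yaml`, and unset CLI options fall through to the file's values):

```python
def resolve_spark_settings(yaml_config, cli_options):
    # CLI options win over spark.yaml; None means "not given on the CLI".
    merged = dict(yaml_config)
    merged.update({k: v for k, v in cli_options.items() if v is not None})
    return merged

settings = resolve_spark_settings(
    {"mem_per_node": "200G", "cpus_per_node": 32},
    {"mem_per_node": "64G", "cpus_per_node": None},
)
print(settings)  # {'mem_per_node': '64G', 'cpus_per_node': 32}
```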
Slurm Submission
Submit jobs to Slurm using fairway submit:
```bash
# Basic submission (uses defaults from spark.yaml)
fairway submit

# With a Spark cluster
fairway submit --with-spark

# With custom resources
fairway submit --with-spark --mem 64G --cpus 8 --time 48:00:00
```
| Option | Default | Description |
|---|---|---|
| `--partition` | `day` | Slurm partition |
| `--time` | `24:00:00` | Time limit (HH:MM:SS) |
| `--mem` | `16G` | Memory per node |
| `--cpus` | `4` | CPUs per task |
| `--account` | From `spark.yaml` | Slurm account |
| `--with-spark` | False | Provision Spark cluster |
| `--dry-run` | False | Preview job script |