Fairway Data Ingestion
fairway is a portable, scalable data ingestion framework designed for sustainable management of centralized research data.
Core Philosophy
Traditional data ingestion often suffers from undocumented transformations, rigid pipelines, and difficult-to-scale infrastructure. fairway addresses these pain points by being:
- Config-Driven: Define your pipeline in YAML, not just code.
- Engine-Agnostic: Switch from local DuckDB processing to distributed PySpark on Slurm with a single config change.
- HPC-Ready: Native Slurm integration with automatic Spark cluster provisioning.
- Validation-First: Multi-level sanity and distribution checks are baked into the pipeline.
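As a rough illustration of the config-driven, engine-agnostic idea above, the sketch below shows how one config value might select the backend while the rest of the pipeline definition stays the same. It is not fairway's actual API or schema: the `engine`, `source`, and `validations` keys, the `run_pipeline` function, and both backend stubs are hypothetical names made up for this example, and the config is a plain Python dict standing in for what would be a YAML file.

```python
from typing import Callable, Dict

# Hypothetical pipeline config. In practice this would live in YAML;
# the keys below are illustrative and NOT fairway's real schema.
CONFIG: Dict[str, object] = {
    "engine": "duckdb",              # assumed key: which backend to use
    "source": "data/raw/*.csv",      # assumed key: input location
    "validations": ["row_count"],    # assumed key: checks to run
}


def run_duckdb(source: str) -> str:
    """Stub standing in for a local DuckDB backend (no real DuckDB calls)."""
    return f"duckdb: processed {source} locally"


def run_pyspark(source: str) -> str:
    """Stub standing in for a distributed PySpark backend (no real Spark calls)."""
    return f"pyspark: processed {source} on a cluster"


# Registry mapping the config value to a backend function.
ENGINES: Dict[str, Callable[[str], str]] = {
    "duckdb": run_duckdb,
    "pyspark": run_pyspark,
}


def run_pipeline(config: Dict[str, object]) -> str:
    """Dispatch to the backend named in the config; reject unknown engines."""
    engine = str(config["engine"])
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    return ENGINES[engine](str(config["source"]))


if __name__ == "__main__":
    # Same pipeline definition, two engines: only the "engine" value differs.
    print(run_pipeline(CONFIG))
    print(run_pipeline({**CONFIG, "engine": "pyspark"}))
```

The point of the registry is that the pipeline definition never references a specific engine, so moving from a laptop to a cluster only means changing one value in the config.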
Where to Start?
- Getting Started: Install fairway and run your first pipeline.
- Architecture: Understand the underlying pipeline lifecycle and tech stack.
- Configuration Guide: Learn how to define sources, metadata, and validations.
- Custom Transformations: Extend the pipeline with your own processing logic.
- Manifest & Caching: Learn how fairway tracks processed files to enable incremental processing.