# Contributing Guide

Development workflow, environment setup, and testing procedures for Fairway.
## Prerequisites
- Python 3.10+
- Java 8+ (for PySpark/Spark features)
- Apptainer/Singularity (for container builds, optional)
## Environment Setup
### 1. Clone the Repository

```bash
git clone https://github.com/DISSC-yale/fairway.git
cd fairway
```
### 2. Create Virtual Environment

```bash
python -m venv .venv
source .venv/bin/activate    # Linux/macOS
# or: .venv\Scripts\activate # Windows
```
### 3. Install Dependencies

```bash
# Core installation (lightweight)
pip install -e .

# With DuckDB support (recommended for local dev)
pip install -e ".[duckdb]"

# With PySpark support
pip install -e ".[spark]"

# With all dependencies
pip install -e ".[all]"

# For documentation
pip install -e ".[docs]"
```
## Project Structure

```
fairway/
├── src/fairway/          # Main package
│   ├── cli.py            # CLI entry points
│   ├── data/             # Container definitions, Makefile
│   └── ...
├── tests/                # Test suite
│   └── fixtures/         # Test data
├── docs/                 # Documentation (MkDocs)
├── config/               # Example configurations
└── pyproject.toml        # Project metadata and dependencies
```
## Available Scripts

From `pyproject.toml`:

| Command | Description |
|---|---|
| `fairway` | Main CLI entry point (`fairway.cli:main`) |
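The `fairway` command maps to the `fairway.cli:main` entry point. In `pyproject.toml`, a declaration like this wires that up (a minimal sketch; the project's actual file contains more metadata):

```toml
[project.scripts]
fairway = "fairway.cli:main"
```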
From `src/fairway/data/Makefile` (run from project root after `fairway init`):

| Command | Description |
|---|---|
| `make help` | Show all available commands |
| `make run` | Run full pipeline locally (ingest + summary) |
| `make ingest` | Run ingestion only (skip summary) |
| `make summary` | Generate summary stats locally (requires `JAVA_HOME`) |
| `make submit` | Submit ingestion to Slurm with Spark cluster |
| `make summary-hpc` | Submit summary generation to Slurm |
| `make shell` | Open shell in container |
| `make build` | Build container image |
| `make schema` | Generate schema from raw data |
| `make schema-hpc` | Submit schema generation to Slurm |
| `make clean` | Remove logs and cache |
| `make shell-dev` | Open shell with local `src/` mounted (no rebuild) |
| `make run-dev` | Run pipeline with local `src/` mounted (no rebuild) |
## Dependencies

### Core Dependencies

- `click` - CLI framework
- `pyyaml` - YAML configuration parsing
- `pandas` - Data manipulation
- `tabulate` - Table formatting
### Optional Dependencies

| Extra | Packages | Use Case |
|---|---|---|
| `spark` | `pyspark`, `delta-spark`, `pyarrow` | Distributed processing on Slurm |
| `duckdb` | `duckdb`, `pyarrow` | Local development, smaller datasets |
| `redivis` | `redivis`, `pyarrow` | Redivis data export |
| `test-data-gen` | `numpy` | Generating test datasets |
| `docs` | `mkdocs-material` | Building documentation |
| `all` | All of the above | Full installation |
## Testing

### Running Tests

```bash
# Run all tests
pytest

# Run only local tests (no Spark required)
pytest -m local

# Run Spark tests (requires Java)
pytest -m spark

# Run with coverage
pytest --cov=fairway --cov-report=html
```
### Test Markers

| Marker | Description |
|---|---|
| `local` | Runs without Spark (DuckDB only) |
| `spark` | Requires PySpark + Java |
| `hpc` | Requires a Slurm cluster |
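Custom markers like these are normally registered with pytest so that unknown-mark warnings are avoided. A plausible `pyproject.toml` fragment (the project's actual registration may differ):

```toml
[tool.pytest.ini_options]
markers = [
    "local: runs without Spark (DuckDB only)",
    "spark: requires PySpark and Java",
    "hpc: requires a Slurm cluster",
]
```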
### Test Data

Generate test datasets for development:

```bash
# Small partitioned CSV dataset
fairway generate-data --size small --partitioned

# Large Parquet dataset
fairway generate-data --size large --no-partitioned --format parquet
```
## Development Workflow

### 1. Create a Feature Branch

```bash
git checkout -b feature/my-feature
```
### 2. Make Changes
- Write tests first (TDD encouraged)
- Follow existing code patterns
- Update documentation if adding features
### 3. Run Tests

```bash
pytest -m local   # Quick feedback
pytest            # Full test suite before PR
```
### 4. Build Documentation

```bash
mkdocs serve   # Preview at http://localhost:8000
mkdocs build   # Build static site
```
### 5. Submit Pull Request
- Ensure all tests pass
- Update CHANGELOG if applicable
- Request review
## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `FAIRWAY_BINDS` | Additional Apptainer bind paths (comma-separated) | Auto-detected from config |
| `FAIRWAY_TEMP` | Temporary directory for large operations | System temp |
| `REDIVIS_API_TOKEN` | API token for Redivis data export | None |
| `SPARK_LOCAL_IP` | Spark driver bind address | Auto-detect |
| `PYSPARK_SUBMIT_ARGS` | Additional Spark submit arguments | Auto-configured |
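As an illustration of the fall-back behavior for a variable like `FAIRWAY_TEMP`, a helper along these lines resolves the override-or-default logic (the function name is hypothetical, not part of Fairway's API):

```python
import os
import tempfile


def resolve_temp_dir() -> str:
    """Return FAIRWAY_TEMP if set and non-empty, else the system temp directory."""
    return os.environ.get("FAIRWAY_TEMP") or tempfile.gettempdir()
```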
### HPC Bind Paths

When running on HPC clusters, you may need to set `FAIRWAY_BINDS` to include your cluster's shared storage:

```bash
# For clusters using /scratch
export FAIRWAY_BINDS="/scratch/$USER"

# For clusters with multiple storage paths
export FAIRWAY_BINDS="/gpfs/data,/project/mygroup"
```

Alternatively, set `apptainer_binds` in your `spark.yaml`:

```yaml
apptainer_binds: "/scratch,/gpfs"
```
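Since the environment variable supplies *additional* bind paths, both sources are conceptually comma-separated lists merged into one deduplicated set. A hypothetical sketch of that merge (not Fairway's actual code):

```python
import os


def merge_bind_paths(config_binds: str = "") -> list[str]:
    """Combine comma-separated bind paths from FAIRWAY_BINDS and the config
    value, dropping empty entries and duplicates while preserving order."""
    raw = ",".join([os.environ.get("FAIRWAY_BINDS", ""), config_binds])
    merged: list[str] = []
    for path in (p.strip() for p in raw.split(",")):
        if path and path not in merged:
            merged.append(path)
    return merged
```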
## Code Style
- Follow PEP 8 guidelines
- Use type hints where practical
- Docstrings for public APIs
- Keep functions focused and testable
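An illustrative function following those conventions — type hints, a docstring, and a single focused responsibility (the example itself is hypothetical, not taken from Fairway):

```python
def summarize_columns(rows: list[dict[str, float]]) -> dict[str, float]:
    """Return the per-column mean of a list of numeric records.

    Args:
        rows: Records sharing the same keys, e.g. parsed CSV rows.

    Returns:
        Mapping of column name to the mean of its values.
    """
    if not rows:
        return {}
    totals: dict[str, float] = {}
    for row in rows:
        for key, value in row.items():
            totals[key] = totals.get(key, 0.0) + value
    return {key: total / len(rows) for key, total in totals.items()}
```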
## Troubleshooting

### Java Not Found (Spark tests)

```bash
# macOS
brew install openjdk@11
export JAVA_HOME=/opt/homebrew/opt/openjdk@11

# Linux
sudo apt install openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```

### DuckDB Import Errors

```bash
pip install --upgrade duckdb pyarrow
```

### Container Build Issues

Ensure Apptainer is installed:

```bash
apptainer --version

# or build with Docker fallback:
fairway build --docker
```