Skip to content

Contributing Guide

Development workflow, environment setup, and testing procedures for Fairway.

Prerequisites

  • Python 3.10+
  • Java 8+ (for PySpark/Spark features)
  • Apptainer/Singularity (for container builds, optional)

Environment Setup

1. Clone the Repository

git clone https://github.com/DISSC-yale/fairway.git
cd fairway

2. Create Virtual Environment

python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows

3. Install Dependencies

# Core installation (lightweight)
pip install -e .

# With DuckDB support (recommended for local dev)
pip install -e ".[duckdb]"

# With PySpark support
pip install -e ".[spark]"

# With all dependencies
pip install -e ".[all]"

# For documentation
pip install -e ".[docs]"

Project Structure

fairway/
├── src/fairway/           # Main package
│   ├── cli.py             # CLI entry points
│   ├── data/              # Container definitions, Makefile
│   └── ...
├── tests/                 # Test suite
│   └── fixtures/          # Test data
├── docs/                  # Documentation (MkDocs)
├── config/                # Example configurations
└── pyproject.toml         # Project metadata and dependencies

Available Scripts

From pyproject.toml:

Command Description
fairway Main CLI entry point (fairway.cli:main)

From src/fairway/data/Makefile (run from project root after fairway init):

Command Description
make help Show all available commands
make run Run full pipeline locally (ingest + summary)
make ingest Run ingestion only (skip summary)
make summary Generate summary stats locally (requires JAVA_HOME)
make submit Submit ingestion to Slurm with Spark cluster
make summary-hpc Submit summary generation to Slurm
make shell Open shell in container
make build Build container image
make schema Generate schema from raw data
make schema-hpc Submit schema generation to Slurm
make clean Remove logs and cache
make shell-dev Open shell with local src/ mounted (no rebuild)
make run-dev Run pipeline with local src/ mounted (no rebuild)

Dependencies

Core Dependencies

  • click - CLI framework
  • pyyaml - YAML configuration parsing
  • pandas - Data manipulation
  • tabulate - Table formatting

Optional Dependencies

Extra Packages Use Case
spark pyspark, delta-spark, pyarrow Distributed processing on Slurm
duckdb duckdb, pyarrow Local development, smaller datasets
redivis redivis, pyarrow Redivis data export
test-data-gen numpy Generating test datasets
docs mkdocs-material Building documentation
all All of the above Full installation

Testing

Running Tests

# Run all tests
pytest

# Run only local tests (no Spark required)
pytest -m local

# Run Spark tests (requires Java)
pytest -m spark

# Run with coverage
pytest --cov=fairway --cov-report=html

Test Markers

Marker Description
local Runs without Spark (DuckDB only)
spark Requires PySpark + Java
hpc Requires SLURM cluster

Test Data

Generate test datasets for development:

# Small partitioned CSV dataset
fairway generate-data --size small --partitioned

# Large Parquet dataset
fairway generate-data --size large --no-partitioned --format parquet

Development Workflow

1. Create a Feature Branch

git checkout -b feature/my-feature

2. Make Changes

  • Write tests first (TDD encouraged)
  • Follow existing code patterns
  • Update documentation if adding features

3. Run Tests

pytest -m local  # Quick feedback
pytest           # Full test suite before PR

4. Build Documentation

mkdocs serve  # Preview at http://localhost:8000
mkdocs build  # Build static site

5. Submit Pull Request

  • Ensure all tests pass
  • Update CHANGELOG if applicable
  • Request review

Environment Variables

Variable Description Default
FAIRWAY_BINDS Additional Apptainer bind paths (comma-separated) Auto-detected from config
FAIRWAY_TEMP Temporary directory for large operations System temp
REDIVIS_API_TOKEN API token for Redivis data export None
SPARK_LOCAL_IP Spark driver bind address Auto-detect
PYSPARK_SUBMIT_ARGS Additional Spark submit arguments Auto-configured

HPC Bind Paths

When running on HPC clusters, you may need to set FAIRWAY_BINDS to include your cluster's shared storage:

# For clusters using /scratch
export FAIRWAY_BINDS="/scratch/$USER"

# For clusters with multiple storage paths
export FAIRWAY_BINDS="/gpfs/data,/project/mygroup"

Alternatively, set apptainer_binds in your spark.yaml:

apptainer_binds: "/scratch,/gpfs"

Code Style

  • Follow PEP 8 guidelines
  • Use type hints where practical
  • Docstrings for public APIs
  • Keep functions focused and testable

Troubleshooting

Java Not Found (Spark tests)

# macOS
brew install openjdk@11
export JAVA_HOME=/opt/homebrew/opt/openjdk@11

# Linux
sudo apt install openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

DuckDB Import Errors

pip install --upgrade duckdb pyarrow

Container Build Issues

Ensure Apptainer is installed:

apptainer --version
# or build with Docker fallback:
fairway build --docker