Operations Runbook

Deployment procedures, monitoring, troubleshooting, and rollback procedures for Fairway pipelines.

Deployment Procedures

Local Deployment

Install Fairway

pip install "git+https://github.com/DISSC-yale/fairway.git#egg=fairway[duckdb]"

Initialize Project
```
fairway init my_project
cd my_project
```
Configure Pipeline
Edit config/fairway.yaml with your data sources
Set engine: duckdb for local execution
Run Pipeline
```
fairway run
```

HPC/Slurm Deployment

Install with Spark Support

pip install "git+https://github.com/DISSC-yale/fairway.git#egg=fairway[spark]"

Initialize Project

fairway init my_project --engine spark
cd my_project

Configure Resources (config/spark.yaml)

nodes: 2
cpus_per_node: 32
mem_per_node: "200G"
account: "your_slurm_account"
partition: "day"
time: "24:00:00"

Build Container (recommended for reproducibility)
```
fairway build
```
Submit Pipeline
```
fairway submit --with-spark
```

Container Deployment

Build Container Image

# Apptainer (preferred for HPC)
fairway build --apptainer

# Docker (alternative)
fairway build --docker

Pull Pre-built Image
```
fairway pull
```

Run Inside Container

fairway shell
# Inside container:
fairway run

Monitoring

Job Status

# Check running jobs
fairway status

# View detailed job info (Slurm)
squeue -u $USER --format="%.18i %.9P %.30j %.8T %.10M %.6D %R"

# Check Spark cluster status
fairway spark status

Logs

# View all logs
fairway logs

# View last 20 entries
fairway logs --last 20

# View errors only
fairway logs --errors

# Filter by batch ID
fairway logs --batch claims_CT_2023

# Raw JSON output for advanced filtering
fairway logs --json | jq 'select(.level == "ERROR")'

Manifest (Processed Files)

# Show manifest entries
fairway manifest show

# Query by status
fairway manifest query --status completed
fairway manifest query --status failed

# Query by batch
fairway manifest query --batch claims_2023

Resource Usage

# Slurm job efficiency
seff <JOB_ID>

# Real-time resource monitoring
sstat -j <JOB_ID> --format=JobID,MaxRSS,MaxVMSize,AveCPU

Common Issues and Fixes

Issue: Pipeline Fails with Exit Code 115

Cause: Data integrity violation (RULE-115) - schema mismatch detected.

Solution: 1. Check for new columns in source data:

fairway generate-schema data/raw/new_file.csv

2. Update schema in fairway.yaml or switch to Delta Lake format:

storage:
  format: "delta"  # Enables schema evolution

Issue: Out of Memory (OOM) Errors

Cause: Data too large for allocated resources.

Solution: 1. Increase memory allocation:

fairway submit --with-spark --mem 64G

2. Enable salting for skewed data:

performance:
  salting: true

3. Reduce parallelism:

performance:
  target_file_size_mb: 256  # Larger files = fewer tasks

Issue: Spark Cluster Won't Start

Cause: Network/connectivity issues, resource unavailability.

Solution: 1. Check Slurm node availability:

sinfo -p day

2. Verify SPARK_LOCAL_IP:

export SPARK_LOCAL_IP=$(hostname -i)

3. Check firewall/port access on compute nodes

Issue: Files Not Being Processed

Cause: Files already in manifest as "completed".

Solution: 1. Check manifest status:

fairway manifest query --status completed

2. Reset specific files for reprocessing:

fairway manifest reset --file "data/raw/claims_CT_2023.csv"

3. Clear entire cache:

fairway cache clear

Issue: Java Not Found

Cause: JAVA_HOME not set or Java not installed.

Solution: 1. Load Java module (HPC):

module load Java/11
export JAVA_HOME=$JAVA_HOME

2. Install Java locally:

# macOS
brew install openjdk@11
export JAVA_HOME=/opt/homebrew/opt/openjdk@11

Issue: Container Mount Fails (Bind Path Not Found)

Cause: Apptainer trying to bind a path that doesn't exist on your cluster (e.g., /vast on a non-Yale cluster).

Error:

FATAL: container creation failed: mount hook function failure: mount /vast->/vast error: while mounting /vast: mount source /vast doesn't exist

Solution: 1. Set the correct bind path for your cluster:

# Environment variable (temporary)
export FAIRWAY_BINDS="/scratch/$USER,/gpfs/data"
fairway submit --with-spark

Or add to spark.yaml (permanent):
```
apptainer_binds: "/scratch,/gpfs"
```
Common cluster paths:
Yale: /vast
Grace/Farnam: /gpfs/ycga
Generic: /scratch, /project, /home

Issue: Container Build Fails

Cause: Missing dependencies or disk space.

Solution: 1. Eject and customize container definition:

# Eject everything
fairway eject

# Or eject just container files
fairway eject --container

# Or eject to custom directory
fairway eject --output custom/

2. Build with verbose output:

apptainer build --debug fairway.sif Apptainer.def

3. Try Docker fallback:

fairway build --docker

Issue: Preprocessed Files Not All Ingested

Cause: When multiple archives contain files with the same basename, older versions of Fairway could lose track of some files in the manifest.

Solution: 1. Clear the table manifest to force reprocessing:

rm manifest/<table_name>.json

2. Verify all extracted files exist in the preprocessing scratch dir:

ls -R /path/to/scratch/fairway/<table_name>_v1/

3. Re-run the pipeline — Fairway now uses the preprocessing output directory as the manifest root, producing unique keys for files from different archives.

Issue: Slow Pipeline Performance

Cause: Data skew, suboptimal file sizes, or insufficient resources.

Solution: 1. Enable salting for skewed partition keys:

performance:
  salting: true
  target_rows: 500000

2. Tune file sizes:

performance:
  target_file_size_mb: 128  # ~128MB files

3. Use scratch storage for intermediate writes:

storage:
  scratch_dir: "/scratch/$USER/fairway"

Rollback Procedures

Rollback Pipeline Run

Identify Failed Batch

fairway logs --errors
fairway manifest query --status failed

Reset Failed Files

fairway manifest reset --batch <BATCH_ID>

Restore Previous Output (if needed)

# If using Delta Lake, use time travel
# Otherwise, restore from backup
cp -r data/final.backup/* data/final/

Re-run Pipeline

fairway run  # Will only process reset files

Rollback Schema Changes

Identify Schema Version

ls -la config/*_schema.yaml
git log --oneline config/

Restore Previous Schema

git checkout HEAD~1 -- config/my_schema.yaml

Clear and Reprocess
```
fairway cache clear
fairway run
```

Cancel Running Jobs

# Cancel specific job
fairway cancel <JOB_ID>

# Cancel all your jobs (requires confirmation)
fairway cancel --all

# Force cancel without confirmation
scancel -u $USER

Health Checks

Pre-Run Checklist

[ ] Config file exists and is valid: fairway run --dry-run
[ ] Source data accessible: ls data/raw/
[ ] Sufficient disk space: df -h data/
[ ] Java available (for Spark): java -version
[ ] Container built (if using): ls *.sif

Post-Run Validation

[ ] Check exit code: echo $? (0 = success)
[ ] Review logs for errors: fairway logs --errors
[ ] Verify output files: ls data/final/
[ ] Check manifest status: fairway manifest show
[ ] Validate row counts match expectations

Exit Codes Reference

Code	Meaning	Action
0	Success	None required
1	General error	Check logs for details
2	Configuration error	Validate YAML syntax and required fields
115	Data integrity error (RULE-115)	Schema mismatch - update schema or use Delta Lake

Contact and Escalation

For issues not covered in this runbook: 1. Check existing GitHub issues 2. Open a new issue at https://github.com/DISSC-yale/fairway/issues 3. Include: error message, logs, config (sanitized), and steps to reproduce