Manifest & Caching
Fairway uses a manifest system to track processed files and avoid redundant work. This enables incremental processing where only changed files are reprocessed.
Directory Structure
project/
├── manifest/
│ ├── _global.json # Shared state (archive extractions)
│ ├── sales.json # Per-table: files, preprocessing, schema
│ └── customers.json
├── schema/
│ ├── sales.yaml # Inferred schema for each table
│ └── customers.yaml
Per-Table Manifests
Each table gets its own manifest file at manifest/<table_name>.json. This provides:
- Isolation: No merge conflicts when processing tables in parallel
- Clarity: Easy to inspect state for a specific table
- Granularity: Can selectively reset a single table's state
Table Manifest Structure
{
"version": "3.0",
"table_name": "sales",
"files": {
"2023/data.csv": {
"hash": "mtime:1234567890_size:1024",
"last_processed": "2024-01-15T10:30:00",
"status": "success",
"metadata": {"row_count": 50000}
}
},
"preprocessing": {
"archive.zip": {
"file_hash": "mtime:..._size:...",
"preprocessed_path": "/tmp/extracted/",
"action": "unzip"
}
},
"schema": {
"generated_at": "2024-01-15T10:00:00",
"output_path": "schema/sales.yaml",
"combined_hash": "abc123..."
}
}
Global Manifest
Shared state that spans multiple tables is stored in manifest/_global.json:
{
"version": "3.0",
"extractions": {
"archive:/path/to/shared_data.zip": {
"extracted_dir": ".fairway_cache/archives/shared_data_zip_a1b2c3d4",
"archive_hash": "mtime:..._size:...",
"extracted_at": "2024-01-15T09:00:00",
"used_by_tables": ["sales", "customers"]
}
}
}
This allows multiple tables to share the same archive extraction without redundant unzipping.
Change Detection
Fairway uses fast change detection based on file metadata:
- Fast check (default): Uses
mtime + sizesignature - Full hash: SHA-256 of file contents (slower, more reliable)
Files are only reprocessed when their hash changes.
Resetting State
To force reprocessing:
# Reset a single table
rm manifest/sales.json
# Reset all tables
rm -rf manifest/
# Reset extractions only
rm manifest/_global.json
File Key Generation
Manifest entries are keyed by relative file path. When files are under the configured root directory, the key is the relative path from that root (e.g., 2023/data.csv).
For preprocessed files (extracted from archives to a scratch directory), Fairway uses the preprocessing output directory (batch_dir) as the root for key generation. This ensures files with the same basename but from different archives get distinct keys:
# Two zips both contain "data.dat"
batch_dir/yale_1950a_xxx/data.dat → key: "yale_1950a_xxx/data.dat"
batch_dir/yale_1950b_xxx/data.dat → key: "yale_1950b_xxx/data.dat"
Schema Tracking
When using fairway discover-schema, schema staleness is tracked per-table. A schema is considered stale when:
- No previous schema exists
- Source file hashes have changed since last generation
This prevents unnecessary schema regeneration on repeated runs.