Aside from the stricter standards that are established and checked by the package, there are many soft standards to align with when building up a source project.
Naming
There are a few general guidelines for naming files:
- Only use portable characters: Best to stick to a-z0-9_-, and certainly never use ":".
- Keep names short: There is a total path length limit on Windows, so avoid long file names, especially when deeply nested within directories. Avoid duplicating information in the path (e.g., instead of category/category_data.csv, use category/data.csv).
- Avoid new files: When a file represents the same thing (e.g., the result of a download from a given source), it should keep the same name, as opposed to having dates or version numbers appended to the name. New versions of files should overwrite previous versions, potentially after being merged, depending on the data source. Versions of files are retained in the git tree, rather than in separate files.
And there are similar considerations when naming variables:
- Best to stick to a limited set of characters (a-z0-9_).
- Keep lengths minimal, while still being identifiable – you should be able to tell what the variable means from the name, but complete information should be stored in the measure info entry. For instance, only include subset or value-related information if there are multiple variants (e.g., value_count and value_percent).
- Make names unique across source projects. This means including enough relevant source information. The source may implicitly include information about the value, so this should be kept out of the name, and only made explicit in the measure info.
Compression
Almost any data file that can be compressed should be compressed.
The main reason not to compress a file is if it is meant for viewing, rather than being read in.
Gzip is the most portable type of compression, and files that are gzip-compressed can be read in from a URL, rather than needing to be downloaded and read in separately. This makes gzip good for the standard output files.
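For example, a gzip-compressed standard file can be read directly from its URL (a hypothetical one here), with no separate download step:
# gzcon() decompresses the gzip stream as it is downloaded
data <- read.csv(gzcon(url("https://example.com/data.csv.gz")))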
LZMA (xz) generally results in smaller files, so it may be best for raw files.
Parquet files default to snappy compression, but can also use gzip. Gzip generally results in smaller files, but is slightly less readily usable in browsers, so it may be best to use snappy if files are meant for the web, and use gzip otherwise.
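For instance, compression can be set explicitly when writing with the arrow package (a sketch; file paths are illustrative):
# snappy for files meant to be read in the browser
arrow::write_parquet(data, "site/data.parquet", compression = "snappy")
# gzip for smaller files elsewhere
arrow::write_parquet(data, "data.parquet", compression = "gzip")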
The vroom package, among others, automatically
compresses when writing, and decompresses when reading, based on the
file name:
data <- vroom::vroom("data.csv.xz")
vroom::vroom_write(data, "data.csv.xz", ",")
Some standard functions (like read.csv) now automatically decompress, but do not automatically compress, so a connection must be used when writing:
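# for example (illustrative file names): reading decompresses automatically
data <- read.csv("data.csv.gz")
# but writing does not compress, so pass a compressed connection
write.csv(data, gzfile("data.csv.gz"), row.names = FALSE)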
If a function doesn’t automatically handle compression extensions,
but does accept a connection, you can use the gzfile
function across compression types to read:
data <- arrow::read_csv_arrow(gzfile("data.csv.xz"))
Scripts
All automated scripts must be able to run on a fresh remote machine.
Packages used within the script will be available as long as the
project's renv.lock file is up to date (potentially
updated with dcf_update_lock). But no absolute file paths,
and no relative paths outside of the project, should be used.
Scripts are run separately, each in its own environment, so no information is passed between them automatically. If you need to pass information between scripts, write it to a file within the project that both scripts can access.
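A minimal sketch of passing information through a file (the path and object name here are hypothetical):
# in the first script: save an intermediate result inside the project
saveRDS(intermediate, "working/intermediate.rds")
# in a later script: read the result back in
intermediate <- readRDS("working/intermediate.rds")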
Regulating Runs
Within Scripts
Within scripts, you might want to make complete re-running depend on the state of the original data (e.g., the date the data were last updated). The state you have available will depend on the source, but once you have a state value, you can store it in the project's process file to refer to between runs.
The most general state would be the hash of the raw files, so an
ingest.R file might look like this:
# calculate the raw state
raw_state <- as.list(tools::md5sum(list.files(
  "raw",
  recursive = TRUE,
  full.names = TRUE
)))

# read the project's process file
process <- dcf::dcf_process_record()

# process raw only if state has changed
if (!identical(process$raw_state, raw_state)) {
  # some code to read raw files and write standard files

  # write the new raw state to the project's process file
  process$raw_state <- raw_state
  dcf::dcf_process_record(updated = process)
}
This state value isn't ideal, since you would need to re-download the raw files to calculate it. Better to use some form of state ID provided by the source (such as a file hash or update date).
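For example, if the source is a file behind a static URL, its Last-Modified header can serve as a cheap state ID (a sketch; the URL is hypothetical, and not every server sends this header):
# request only the headers of the source file
headers <- curlGetHeaders("https://example.com/source/data.csv")
# use the Last-Modified header, if present, as the state value
raw_state <- grep("^last-modified:", headers, ignore.case = TRUE, value = TRUE)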
Within Source Projects
In source projects, the process.json file has two fields that can be used to control when a script is run:
- If manual is true, the script will be skipped when run from dcf_build – it will only run from dcf_process. This may be useful if the script depends on local resources or a manual process, so you only want it to run locally.
- If frequency is not 0, the script will only run every frequency days. This may be useful if your build process is run frequently, but you know a particular script will need to be run less frequently (e.g., if the data source only updates once a year).
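For illustration only, a script's entry in process.json might look something like this (the exact schema is defined by the dcf package, so treat the field layout shown here as an assumption):
{
  "ingest.R": {
    "manual": false,
    "frequency": 365
  }
}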