Validate the source data


Background and Strategy

Once the source data are loaded, their contents should be reviewed for basic data integrity issues.

The purpose of this task is not to scrutinize the business or clinical content of the source data (that will happen later), but rather to surface common problems introduced during source data extraction and acquisition: columns that are missing (or contain only NULLs), formatting problems, files that were meant to contain one type of record but clearly contain something else, many more or fewer records than expected, incorrect coverage periods, and so on.

This activity is something of an extension of the prior task, Determine the grain size of each source data object. Outstanding questions from that investigation should be compiled here, along with other data integrity concerns.

Issues raised at this stage should be directed back to the source data subject matter experts.

Key Diagnostics / Heuristics

  1. For Import objects, do the names of the files loaded match expectations?
  2. Do the total record counts for each source data object match expectations? (Checks 2 through 6 can also be scripted; see the sketch after this list.)
  3. Do the count and names of columns match expectations?
  4. Do any columns that would be expected to always or nearly always be populated contain only NULL?
  5. Do the minimum and maximum values in columns containing numeric, standardized, or coded values (such as dates or dollar amounts) represent appropriate values?
  6. Does the grain size of each file match expectations?
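
Several of these checks can be scripted once a source object has been exported or loaded into a dataframe. The sketch below is illustrative only: the file name, column names, and expected bounds are hypothetical placeholders, and the expectations themselves should come from your source data documentation or subject matter experts.

```python
import pandas as pd

# Hypothetical expectations for one source object; maintain a set per object.
EXPECTED_COLUMNS = {"claim_id", "member_id", "service_date", "paid_amount"}
EXPECTED_ROW_RANGE = (900_000, 1_100_000)   # rough bounds, not an exact count
GRAIN_KEYS = ["claim_id"]                   # expected grain: one row per claim

df = pd.read_csv("claims_2023.csv", parse_dates=["service_date"])

# Check 2: total record count within expectations?
n = len(df)
assert EXPECTED_ROW_RANGE[0] <= n <= EXPECTED_ROW_RANGE[1], f"unexpected row count: {n}"

# Check 3: count and names of columns match expectations?
missing = EXPECTED_COLUMNS - set(df.columns)
extra = set(df.columns) - EXPECTED_COLUMNS
print("missing columns:", missing or "none", "| unexpected columns:", extra or "none")

# Check 4: any columns that are entirely NULL?
all_null = [c for c in df.columns if df[c].isna().all()]
print("all-NULL columns:", all_null or "none")

# Check 5: are the min/max of key date and numeric columns plausible?
print("service_date range:", df["service_date"].min(), "to", df["service_date"].max())
print("paid_amount range:", df["paid_amount"].min(), "to", df["paid_amount"].max())

# Check 6: grain check -- are the expected grain keys actually unique?
dupes = df.duplicated(subset=GRAIN_KEYS).sum()
print(f"rows violating expected grain ({GRAIN_KEYS}): {dupes}")
```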

Detailed Implementation Guidance

  1. If source data are loaded via an Import object, the ELT History panel shows the details of each file loaded, including the file name, the Source Data Effective Date, and the row counts (both the count of rows added in an incremental load and the resulting total row count). This information is especially useful in validating large batches of incrementally loaded source files.

  2. Use the Case Review feature (accessible in the upper right corner of Import or Registered Table object pages) to review the records in each object.

  3. In Case Review, use the Calculate aggregate statistics feature (by clicking the "information" icon found in the header of each column) to generate, for relevant fields, the minimum value, maximum value, and percentage of records with a NULL value.
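
Case Review's aggregate statistics are computed inside the platform. If you need the same per-column profile outside the platform (for example, to attach to an issue routed back to the source data subject matter experts), a rough offline analogue is easy to build. This generic sketch is not the platform's own implementation; it assumes the object is available as the dataframe `df` from the earlier example.

```python
import pandas as pd

def aggregate_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column min, max, and percent NULL -- a rough offline
    analogue of the Case Review aggregate statistics."""
    rows = {}
    for col in df.columns:
        s = df[col]
        try:
            lo, hi = s.min(), s.max()   # works for numeric, date, and string columns
        except TypeError:               # mixed types within a single column
            lo = hi = None
        rows[col] = {
            "min": lo,
            "max": hi,
            "pct_null": round(s.isna().mean() * 100, 2),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

print(aggregate_statistics(df))
```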

