Generate a Source ID value for each source system
  • 01 Nov 2022

Background and Strategy

Ursa Studio requires that all objects containing raw source data – i.e., each Import Object or Registered Table Object – identify the Source ID of the originating source system. Assigning Source IDs thoughtfully can help data workers and consumers better understand where the data came from, and what freshness, completeness, and other characteristics they might expect.

More concretely, Source ID values are used in the data mastering process to generate values for data model key fields (i.e., those ending in “ID”, e.g., Patient ID, Provider ID, etc.) that are guaranteed to be unique. Therefore, distinct source systems with independent data-generating processes for key values (e.g., two systems that could both assign the same patient table primary key value to two different patients) should be assigned different Source ID values to prevent key value collisions once their data are combined.
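The collision risk can be illustrated with a minimal sketch. The function name, the "|" delimiter, the Source ID values "SYSA" and "SYSB", and the sample records are all assumptions made for illustration; they do not represent the key format Ursa Studio actually produces.

```python
# Hypothetical illustration only: the "|" delimiter and this helper are
# assumptions, not Ursa Studio's actual mastering logic.
def make_data_model_key(source_id: str, local_key: str, sep: str = "|") -> str:
    """Prepend the Source ID to a source-local key value."""
    return f"{source_id}{sep}{local_key}"

# Two independent systems that both assign primary key "1001",
# but to two different patients.
patient_a = {"source_id": "SYSA", "local_key": "1001", "name": "Patient A"}
patient_b = {"source_id": "SYSB", "local_key": "1001", "name": "Patient B"}

key_a = make_data_model_key(patient_a["source_id"], patient_a["local_key"])
key_b = make_data_model_key(patient_b["source_id"], patient_b["local_key"])

# The local keys collide, but the Source ID prefix keeps the combined keys distinct.
local_keys_collide = patient_a["local_key"] == patient_b["local_key"]
model_keys_collide = key_a == key_b
```

If both systems had been given the same Source ID, the two patients would end up sharing a single key once their data were combined.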

However, it’s not always so clear whether two different points of origin for files and data tables should be formally considered different source systems and allocated different Source ID values.

Because Source ID values are readily observable in the data (in the Source ID fields on certain objects, and as part of the mastered identifiers of nearly every object), they are a convenient mechanism for communicating the origin of the data to users. To take advantage of this, Source IDs are created manually by users in Ursa Studio, ideally using intuitive, recognizable values, rather than being assigned automatically.

Key Diagnostics / Heuristics

  1. Do the source files or tables originate from different external organizations (e.g., claims files from two different payors)? If so, it’s probably appropriate to assign them different Source ID values.

  2. Do the source files or tables originate from different internal but distinct (and unintegrated) systems (e.g., from a payroll system and a billing system)? If so, it’s probably appropriate to assign them different Source ID values.

  3. Are the database identifiers used for most concepts (e.g., patient, provider, service, claim, etc.) consistent across the source files or tables? If so, that would tend to support considering the files or tables to be from the same source system.

  4. Do the source files or tables have the same data freshness and/or data completeness characteristics? If so, that would tend to support considering the files or tables to be from the same source system.
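One way to work through the four questions above is to record the answers and list which ones point toward separate Source IDs and which point toward a shared one. The sketch below is only a checklist aid: the class, field, and function names are hypothetical, and it deliberately does not make a final decision, since the heuristics are suggestive rather than conclusive.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SourceComparison:
    """Answers to the four diagnostic questions for a pair of data origins."""
    different_external_orgs: bool          # question 1
    distinct_internal_systems: bool        # question 2
    consistent_identifiers: bool           # question 3
    same_freshness_and_completeness: bool  # question 4

def weigh_evidence(c: SourceComparison) -> Tuple[List[str], List[str]]:
    """Return (reasons favoring different Source IDs, reasons favoring the same one)."""
    favors_different: List[str] = []
    favors_same: List[str] = []
    if c.different_external_orgs:
        favors_different.append("different external organizations")
    if c.distinct_internal_systems:
        favors_different.append("distinct, unintegrated internal systems")
    if c.consistent_identifiers:
        favors_same.append("consistent database identifiers")
    if c.same_freshness_and_completeness:
        favors_same.append("same freshness and completeness")
    return favors_different, favors_same

# Hypothetical comparison of a payroll system and a billing system.
comparison = SourceComparison(
    different_external_orgs=False,
    distinct_internal_systems=True,
    consistent_identifiers=False,
    same_freshness_and_completeness=False,
)
favors_different, favors_same = weigh_evidence(comparison)
```

Here the only evidence collected favors assigning different Source ID values, which matches the guidance in question 2.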

Detailed Implementation Guidance

  1. New Source ID values can be added in the Integration Manager zone (see the relevant Product Manual page).

  2. Shorter Source IDs are preferred. Source ID values are prepended to local keys to generate unique data model keys, and longer keys make manual review of data less convenient and take up more space in the database. (Most modern databases do not suffer meaningful performance penalties handling joins or restrictions on longer text keys, but your mileage may vary.)

  3. It is often convenient and intuitive – and therefore generally a recommended approach – to identify external source systems using the name of the organization sending the data. (For example, setting Source ID to “HUM” for files in a particular Humana data package, or to “CMS” for files received from CMS.) However, the same external organization can sometimes send multiple data packages from independent systems that call for different Source ID values. Ideally, Source IDs should be both intuitive and specific enough to protect against future collisions of this sort. For example, “HUM” might be fine for Humana, since a second, independent data package from Humana may be unlikely, but using “CCLF” instead of “CMS” for CCLF data received from CMS avoids problems if and when a different (i.e., independent of CCLF) data package from CMS must be integrated. That said, it is also fine to simply assign Source ID values like “CMS1”, “CMS2”, etc.
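The naming guidance above can be sketched as a small registry that rejects duplicate or overly long values. This is not how the Integration Manager works internally; the function name, the length limit of 6 characters, and the descriptions are assumptions chosen purely for illustration.

```python
from typing import Dict

# Assumed soft limit for illustration only; the guidance simply prefers shorter values.
MAX_SOURCE_ID_LENGTH = 6

def register_source_id(registry: Dict[str, str], source_id: str, description: str) -> None:
    """Add a Source ID to the registry, rejecting duplicates and overly long values."""
    if source_id in registry:
        raise ValueError(
            f"Source ID {source_id!r} is already assigned to {registry[source_id]!r}"
        )
    if len(source_id) > MAX_SOURCE_ID_LENGTH:
        raise ValueError(
            f"Source ID {source_id!r} is longer than {MAX_SOURCE_ID_LENGTH} characters"
        )
    registry[source_id] = description

registry: Dict[str, str] = {}
register_source_id(registry, "HUM", "Humana data package")
register_source_id(registry, "CCLF", "CCLF data received from CMS")
# A later, independent CMS package gets its own value without clashing with CCLF.
register_source_id(registry, "CMS2", "Second, independent CMS data package")

# Reusing an existing value for a different package is rejected.
try:
    register_source_id(registry, "CCLF", "Another package mistakenly reusing CCLF")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Choosing “CCLF” rather than “CMS” for the first package leaves room for further CMS packages without having to rename anything later.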

Examples

Example 1: Multiple internal and external data sources

A provider organization receives claims data from three payors (Cigna, Humana, and United), and generates data files from two internal systems: a vendor-provided EMR and a home-grown consumer-facing web portal branded as "PatientLink". The EMR and PatientLink are not well integrated; for example, they use different internal identifiers for patients, providers, and other concepts.

This scenario likely calls for five distinct Source IDs. The three payor extracts are uncontroversially different source systems, and the two internal systems seem distinct enough to warrant different Source ID values, for example to facilitate mastering of patient and provider identifiers between the two systems.

The following Source ID values would be sensible:

CIG
HUM
UH
EMR
PL
