Determine which concepts require mastering
- Updated on 02 Nov 2022
Background and Strategy
Data mastering is the process of identifying and reconciling different identifiers for the same instance of a concept – typically patients and providers, but potentially other concepts as well.
Mastering is often needed when different data sources include information about the same real-world entities; for example, when the same patient is a member of multiple health plans contributing claims data to a centralized database, and therefore appears – under different identifiers – in multiple source data objects. (Note, however, that mastering might also be needed to link distinct identifiers from the same data source.)
One benefit of mastering is that all the information known about a particular instance of something can be properly gathered in one place (using a single, mastered identifier) and referenced more easily. For example, in the case of a patient whose claims history is spread across multiple data sources, successful mastering means that the patient’s full history of procedures and conditions (spanning the different data sources) can be evaluated together (using the same Patient ID).
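This first benefit can be sketched in a few lines of plain Python. The crosswalk, identifiers, and procedure codes below are hypothetical; the point is only that once each source-local ID maps to a mastered Patient ID, claims from different sources collect under one key.

```python
# Hypothetical crosswalk: both source-local IDs were matched to Patient ID 1.
crosswalk = {"PLAN_A|A100": 1, "PLAN_B|B200": 1}

# Hypothetical claims arriving from two different data sources.
claims = [
    {"source_local_patient_id": "PLAN_A|A100", "procedure": "99213"},
    {"source_local_patient_id": "PLAN_B|B200", "procedure": "93000"},
]

# Translate each claim's source-local ID to its mastered Patient ID,
# gathering the patient's full cross-source history under a single key.
history = {}
for claim in claims:
    patient_id = crosswalk[claim["source_local_patient_id"]]
    history.setdefault(patient_id, []).append(claim["procedure"])

print(history)  # both procedures now appear under Patient ID 1
```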
A second benefit is that correct mastering removes duplicative records that could improperly inflate or deflate measure results.
Note that exact duplicate records – i.e., records with the same identifier and identical information – are not the appropriate target for mastering. Exact duplicates should simply be removed; no elaborate matching is necessary to identify them, because they share the same identifier value. Rather, mastering should target “duplicative” records, which carry different identifiers but refer to the same real-life instance of something. For example, consider two records for the same patient in a patient table, captured a year or two apart, with different identifiers and perhaps different addresses, but with enough similarities – e.g., matching social security numbers – to allow the mastering logic to connect the dots.
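The distinction can be illustrated with a small sketch. All field names and the SSN-based matching rule here are hypothetical simplifications, not Ursa Studio's actual matching logic: exact duplicates fall out of simple de-duplication, while duplicative records require a matching rule.

```python
# Hypothetical patient records; the second is an exact duplicate of the
# first, while the third is "duplicative": a new ID for the same person.
records = [
    {"source_local_patient_id": "A100", "ssn": "123-45-6789", "address": "12 Oak St"},
    {"source_local_patient_id": "A100", "ssn": "123-45-6789", "address": "12 Oak St"},
    {"source_local_patient_id": "B200", "ssn": "123-45-6789", "address": "99 Elm Ave"},
]

# Exact duplicates share the same identifier and identical information,
# so plain de-duplication removes them; no matching logic is needed.
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in records}]

# Duplicative records need matching logic. Here, a toy rule: records
# sharing an SSN are assumed to represent the same patient.
by_ssn = {}
for r in deduped:
    by_ssn.setdefault(r["ssn"], []).append(r["source_local_patient_id"])

matched_groups = [sorted(ids) for ids in by_ssn.values() if len(set(ids)) > 1]
print(matched_groups)  # [['A100', 'B200']]
```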
In Ursa Studio, mastering is performed by Integrator objects. The resulting table takes the form of a crosswalk from each distinct “Source Local” identifier to a mastered data model key (e.g., Patient ID). Seeing a many-to-one mapping from multiple Source Local Patient ID values to the same Patient ID value, for example, would represent a scenario in which those multiple Source Local Patient ID values were matched to each other and determined to represent the same patient, who would be assigned, naturally, a single Patient ID.
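The shape of such a crosswalk, and what a many-to-one mapping in it signifies, can be sketched as follows. The identifier values are invented for illustration and the dict stands in for the crosswalk table an Integrator object would actually produce.

```python
# Hypothetical crosswalk output: each distinct Source Local Patient ID
# maps to a mastered Patient ID.
crosswalk = {
    "PLAN_A|A100": 1,
    "PLAN_B|B200": 1,  # matched to A100: same patient, same Patient ID
    "PLAN_A|A300": 2,
}

# Inverting the crosswalk reveals the many-to-one mappings: any mastered
# Patient ID with multiple source-local IDs represents a set of records
# that were matched and determined to be the same patient.
mastered = {}
for local_id, patient_id in crosswalk.items():
    mastered.setdefault(patient_id, []).append(local_id)

for patient_id, local_ids in sorted(mastered.items()):
    print(patient_id, sorted(local_ids))
```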
Because it is very common to need to master patient and provider data, the Core Data Model comes pre-stocked with two such objects: Source Local Patient ID Crosswalk to Data Model Patient ID (for patients) and Source Local Provider ID Crosswalk to Data Model Provider ID (for providers). As a general rule, it is appropriate to master patients and providers in most integrations.
Determining which other source system concepts require mastering can be a judgment call, but the following three questions are useful to consider:
First, is there a dedicated object in the destination data model to store some or all instances of that concept? For example, the Core Data Model includes Natural Objects for patients, providers, and health plans, but does not include a Natural Object for employers. Until an Employers Natural Object is created, there isn’t the capacity – nor likely a pressing need – to master employer information found in the source system, because there will not be a data model key (e.g., Employer ID) under which to unify the different local identifiers (e.g., Source Local Employer ID) referring to the same real-life employer organization.
Second, how likely would it be, given the current and expected future data sourcing, to encounter multiple records with different information (identifier fields or otherwise) for the same real-life instance? In other words, is there anything to master? For example, it would be rare to find two or more records for the same claim in any combination of payor data extracts. A Humana claim will not be found in a Cigna data feed, nor is it likely (though perhaps not impossible) for a claim in one Humana extract to be found “again” under a different identifier in another Humana extract. The vast majority of source system concepts are excluded from mastering under this rationale.
Third, if a particular concept were not mastered, how would the resulting analytic deliverables be affected in terms of accuracy and usefulness? For example, it might be the case that there are no current plans to use a particular concept – say, employer – in which case it would be defensible to save the effort of mastering.
An extreme version of this third criterion – which does arise from time to time – is a requirement that certain concepts should not be mastered. For example, consider a data environment containing data from multiple payor organizations, in which new analytics must mirror certain legacy reports currently in use within those organizations; using patient mastering to draw together claims for the same patient originating from different payors’ data extracts would likely produce results that disagree with those legacy reports. So even though mastering would yield a more complete picture of each patient’s utilization and clinical history, the usefulness of the resulting analytics might actually suffer.
Key Diagnostics / Heuristics
For each candidate source system concept being considered for mastering, is there a dedicated object in the destination data model to store some or all instances of that concept? If not, that candidate concept should probably not be mastered.
For each candidate source system concept being considered for mastering, how likely would it be, given the current and expected future data sourcing, to encounter multiple records with different information (identifier fields or otherwise) for the same real-life instance? The more likely it is to encounter such records, the more important it is to master that concept.
For each candidate source system concept being considered for mastering, if it were not mastered, how would the resulting analytic deliverables be affected in terms of accuracy and usefulness? The more accuracy and usefulness are improved by mastering, the more important it is to master that concept.
As a general rule, patients and providers should usually be mastered. Even if there is only one data source, that source might contain duplicative records for the same patient or provider with different identifiers.
Is the concept in question considered more a “fact” or “dimension” (using the fact vs. dimension distinction commonly used to organize data warehouse tables)? Mastering should primarily be considered for dimensions; mastering is seldom needed for facts. (Note that even if different fact records describe the same real-life event in a patient’s life – in the way, perhaps, an office visit might generate an appointment record in a scheduling system and a bill record in a billing system – it is usually more appropriate to consider those records to reflect different types of real-life events – e.g., the event of scheduling an appointment and the event of generating a bill – for which mastering would therefore not be appropriate.)