Determine what Semantic Mapping objects are needed

Background and Strategy

Semantic mapping is the process by which source data are conformed to the destination data model's semantic conventions – i.e., standardized concepts and field names, as well as standardized field values and data types.

For example, a semantic convention might be that the field containing the ICD-10-CM code documented as the admitting diagnosis on an institutional claim should be named Admitting Diagnosis ICD-10-CM Code, that all ICD-10-CM codes should be expressed as text values, and that codes longer than three characters should include a decimal point after the third character. The process of semantic mapping would rename the appropriate source system field(s) to Admitting Diagnosis ICD-10-CM Code, ensure that values are cast as a standard text data type, and insert the appropriate decimal point when necessary.
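
Purely as an illustration (outside of Ursa Studio, and with a hypothetical source column name admit_dx), the field-level work described above might look like the following sketch in Python/pandas:

```python
import pandas as pd

# Hypothetical source extract: admitting-diagnosis codes with no decimal
# points and inconsistent casing.
source = pd.DataFrame({"admit_dx": ["e119", "I10", "Z0000"]})

def format_icd10cm(code) -> str:
    """Cast to text and insert a decimal point after the third character
    when the code is longer than three characters."""
    text = str(code).strip().upper().replace(".", "")
    return text if len(text) <= 3 else f"{text[:3]}.{text[3:]}"

# Rename the field to the standard name and standardize its values.
mapped = source.rename(columns={"admit_dx": "Admitting Diagnosis ICD-10-CM Code"})
mapped["Admitting Diagnosis ICD-10-CM Code"] = (
    mapped["Admitting Diagnosis ICD-10-CM Code"].map(format_icd10cm)
)
print(mapped)  # -> E11.9, I10, Z00.00 under the standardized column name
```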

Semantic mapping is not responsible for what might be considered "structural" changes in the data, like joining source system tables together to form wider, denormalized tables, or reshaping wide-form data to long form; and the process of semantic mapping always preserves the original grain size of the source data objects. (These "structural" transformations are performed later in the integration, largely in the Local Transform Layer.)

In practice, one can think of semantic mapping as enforcing field-level standardization, while subsequent stages of integration transform the table structure of the source data to conform to the desired shape and grain size. This sequencing is not arbitrary: the later structural transformation benefits from the earlier semantic mapping, which ensures that field names are recognizable and field values are far more predictable, making the data easier to manipulate.
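
To make the sequencing concrete, here is a hypothetical sketch of a later structural step -- reshaping wide-form diagnosis columns to long form -- which is straightforward precisely because semantic mapping has already standardized the column names. The table and field names are illustrative, and the reshape itself would happen downstream (e.g., in the Local Transform Layer), not during semantic mapping:

```python
import pandas as pd

# Hypothetical output of semantic mapping: standardized, predictable names.
claims = pd.DataFrame({
    "Claim ID": ["C1", "C2"],
    "Diagnosis 1 ICD-10-CM Code": ["E11.9", "I10"],
    "Diagnosis 2 ICD-10-CM Code": ["Z00.00", None],
})

# A later structural transformation: reshape wide-form diagnosis columns to
# long form (the grain changes from claim to claim-diagnosis, which is why
# this step belongs downstream of semantic mapping).
diagnosis_columns = [c for c in claims.columns if c.endswith("ICD-10-CM Code")]
long_form = (
    claims.melt(
        id_vars="Claim ID",
        value_vars=diagnosis_columns,
        var_name="Diagnosis Position",
        value_name="ICD-10-CM Code",
    )
    .dropna(subset=["ICD-10-CM Code"])
)
print(long_form)
```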

Semantic mapping should only be skipped when the source data already conform to all the desired semantic conventions. In practice, nearly every integration will benefit from semantic mapping, even for small tables with only a few fields, or tables whose fields don't seem to require too much cleaning. Don't underestimate the value of a dedicated step that ensures all the field-level standards have been applied properly. Many pernicious bugs have emerged from a failure to do so.

Assuming the intention is to perform semantic mapping, the next question is: what Semantic Mapping objects should be created?

As stated above, semantic mapping preserves the grain size of the source data. The typical approach is therefore to create one Semantic Mapping object for each source data object needed for the integration -- a one-to-one relationship between source objects and Semantic Mapping objects.

A special case is when the same original source dataset is partitioned into multiple source data objects with identical structures (i.e., same columns, same data types, same grain size, but perhaps different record counts). In this case, it is better to re-consolidate those multiple objects back into a single Semantic Mapping object -- resulting in a many-to-one relationship between source objects and Semantic Mapping objects.
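
A minimal sketch of that re-consolidation, assuming the partitions share an identical structure (the partition names and contents below are hypothetical):

```python
import pandas as pd

# Hypothetical source objects: the same claims dataset partitioned by year,
# each partition with identical columns, data types, and grain size.
claims_2021 = pd.DataFrame({"Claim ID": ["C1"], "Paid Amount": [125.00]})
claims_2022 = pd.DataFrame({"Claim ID": ["C2", "C3"], "Paid Amount": [310.50, 42.75]})

# Re-consolidate the partitions into a single object feeding one Semantic
# Mapping object (many-to-one). Grain size is preserved: one row per claim.
consolidated = pd.concat([claims_2021, claims_2022], ignore_index=True)
print(consolidated)
```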

Key Diagnostics / Heuristics

  1. Are any of the source data objects Import objects? If so, semantic mapping is always appropriate, because all fields imported from flat files are stored in the database as text and must be cast to the desired data types (e.g., date, integer, float); see the sketch after this list.

  2. Are the source data objects all Registered Table objects whose field names and values already conform to the desired semantic standards (field names, data types, code formats, etc.) used by developers and analysts? (For example, if Ursa Studio is installed on top of an established enterprise data warehouse.) If so, semantic mapping is probably not necessary.
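
For diagnostic #1, the following hypothetical sketch shows the kind of casting that semantic mapping would handle when every field arrives from a flat-file import as text (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical Import object: every field arrives as text.
imported = pd.DataFrame({
    "Service Date": ["2022-01-15", "2022-02-03"],
    "Units": ["3", "1"],
    "Paid Amount": ["125.00", "310.50"],
})

# Semantic mapping casts each field to its intended data type.
typed = imported.copy()
typed["Service Date"] = pd.to_datetime(typed["Service Date"])
typed["Units"] = typed["Units"].astype(int)
typed["Paid Amount"] = typed["Paid Amount"].astype(float)
print(typed.dtypes)  # -> Service Date: datetime64[ns], Units: int64, Paid Amount: float64
```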

