Gather information on the source data
  • 01 Nov 2022
  • 2 Minutes to read
  • Dark
    Light

Gather information on the source data

  • Dark
    Light

Article Summary

Background and Strategy

“Source data” refers to any data that will be integrated by Ursa Studio and used to populate downstream objects in the client database. “Integration” describes the process of conforming source data to a standard and then pushing the conformed data into the Natural Object layer, often comingling there with data from other sources.

The first step in a successful integration is to gather all information available on the source data that is targeted for integration.

Ursa Studio is able to integrate tabular data in the form of either flat files (delimited or fixed width) or tables in an existing relational database that its service account has sufficient privileges to read from. It is designed to ingest and run data in batches, either as a full or partial (incremental) load. Within these parameters, Ursa Studio is able to accommodate virtually any source data and refresh schedule.

In contrast, Ursa Studio is not generally able to ingest non-tabular data. However, there is tooling that allows certain types of electronic documents – principally JSON documents – stored as text (as field value in a flat file or database table) to be parsed and mapped into traditional relational database structure.

Ursa Studio is not able to ingest individual records received in real-time. If data are needed from a source system that natively generates individual records in real time, those records will need to be collected in a staging file or table a to be imported together in batch format.

Key Diagnostics / Heuristics

  1. Are the source data available as flat files or as tables in the client database, or some combination of both?
  2. Do all the source data in flat files contain tabular data?
  3. For source data in flat files, what is the file names and formats – delimiter, quote or escape character(s), field start position and length (for fixed-width files), etc.?
  4. For source data in a relational database, what are schema and table names?
  5. For each file or table, what does each record represent?
  6. For each file or table, what does each column represent?
  7. For each column containing coded values, what are interpretations for each code?
  8. What special characters or strings are used to denote missing data (or other special concepts)?
  9. What is the schedule on which updated source data become available
  10. What is the desired schedule for refreshing the client database?
  11. What is the time lag between a record existing in its source system and becoming available to load into the client database?
  12. What is the expected record count (exact or approximate) of each table or flat file?

Detailed Implementation Guidance

None

Examples

None


Was this article helpful?