Create Import objects and perform initial loads
- Updated on 09 May 2023
Background and Strategy
In contrast to the relatively simple Registered Tables discussed in the previous task, Import Objects can be configured in many different ways. The Ursa Studio product manual provides detailed information on this screen, but five important dimensions of configuration are discussed in detail below.
1. Where to find the file(s)
(An earlier task, Determine appropriate storage locations for flat files, discussed where flat files should be placed to facilitate automated imports; here we describe how to access those files.)
Depending on the host environment, Ursa Studio offers different options for where source files can be stored to be accessible for import. Most Ursa Studio instances are hosted in AWS or Azure; in these deployments, files are typically retrieved from an S3 bucket or an Azure Storage Container, respectively. The control used to select the appropriate bucket or container is labelled Bucket Name or Storage Container Name, respectively.
An unpleasant quirk of the APIs governing access to S3 buckets and Azure Storage Containers is that searching for files can only be done using the filepath “prefix”. For example, if one were looking for the file “Folder 1/Subfolder 1A/my_test_file.csv”, a search string of “my_test_file” would not return a match, but “Folder 1/Subfolder 1A/my_test” would. The control labelled Filename Prefix Filter, found in the Import Object Configuration panel, allows users to specify a prefix filter, which then populates the Filename dropdown control with all matching files.
The prefix filter is especially important because of another idiosyncrasy of the AWS API (which, for consistency, Ursa Studio imposes on Azure as well): no more than 1000 filenames are returned from the bucket or container, whether or not a prefix filter is used. If the number of matching files exceeds 1000, the prefix filter must reduce that number to below 1000; otherwise, one of the desired files may not make it into the 1000 returned filenames. In other words: the more specific the prefix filter, the better.
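To make the prefix behavior concrete, the sketch below queries the underlying S3 API directly via boto3. This illustrates the API's behavior, not how Ursa Studio itself is implemented; the bucket name and filepaths are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Prefix matching anchors at the START of the full filepath (key):
# "my_test_file" would match nothing, but a path-anchored prefix works.
response = s3.list_objects_v2(
    Bucket="acme-ursa-import",               # hypothetical bucket name
    Prefix="Folder 1/Subfolder 1A/my_test",  # matches the example file above
)

# list_objects_v2 returns at most 1000 keys per call, so an overly broad
# prefix risks silently omitting files beyond the first 1000.
for obj in response.get("Contents", []):
    print(obj["Key"])
```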
2. How to impute the Source Data Effective Date from the filename of the imported file(s)
The freshness of a source data file is captured by its Source Data Effective Date, representing the “as of” date of the data in the originating source environment at the moment the file was generated. (Note that this concept is distinct from the Record Last Updated Date, which represents the most recent date on which a record was created or modified; among other differences, the Record Last Updated Date is a characteristic of each record, not of the entire file, as is the case with the Source Data Effective Date.) In the case of a single-file, manual load, the Source Data Effective Date can be entered by hand; for all other scenarios (automated imports, batch imports of multiple files at once), each file’s Source Data Effective Date must be imputed from its filename. Ursa Studio provides a means to perform this imputation with the Date Format Mask and Mask Start Character controls.
In the case of a batch load of multiple files (see below for more details on this), the imputation of Source Data Effective Date is performed using the same configuration for each file in the batch. This requires that all the filenames in the batch are named consistently, with the date in the same location in every file.
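As an illustration of this kind of imputation, consider the Python sketch below. Ursa Studio's actual mask syntax may differ (strptime notation is used here), and the filenames, mask start, and format are hypothetical.

```python
from datetime import datetime

def impute_effective_date(filename: str, mask_start: int, fmt: str) -> datetime:
    """Read a date out of a filename, given a 1-based start character and a
    date format. This mirrors the idea behind the Date Format Mask and Mask
    Start Character controls, using Python's strptime syntax for the mask."""
    # Determine how many characters the formatted date occupies by rendering
    # a sample date with the same format.
    width = len(datetime(2000, 1, 1).strftime(fmt))
    raw = filename[mask_start - 1 : mask_start - 1 + width]
    return datetime.strptime(raw, fmt)

# Consistent naming means one configuration works for every file in a batch:
for name in ["claims_extract_20230131.csv", "claims_extract_20230228.csv"]:
    print(impute_effective_date(name, mask_start=16, fmt="%Y%m%d"))
```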
3. How to parse the column structure of the file(s)
Ursa Studio can import data from delimited flat files and fixed-width files.
For delimited files, users must identify the delimiter character (or select Tab-Delimited) and a “quote” character used to enclose field values that might contain the delimiter.
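As a minimal illustration of the delimiter-and-quote convention (not of Ursa Studio's parser; the pipe delimiter and sample data are hypothetical):

```python
import csv
import io

# The quote character protects field values that contain the delimiter.
raw = io.StringIO('member_id|name|note\n1001|"Doe|Jane"|ok\n')
for row in csv.reader(raw, delimiter="|", quotechar='"'):
    print(row)
# -> ['member_id', 'name', 'note']
# -> ['1001', 'Doe|Jane', 'ok']   (the embedded pipe survives intact)
```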
For fixed-width files, users must provide a data dictionary identifying the column names and the start character or column width of each field. The data dictionary must be loaded from a local source (e.g., the user's computer), and can take one of two formats: a CSV file with two columns (column_name and column_width), or a CSV file with three columns (column_name, column_start_index, and column_end_index). (A column_start_index value of 1 implies that the field value starts on the first character of each line.)
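The following sketch shows how a two-column data dictionary could drive fixed-width parsing; the column names and widths are hypothetical, and this illustrates the format rather than Ursa Studio's internal implementation:

```python
import csv
import io

# A data dictionary in the two-column format: column_name, column_width.
dictionary_csv = "column_name,column_width\nmember_id,6\ndob,8\nstate,2\n"
spec = [(row["column_name"], int(row["column_width"]))
        for row in csv.DictReader(io.StringIO(dictionary_csv))]

def parse_line(line: str) -> dict:
    out, pos = {}, 0
    for name, width in spec:
        out[name] = line[pos:pos + width].strip()
        pos += width
    return out

print(parse_line("10023119800214NY"))
# -> {'member_id': '100231', 'dob': '19800214', 'state': 'NY'}
```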
4. How to load multiple files incrementally
When performing an initial load, it is not uncommon to need to import multiple historical files to “catch up” to the current day. These files may be loaded one at a time or as a single batch run (see below for more detail on this option), but either approach may encounter the need to perform an “incremental load”.
In general, when updating an existing table with a new data file, there are two commonly used approaches: first, in a “full refresh”, all previously loaded records are dropped and the table is completely overwritten using the latest file; second, in an “incremental load”, new records are appended to the existing records. Incremental loads can optionally be configured to replace existing records when a new record has the same key value(s) as an existing record; effectively, the original record is updated with the (presumably more recent) information from the new record.
Full refreshes are conceptually simple (delete everything and start over), but each new file must then contain every historical record that needs to be retained. In contrast, incremental loads are somewhat more complicated to set up, but may save processing time if only a small fraction of records have been added or modified since the last load. (It’s also possible for incremental loads to be slower than full refreshes, due to the overhead associated with the delete and insert operations, so testing both approaches is advised.)
Import Objects can be configured to perform either of these two approaches, including the optional replacement operation in the incremental load approach. The choice of full vs. incremental refresh is made by selecting the Overwrite existing table or Append to existing table option, respectively; if Append to existing table is selected, the user can additionally specify one or more fields in the object as “append keys”, which are used to identify matching records for replacement.
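The append-key semantics can be pictured with a small sketch (sqlite3 is used purely for illustration; this is not Ursa Studio's actual SQL, and the table and key are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE claims (claim_id TEXT, status TEXT)")
con.executemany("INSERT INTO claims VALUES (?, ?)",
                [("C1", "open"), ("C2", "open")])

# Incremental load with claim_id as the append key: incoming records replace
# existing records that share the key, and everything else is appended.
new_records = [("C2", "paid"), ("C3", "open")]
con.executemany("DELETE FROM claims WHERE claim_id = ?",
                [(cid,) for cid, _ in new_records])
con.executemany("INSERT INTO claims VALUES (?, ?)", new_records)

print(con.execute("SELECT * FROM claims ORDER BY claim_id").fetchall())
# -> [('C1', 'open'), ('C2', 'paid'), ('C3', 'open')]
```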
5. How to perform a batch load of multiple files at once
When a historical load involves a large number of files – e.g., multiple years of monthly extracts – it is convenient to load a batch of multiple files at once with a single run of an Import Object.
This can be accomplished in Ursa Studio by specifying a range of files to load using the Range End Filename control, which results in the import of all files with filepaths “between” – when arranged in alphabetical order – the file specified in the Filename control (to be loaded first) and the file specified in the Range End Filename control (to be loaded last). Note also that both bookend filepaths can be set to use the “latest” file matching the pattern defined in the earlier controls, effectively allowing users to define a growing or rolling catchment area for files to be included in the batch.
In case this range includes some files not intended for import, a Secondary Filter for Range may be specified, which further filters the resulting files to only those that contain the filter string. (Unlike the Filename Prefix Filter control described above, this filter is applied against any part of the filepath, not just the start of it.) The final list of files to be included in the batch import (i.e., resulting from both the range definition and the secondary filter) are displayed in the Preview of Files in Range subpanel.
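The selection logic amounts to an alphabetical range plus a substring filter, as in this sketch (the filenames are hypothetical, and the real implementation may differ):

```python
# All filenames matching the prefix filter, arranged alphabetically.
files = sorted([
    "extracts/2023-02_eligibility.csv",
    "extracts/2023-01_claims.csv",
    "extracts/2023-03_claims.csv",
    "extracts/2023-02_claims.csv",
])

start = "extracts/2023-01_claims.csv"   # Filename control
end = "extracts/2023-03_claims.csv"     # Range End Filename control
secondary = "claims"                    # Secondary Filter for Range: matches
                                        # anywhere in the filepath

batch = [f for f in files if start <= f <= end and secondary in f]
print(batch)
# -> ['extracts/2023-01_claims.csv', 'extracts/2023-02_claims.csv',
#     'extracts/2023-03_claims.csv']
# (the eligibility file falls inside the alphabetical range but is removed
#  by the secondary filter)
```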
Note that, in a batch load, as one would expect, all files after the first are imported using the “incremental load” approach. As when configuring a single-file incremental load, the user can optionally identify the key fields that will trigger a replacement; these fields can be specified in the Append Key for Subsequent Files control.
(It might seem like an unwieldy design choice to identify a batch of files by identifying a start and end filename and then gathering up those and all the filenames in between, with an optional secondary filter applied. Why not just allow the user to manually select the files they want? The problem with that manual approach is that it would not provide Ursa Studio with the information to perform ongoing automation of batch runs, which is often desirable, whereas the rules-based approach currently implemented does.)
Key Diagnostics / Heuristics
(1) Does the import package contain multiple files with the same column structure representing different historical periods of the same content? If so, these should probably be loaded into the same Import Object, using the batch load functionality to save time.
(2) If the intent is to impute the Source Data Effective Date from the filename, do the filepaths of the incoming files include a data effective date in a consistent location? If not, this should be remedied, for example by renaming the files before import.
Detailed Implementation Guidance
The bucket or container names in AWS or Azure deployments are not governed by Ursa Studio and so are not standardized, but the typical convention is to use [client name]-ursa-import as the name of the bucket or container designated to stage files for import.
Users can elect not to impute the Source Data Effective Date from the filepath. However, Ursa Studio requires that some date be assigned to each import: either enter one manually, or the current date will be used as a default.
Ursa Studio can unzip and import files that have been compressed in zip or gzip format; to do so, select the is Zipped checkbox.
The first 11 rows of the selected file (typically the header row and first 10 records) can be reviewed using the Preview File button.
It is often convenient to populate the Filename Prefix Filter control by first selecting the desired file in the Filename control and then pruning the full filepath down to an appropriate prefix substring. There is a clipboard icon immediately to the right of the Filename control that will copy its contents to the clipboard to facilitate this.
The controls governing batch loading are revealed by clicking the “Multi-Import Range of Files” link at the bottom of the Configuration panel.
The appropriate Source ID should be selected in the Source control.
Once the desired configuration is set, the import is initiated by clicking the Upload file icon in the upper right corner of the screen. The current configuration can be saved without initiating an import using the Save object without uploading icon to its immediate left.