Determine appropriate storage locations for flat files
  • 01 Nov 2022


Background and Strategy

Depending on the host environment, Ursa Studio offers different options for where source files can be stored to be accessible for import. Most Ursa Studio instances are hosted in AWS or Azure; in these deployments, files are typically retrieved from an S3 bucket or an Azure Storage Container, respectively. In on-premise deployments, files are loaded from a local folder on or directly accessible by the user’s computer.

When deployed in AWS or Azure environments, Ursa Studio has the capability to automate the ongoing importing of flat files. This capability relies on Ursa Studio’s ability to (1) determine that a file is a member of a known “file family” – that is, a collection of files with the same column structure representing repeated extracts of the same information over time; and (2) determine the most recent file in a file family. Ursa Studio must be able to make both of these determinations based exclusively on the file’s filepath – i.e., the concatenation of its folder location and filename. Consequently, some thought and planning must usually go into naming the files and the folders that contain them.

The Import object’s features related to automated file import are described in detail in the product manual, but the following guidelines represent the most important considerations:

  1. The source data effective date must be included in the filename in the same standard format (e.g., YYYY-MM-DD) across all files of the file family, and consistently in the same location in the filepath – either a fixed number of characters from the start of the filepath, or at the very end. When these conditions are not met, automated incremental imports are not possible, because Ursa Studio cannot identify the most recent file among others in the same file family.

  2. Ideally, the remainder of the filepath (i.e., the portions before and after the source data effective date) should be identical for all files in the same file family, and distinct from that of all other file families.

  3. Barring this, the file family should at least be uniquely identified by the portion of the filepath preceding any variable substring (such as the source data effective date) in the filepath. (Wildcard characters may be used to represent the variable-but-inconsequential portions of the filepath following the source data effective date.) When this condition is met, it will be possible to isolate all members of the family using only a prefix filter – i.e., a filter on the first n characters of the filepath, which is the only filter possible in the AWS and Azure APIs, and without which the number of matched files might exceed the API ceiling of 1000, meaning the desired files might not be found.
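The conditions above can be illustrated with a short sketch. This is not Ursa Studio's actual implementation; the filepaths and the 22-character date offset are hypothetical, standing in for a listing returned by a prefix-filtered API call. Given such a listing, identifying the most recent member of a file family reduces to extracting the date token from its fixed position:

```python
from datetime import date

# Hypothetical filepaths returned by a prefix-filtered listing
# (e.g., an S3 listing with prefix "claims/medical_claims_").
filepaths = [
    "claims/medical_claims_2022-01-04.txt",
    "claims/medical_claims_2022-02-07.txt",
    "claims/medical_claims_2022-03-02.txt",
]

def effective_date(filepath: str) -> date:
    """Extract the YYYY-MM-DD token from its fixed position in the filepath."""
    # In this example the date begins 22 characters from the start of the
    # filepath; the offset must be identical for every file in the family.
    token = filepath[22:32]
    year, month, day = (int(part) for part in token.split("-"))
    return date(year, month, day)

most_recent = max(filepaths, key=effective_date)
print(most_recent)  # claims/medical_claims_2022-03-02.txt
```

Because the date sits at a fixed offset and uses a fixed-width format, the comparison never needs to inspect any other part of the filepath.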

Generally, the Ursa Studio features that perform automated import require filename consistency to operate smoothly. In some cases, when there is too much variation between filenames in the same file family, and when renaming files is not convenient or possible, an appropriate solution is to split some files off into their own Import object – effectively, forming smaller file families with more internal filename consistency – and to union these partitioned groups back together later in the data journey. This reunification is straightforward because a single Semantic Mapping object can load any number of identically structured Import objects.

Note that the filepath of the flat file is captured by Ursa Studio whenever a file is loaded via an Import object, and the filepath can be invoked during semantic mapping to generate new field values. So, for example, if the folder filepath includes a token denoting the time period its constituent files cover – say, a year, quarter, or month – that substring can be extracted and used to generate exact start and end calendar dates for the covered period.
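As a sketch of this technique, suppose a hypothetical folder convention encodes the covered quarter as a YYYY-Qn token (the token format and filepaths here are illustrative, not an Ursa Studio convention). The start and end dates of the period can then be derived from the filepath alone:

```python
import calendar
import re
from datetime import date

def period_bounds(filepath: str) -> tuple[date, date]:
    """Derive start and end dates from a hypothetical YYYY-Qn folder token."""
    match = re.search(r"(\d{4})-Q([1-4])/", filepath)
    if match is None:
        raise ValueError(f"No quarter token in {filepath!r}")
    year, quarter = int(match.group(1)), int(match.group(2))
    first_month = 3 * (quarter - 1) + 1          # Q1 -> 1, Q2 -> 4, ...
    last_month = first_month + 2
    last_day = calendar.monthrange(year, last_month)[1]  # handles 28/29/30/31
    return date(year, first_month, 1), date(year, last_month, last_day)

print(period_bounds("2022-Q1/membership_periods.txt"))
# (datetime.date(2022, 1, 1), datetime.date(2022, 3, 31))
```

A yearly or monthly token would work the same way; only the regular expression and the month arithmetic change.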

Finally, it is not always possible to anticipate all the ways in which these flat file location filepaths may need to be used, and so the best policy is often to start with a sensible initial folder structure and simply plan for the possibility that the filename or storage folder names might need to change as those needs present themselves during the integration process.

Key Diagnostics / Heuristics

  1. Is there a need to automate the import of flat files? If not, the location in which flat files are stored is not especially important (though it is generally useful to keep flat files organized in a thoughtful way).

  2. If automated import of flat files is needed, do the filenames contain a source data effective date in a consistent location? If not, the file naming convention will likely need to be changed to ensure they do.

Detailed Implementation Guidance

  1. The AWS and Azure APIs for filtering and selecting files are case-sensitive, so the case of characters used in flat file filenames should also be consistent whenever possible.
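Because these listing APIs match prefixes case-sensitively, a file whose casing drifts will silently drop out of its family. A minimal illustration, using hypothetical filenames and Python's equally case-sensitive `str.startswith` as a stand-in for the cloud prefix filter:

```python
filepaths = [
    "Claims/Medical_Claims_20220104.txt",
    "Claims/medical_claims_20220207.txt",  # inconsistent casing
]

prefix = "Claims/Medical_Claims_"
# Only the first file matches; the second is silently excluded,
# just as it would be by a case-sensitive cloud prefix filter.
matches = [fp for fp in filepaths if fp.startswith(prefix)]
print(matches)  # ['Claims/Medical_Claims_20220104.txt']
```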

Examples

Example 1: Well-formed filenames

A claims data package contains files for medical claims, pharmacy claims, and membership periods. The filenames in the initial tranche are:

medical_claims_20220104.txt
pharmacy_claims_20220104.txt
membership_periods_20220104.txt

New files are received each month; the following month’s files are named:

medical_claims_20220207.txt
pharmacy_claims_20220207.txt
membership_periods_20220207.txt

All these files can be stored in the same folder location because the file family is uniquely identified by the first part of the filename (i.e., before the variable YYYYMMDD date token), meaning a prefix filter can be used to narrowly identify the appropriate file family in each of the (three) Import objects. Unless the program runs for more than 1000 months (and older files are never removed from the folder), there will never be a problem retrieving the correct files from the AWS or Azure folder.
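The grouping logic described above can be sketched as follows. This is an illustration, not Ursa Studio's implementation: the family prefix is taken to be everything before the YYYYMMDD token, and because that date format sorts lexicographically, the most recent file per family falls out of a simple `max`:

```python
import re
from collections import defaultdict

filenames = [
    "medical_claims_20220104.txt",
    "pharmacy_claims_20220104.txt",
    "membership_periods_20220104.txt",
    "medical_claims_20220207.txt",
    "pharmacy_claims_20220207.txt",
    "membership_periods_20220207.txt",
]

families = defaultdict(list)
for name in filenames:
    # The family prefix is everything before the YYYYMMDD date token.
    prefix = re.match(r"(.+_)\d{8}\.txt$", name).group(1)
    families[prefix].append(name)

for prefix, members in sorted(families.items()):
    latest = max(members)  # YYYYMMDD sorts lexicographically
    print(prefix, "->", latest)
```

Note that this lexicographic shortcut works only because the date token is fixed-width and ordered most-significant-first; a format like MMDDYYYY would break it.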


Example 2: Substring identifying the file family is located after the date

A monthly claims data package contains the following:

extract_20220104_medical_claims.txt
extract_20220104_medical_claim_diagnoses.txt
extract_20220104_pharmacy_claims.txt
extract_20220104_membership_periods.txt
extract_20220104_members.txt
extract_20220104_providers.txt

New files are received each month; the following month’s files are named:

extract_20220207_medical_claims.txt
extract_20220207_medical_claim_diagnoses.txt
extract_20220207_pharmacy_claims.txt
extract_20220207_membership_periods.txt
extract_20220207_members.txt
extract_20220207_providers.txt

The files here are all consistently named within the six file families, but we may eventually run into trouble because the filenames alone do not allow the use of a prefix filter. (A prefix filter of “extract_” would provide no help in isolating any individual file family.) Eventually, we may run up against the 1000 file ceiling.

A sensible approach here would be to create folders for each file family. The following filepaths, for example, would allow the use of a prefix filter:

Medical Claims/extract_20220104_medical_claims.txt
Medical Claim Diagnoses/extract_20220104_medical_claim_diagnoses.txt
Pharmacy Claims/extract_20220104_pharmacy_claims.txt
Membership Periods/extract_20220104_membership_periods.txt
Members/extract_20220104_members.txt
Providers/extract_20220104_providers.txt

Example 3: Inconsequential but variable-within-family token is located before the family-identifying token

A claims package contains files covering members assigned to two different provider groups, coded ACO1 and ACO2 (the ACO1 and ACO2 files are identically structured):

aco1_medical_claims_20220104.txt
aco1_pharmacy_claims_20220104.txt
aco1_membership_periods_20220104.txt
aco2_medical_claims_20220104.txt
aco2_pharmacy_claims_20220104.txt
aco2_membership_periods_20220104.txt

We would like to treat the ACO1 and ACO2 files equivalently to avoid duplication of integration logic. (Note that we can always extract the identity of the attributed provider group from the “aco1” / “aco2” token in the filename and create it as a new field during semantic mapping.) Unfortunately, the position of the provider group tokens makes a family-specific prefix filter impossible.

This is roughly equivalent to the problem faced in the prior example, and the solution is the same:

Medical Claims/aco1_medical_claims_20220104.txt
Pharmacy Claims/aco1_pharmacy_claims_20220104.txt
Membership Periods/aco1_membership_periods_20220104.txt
Medical Claims/aco2_medical_claims_20220104.txt
Pharmacy Claims/aco2_pharmacy_claims_20220104.txt
Membership Periods/aco2_membership_periods_20220104.txt

This folder structure will allow the use of a prefix filter unique to each file family (e.g., the prefix filter “Medical Claims/” will uniquely identify all identically structured files containing medical claims).

Notably, it is not recommended to create different subfolders for ACO1 and ACO2, unless those subfolders are themselves children of higher-level folders that uniquely identify the file family.

