Models

Users can create models in the Advanced Analytics zone of Ursa Studio, then attach them to measures or objects. Models, like measures and objects, keep a full revision history for audit purposes and allow one-click recovery of any previously saved version.

Linear and logistic regressions are supported, as are bespoke JavaScript and Python models. For logistic regressions, users can specify a binarization threshold, with 0.5 as the default. In the case of a logistic regression, the latent variable is also calculated and saved alongside the prediction.
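
The relationship between the latent variable, the binarization threshold, and the final prediction can be sketched in Python as follows. This is a minimal illustration rather than Ursa Studio's implementation: it assumes the latent variable is the linear predictor (intercept plus weighted inputs), that a probability equal to the threshold counts as positive, and the field names are hypothetical.

```python
import math


def logistic_predict(row, intercept, coefficients, threshold=0.5):
    """Score one row with a logistic regression.

    `row` and `coefficients` are dicts keyed by (hypothetical) field name.
    """
    # Assumption: the "latent variable" is the linear predictor.
    latent = intercept + sum(coef * row[name] for name, coef in coefficients.items())

    # Sigmoid transform, written so math.exp never receives a large positive value.
    if latent >= 0:
        probability = 1.0 / (1.0 + math.exp(-latent))
    else:
        e = math.exp(latent)
        probability = e / (1.0 + e)

    # Assumption: a probability exactly at the threshold is treated as positive.
    prediction = 1 if probability >= threshold else 0
    return {"latent": latent, "probability": probability, "prediction": prediction}


result = logistic_predict(
    {"age": 60, "prior_admissions": 2},
    intercept=-4.0,
    coefficients={"age": 0.05, "prior_admissions": 0.8},
)
```

With the default threshold of 0.5, this row is classified as positive; raising the threshold to 0.7 flips the prediction while leaving the latent value unchanged.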

Two types of null management are supported: nulls can either be ignored or be filled in with the median value. Users can specify an intercept and any number of independent variables, each with its own coefficient.
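
As an illustration of the median-fill option combined with an intercept and coefficients, consider the sketch below. It assumes the median is taken over the non-null values of the rows being scored, which may not match how Ursa Studio computes it. The "ignore" option is not shown, and the field names are hypothetical.

```python
from statistics import median


def fill_nulls_with_median(rows, fields):
    """Return copies of `rows` with None replaced by each field's median."""
    medians = {}
    for field in fields:
        observed = [row[field] for row in rows if row[field] is not None]
        # A field with no observed values has no median; its nulls stay as None.
        medians[field] = median(observed) if observed else None
    return [
        {**row, **{f: medians[f] for f in fields if row[f] is None}}
        for row in rows
    ]


def linear_predict(row, intercept, coefficients):
    """Intercept plus the coefficient-weighted sum of the independent variables."""
    return intercept + sum(coef * row[name] for name, coef in coefficients.items())


rows = [
    {"age": 40, "visits": 2},
    {"age": None, "visits": 4},
    {"age": 60, "visits": None},
]
filled = fill_nulls_with_median(rows, ["age", "visits"])
predictions = [linear_predict(r, 1.0, {"age": 0.1, "visits": 0.5}) for r in filled]
```

Here the missing age is replaced by the median of 40 and 60, and the missing visit count by the median of 2 and 4, before the coefficients are applied.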

Bespoke Models

Bespoke JavaScript and Python models cannot be attached to measures and objects in the usual way. Instead, such models must be attached to objects of the "Bespoke Model" type.

The model is defined by a block of bespoke JavaScript or Python code. When run, this code generates a data frame, which is then persisted into a new table. Any code can be entered, but it may use only libraries that are currently available in Ursa Studio. Because bespoke code can be of arbitrary complexity and power, only Bespoke Author user types are allowed to enter and manage these objects.

During ELT, the code will be persisted to a file and invoked with a set of arguments. By convention, the first two arguments will be the filename of the incoming CSV data frame, and the filename of the outgoing CSV data frame. It is the responsibility of the bespoke code to handle these first two arguments appropriately. An example of this usage can be seen in the placeholder text that appears in the entry box for the bespoke code.
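
A bespoke Python script that honors this two-argument convention might look like the sketch below. It is not the placeholder text shipped with Ursa Studio. The "value" input field, the "score" output field, and the doubling logic are hypothetical, and only the standard library is used so that no assumption is made about which libraries are installed.

```python
import csv
import sys


def run(input_path, output_path):
    """Read the incoming CSV, add one computed column, write the outgoing CSV."""
    with open(input_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    for row in rows:
        # Hypothetical model logic: derive a score from a "value" field.
        row["score"] = 2 * float(row["value"])

    with open(output_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames + ["score"])
        writer.writeheader()
        writer.writerows(rows)


# First argument: incoming CSV filename. Second argument: outgoing CSV filename.
# The script does nothing unless both filenames were supplied.
if __name__ == "__main__" and len(sys.argv) >= 3:
    run(sys.argv[1], sys.argv[2])
```

Keeping the logic in a function that takes the two paths makes the script easy to exercise outside of an ELT run.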

The model can define the existence of an arbitrary number of extra parameters that can also be passed in during the invocation of the JavaScript or Python file. During model setup, users can define the type of the parameter to be string, number, or "options." For options, they can supply a comma-delimited list of allowed options.
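
Inside the bespoke code, the extra parameters need to be converted from raw strings into the declared types. The sketch below assumes the extra parameters arrive as positional string arguments after the two filenames, in the order they were defined; that ordering, the definition format, and the parameter names are all hypothetical.

```python
def parse_extra_parameters(raw_values, definitions):
    """Convert raw string arguments into typed values.

    `definitions` is a list of dicts such as
    {"name": "mode", "type": "options", "options": "fast, accurate"}.
    """
    if len(raw_values) != len(definitions):
        raise ValueError(
            f"expected {len(definitions)} extra parameters, got {len(raw_values)}"
        )

    parsed = {}
    for raw, definition in zip(raw_values, definitions):
        kind = definition["type"]
        if kind == "number":
            value = float(raw)
        elif kind == "options":
            # The allowed options are supplied as a comma-delimited list.
            allowed = [option.strip() for option in definition["options"].split(",")]
            if raw not in allowed:
                raise ValueError(f"{definition['name']} must be one of {allowed}")
            value = raw
        elif kind == "string":
            value = raw
        else:
            raise ValueError(f"unknown parameter type: {kind}")
        parsed[definition["name"]] = value
    return parsed


definitions = [
    {"name": "alpha", "type": "number"},
    {"name": "mode", "type": "options", "options": "fast, accurate"},
    {"name": "label", "type": "string"},
]
params = parse_extra_parameters(["0.25", "fast", "cohort_a"], definitions)
```

In a script following the two-filename convention, `raw_values` would be `sys.argv[3:]` under the positional assumption stated above.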

During model setup, the expected fields of the incoming data frame must also be defined, as they are used in the bespoke code.

Autopilot Models

Users in AWS deployments can create SageMaker Autopilot models. To create an Autopilot model, users pick an existing object for training and validation. They also select the relevant independent variable fields, as well as the target field. Users must also select a problem type, such as regression, and an objective, such as MSE, from the available options, depending on the nature of the target field to be predicted. Users can then start training the model. During training, AWS SageMaker will test 10 different models against the training data and will select the model that performs best on the chosen objective. Full details about model performance and validation can be found in SageMaker Studio.

Once an Autopilot model is trained, it can be hooked up to a Bespoke Model object and run per the normal conventions of those objects. To unlock this feature in client deployments, two new environment variables must be added to the Ursa Studio Fargate task definition: CLIENT_TAG and SAGEMAKER_IAM_ROLE_ARN. Some SageMaker resources must also be spun up, for which we can provide a CDK script.
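
When troubleshooting a deployment, a quick check that both variables are visible to the running process can help. The helper below is a hypothetical sketch and not part of Ursa Studio. It only reports which of the two named variables are unset or empty.

```python
import os

REQUIRED_VARIABLES = ("CLIENT_TAG", "SAGEMAKER_IAM_ROLE_ARN")


def missing_autopilot_settings(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


# The values here are placeholders for illustration only.
example = {"CLIENT_TAG": "example-client"}
print(missing_autopilot_settings(example))
```

Calling the function with no argument inspects the real process environment instead of the example dictionary.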

Applying Models

Once profiled, any non-bespoke model can be added to a measure or an object. The independent variables must match up to measure fields. Then, upon instantiation, the model is generated, and its output is added as another field in the measure or object, where it acts like any other field.

Once created, a bespoke model can be attached to any number of Bespoke Model objects. During this attachment, users can enter the extra parameters as defined in the model. Users also select the upstream object that will generate the incoming data frame. The column names of the upstream object are likely to differ from the expected fields in the JavaScript or Python code, so users map the upstream object fields to the expected data frame fields. Not all fields are required to be mapped, but at least one must be. Once a Bespoke Model object is created, it can be run in an ELT like any other object.
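
The field mapping step can be pictured as a rename from upstream column names to the names the bespoke code expects. The sketch below is an illustration under assumptions: the mapping format, the column names, and the choice to omit unmapped expected fields from the result are hypothetical, not Ursa Studio's actual behavior.

```python
def apply_field_mapping(rows, mapping, expected_fields):
    """Rename upstream columns to the expected data frame fields.

    `mapping` goes from expected field name to upstream column name.
    """
    if not mapping:
        raise ValueError("at least one field must be mapped")
    unknown = set(mapping) - set(expected_fields)
    if unknown:
        raise ValueError(f"not expected by the model: {sorted(unknown)}")
    # Assumption: expected fields without a mapping are simply left out.
    return [
        {expected: row[upstream] for expected, upstream in mapping.items()}
        for row in rows
    ]


upstream_rows = [{"pat_id": "A1", "age_years": 42, "zip": "02139"}]
mapped = apply_field_mapping(
    upstream_rows,
    mapping={"patient_id": "pat_id", "age": "age_years"},
    expected_fields=["patient_id", "age", "risk_group"],
)
```

In this example, `risk_group` is expected by the model but left unmapped, which is permitted because at least one other field is mapped.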