Table A3 Qualitative rubric for labeling the Expert Annotated Dataset (Sect. 3.3), which is used for final model evaluation

From: CORAL: COde RepresentAtion learning with weakly-supervised transformers for analyzing data analysis

Columns: Stage, Definition, When to Use, When Not to Use, Example

Stage: Import
Definition: These cells are used primarily to import libraries into the Python environment. Although they may serve other functions, like defining constants or initializing helper objects, the majority of the code in these cells sets up analytical tools for use later in the notebook.
When to Use: Loading libraries, defining constants, initializing environments, connecting to databases
When Not to Use: A cell has one or more import statements, but most of the cell serves another purpose
Example: figure h

Stage: Wrangle
Definition: Wrangle cells clean, filter, summarize, and/or integrate data. These cells often permute data for use in later cells.
When to Use: Cleaning data, feature processing, data transformations, augmenting an existing dataset, loading and/or saving data, splitting data into train and test sets
When Not to Use: Transformations are applied, but the result is simply examined (See: Explore)
Example: figure i

Stage: Explore
Definition: Interactive explorations of data. These cells tend to yield a result that informs later decisions or enables the user to draw new conclusions. Explore cells may also transform data, but only for the purpose of exploring relationships and not for further in-depth analysis.
When to Use: Rendering DataFrames, visualizing relationships, printing summaries of data, calculating simple statistics, examining the output of functions
When Not to Use: Visualizations are used to evaluate the performance of a model (See: Evaluate)
Example: figure j

Stage: Model
Definition: Define and fit models of relationships to data. These cells may include some data transformations, but the primary purpose is to create a model to describe or predict some facet of the dataset.
When to Use: Statistical modeling, fitting and/or specifying machine learning models, simulation, defining loss functions
When Not to Use: Significance testing and calculating feature importance (See: Evaluate)
Example: figure k

Stage: Evaluate
Definition: Measure the explanatory power or predictive accuracy of a model using appropriate statistical techniques. These cells sometimes employ visualizations to explore analytical results (e.g. plotting regression residuals).
When to Use: Cross validation, significance testing, inspecting model output, plotting feature significance
When Not to Use: If a cell both evaluates and defines a machine learning model (a common pattern), default to “Model”
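
The cross-references in the "When Not to Use" column can be read as a small decision procedure. The following is a minimal sketch of that reading, not the annotators' actual protocol. The `CellObservations` flags, the `label_cell` function, and the order in which the checks run are all hypothetical and introduced only for illustration. The rubric itself states only the pairwise redirections (for example, Evaluate plus Model defaults to "Model"); it gives no full precedence order and no fallback stage.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    IMPORT = "Import"
    WRANGLE = "Wrangle"
    EXPLORE = "Explore"
    MODEL = "Model"
    EVALUATE = "Evaluate"


@dataclass
class CellObservations:
    """Hypothetical yes/no judgments an annotator might record for one cell."""
    mostly_imports: bool = False         # most of the cell sets up libraries
    transforms_data: bool = False        # cleans, filters, or reshapes data
    result_only_examined: bool = False   # transformed data is only displayed
    defines_or_fits_model: bool = False  # specifies or fits a model
    evaluates_model: bool = False        # measures model performance


def label_cell(obs: CellObservations) -> Stage:
    """Map observations to a stage (check order is an assumption)."""
    # Rubric: a cell that both defines and evaluates a model is "Model".
    if obs.defines_or_fits_model:
        return Stage.MODEL
    # Rubric: significance testing and model-performance plots are "Evaluate".
    if obs.evaluates_model:
        return Stage.EVALUATE
    # Rubric: "Import" only when imports make up most of the cell.
    if obs.mostly_imports:
        return Stage.IMPORT
    # Rubric: transformations that are merely examined belong to "Explore".
    if obs.transforms_data and not obs.result_only_examined:
        return Stage.WRANGLE
    # Assumption: anything left over is treated as exploration.
    return Stage.EXPLORE


both = CellObservations(defines_or_fits_model=True, evaluates_model=True)
print(label_cell(both).value)
```

The last two lines exercise the tie-break from the Evaluate row: a cell that both fits and scores a model is labeled with the Model stage.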