CORAL: COde RepresentAtion learning with weakly-supervised transformers for analyzing data analysis

Large-scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, as descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task for labeling computational notebook cells as stages in the data analysis process from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily-available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest study of scientific code to date and find that notebooks which devote a higher fraction of code to the typically labor-intensive process of wrangling data in expectation exhibit decreased citation counts for corresponding papers. We also show significant differences between academic and non-academic notebooks, including that academic notebooks devote substantially more code to wrangling and exploring data, and less to modeling.


I. INTRODUCTION
Data analysis is central to the scientific process. Increasingly, analytical results are derived from code, often in the form of computational notebooks, such as Jupyter notebooks [1]. Analytical code is becoming more frequently published in order to improve replication and transparency [2], [3], [4]. However, as of yet no tools exist to study unlabeled source code both at scale and in depth. Previous in-depth analyses of scientific code heavily rely on expert annotations, limiting the scale of these studies to the order of a hundred examples [5], [6]. Large-scale studies across thousands of examples have been limited to simple summaries such as the number or nature of imported libraries, total line counts, or the fraction of lines that are used for comments [6], [7], [8]. The software engineering community has emphasized the inadequacy of these analyses, noting that "there is a strong need to programmatically analyze Jupyter notebooks" [9], while HCI researchers have observed that studying the data science process through notebooks may play a role in addressing the scientific reproducibility crisis [5], [10].
§ These authors contributed equally to this work.
Automated annotation tools could enable researchers to answer important questions about the scientific process across millions of code artifacts. Do analysts share common sequential patterns or processes in their code? Do different scientific domains have different standards or best practices for data analysis? How does the content of scientific code relate to the impact of corresponding publications? To draw insights on the data science process, previous work has conceptualized the analysis pipeline as a sequence of discrete stages starting from importing libraries and wrangling data to evaluation [11], [12], [13]. Building on this conceptual model, our goal is to develop a tool that can automatically annotate code blocks with the analysis stage they support, enabling large-scale studies of scientific data analysis to answer the questions above.
Analyzing scientific code is particularly difficult because, as a "means to an end" [14], scientific code is often messy and poorly documented. Researchers engage in an iterative process as they transition between tasks and update their code to reflect new insights [15], [16]. As such, a computational notebook may interleave snippets for importing libraries, wrangling data, exploring patterns, building statistical models, and evaluating analytical results, thereby building a complex and frequently non-linear sequence of tasks [5], [11]. While some analysts use markdown annotations, READMEs, or code comments to express the intended purpose of their code, these pieces of documentation are often sparse and rarely document the full analysis pipeline [6]. Domain-specific best practices, techniques, and libraries may additionally obfuscate the intent of any particular code snippet. As a result, interpreting scientific code typically requires significant expertise and effort, making it prohibitively expensive to obtain ground truth labels on a large corpus, and therefore infeasible to build annotation tools which require anything more than minimal supervision.
In this paper, we present COde RepresentAtion Learning with weakly-supervised transformers (CORAL) to classify scientific code into stages in the data analysis process. Importantly, the model requires only easily available weak supervision in the form of five simple heuristics, and does not rely on any manual annotations. We show that CORAL learns new relationships beyond the information provided by these heuristics, indicating that currently popular transformer architectures [17] can be extended to weakly supervised tasks with the addition of a small amount of expert guidance. Our model achieves high agreement with human expert annotators and can be scaled to analyze millions of code artifacts, uniquely enabling large-scale studies of scientific data analysis.
We describe a new task for classifying code snippets as stages in the data analysis process (§III-A). We provide an extension to a corpus of 1.23M Jupyter Notebooks (§III-B): a new dataset of expert annotations of stages in the data analysis process for 1,840 code cells in 100 notebooks, which we use exclusively for evaluation and not for training (§III-C).
Next we describe CORAL (§IV): a novel graph neural network model for embedding data science code snippets and classifying them as stages in the data science process. To capture semantic clues about the analyst's intention, CORAL uses a novel masked attention mechanism to jointly model natural language context (such as markdown comments) with structured source code (§IV-B). We implement a weakly supervised architecture with five simple heuristics to compensate for the absence of labels, as labeling code requires domain expertise and is therefore expensive and infeasible at massive scale (§IV-C). To further compensate for limited labels, CORAL combines this weak supervision with unsupervised topic modeling into a multi-task optimization objective (§IV-E).
We evaluate our model (§V) by comparing it to baselines including expert heuristics, weakly supervised LDA, and state-of-the-art neural representation techniques (§V-A). We demonstrate that CORAL, using both code and surrounding natural language annotations, outperforms expert heuristics by 36% and significantly outperforms all other baselines. Through an ablation study we demonstrate that increased maximum sequence length M, weak supervision, and unsupervised topic modeling all strictly improve performance, and that including markdown improves performance on cells without associated markdown by 13% (§V-B). Further, we explore the impact of maximum input size and dataset size on our model's performance (§V-C), showing that CORAL significantly outperforms all baselines even when trained on only 1k examples. In a comprehensive error analysis, we demonstrate that previously unseen data science functions are correctly labeled with appropriate analysis stages (§V-D).
We then deploy our model to resolve previously unanswered questions about data analysis by linking academic notebooks and associated publications to conduct the largest ever study of scientific code (§VI). We find that (1) there are significant differences between academic and non-academic notebooks, (2) papers which include references to notebooks receive on average 22 times the number of citations as papers that do not, and (3) papers linked to notebooks that more evenly capture the full data science process in expectation receive twice the number of citations for every one standard deviation increase in entropy between stages.
In summary, the contributions of this paper are:
• A new task and public dataset for classifying Jupyter cells as stages in the data science process (§III).
• A multi-task, weakly supervised transformer architecture for classifying code snippets which jointly models natural language and code (§IV).
• A comprehensive evaluation of code representation learning methods (§V).
• The largest ever study of scientific code (§VI).
We make all code and data used in this work publicly available at http://bdata.cs.washington.edu/coral/.
Due to the scarcity of labeled examples, most previous work learned code representations without supervision [30], [31], [32], [33], [22]. The learned representations were mostly used for hole completion tasks, including the prediction of self-defined function names [30], API calls [32], [33], and variable names [31], [22]. In contrast, our task, classifying code cells as analysis stages, arguably requires a higher level understanding of the intention of code. To overcome the bottleneck of manual labeling, we turn to weak supervision. Snorkel [34] combined labels from multiple weak supervision sources, denoised them, and used the resulting probabilistic labels to train discriminative models. Building on this idea, we introduce weak supervision to code representation learning by leveraging a small number of expert-supplied heuristics.
Fig. 1. Examples of our proposed task of automatically labeling code snippets and accompanying natural language annotations as stages in the data science process, with code in blue and markdown in yellow. (Panels: Example Notebook Cells; Stages to Predict.)

C. Studies of Data Analysis Practices
There is significant existing research on understanding data analysis practices (e.g., [11], [13], [10], [15], [5]), mostly using qualitative methods to elicit experiences from analysts. Some interviews focused specifically on Jupyter notebook users [10], [6]. Despite synthesizing rich observations, interview studies were limited to dozens of participants. A few studies conducted large-scale analysis of Jupyter notebooks, but were limited to simple summary statistics [6], a single library [7], or code quality [8]. Our model enables the analysis of data science both at scale and in depth, which may validate and complement findings from previous qualitative studies.

D. The Data Science Process
A related branch of work [11], [13], [12] modeled the data analysis process as a sequence of iteratively visited stages. Other authors have noted that a better understanding of this process could improve scientific reproducibility [5], aid in the development of new analysis tools [15], [10], and identify common points of failure [49].

III. PREDICTION TASK & DATASETS
We present a new task for labeling code snippets as stages in the data science process (Figure 1), identify a corpus of computational notebooks for large-scale training, and provide a new dataset of expert annotations that are used exclusively in the final evaluation.

A. Prediction Task
In order to automatically learn useful data science constructs from code, we propose a new task and accompanying dataset for classifying code snippets as stages in the data science process. Figure 1 shows five mock examples from this task. We task models with associating a snippet with one of five labels, which are drawn from and motivated by previous work: IMPORT, WRANGLE, EXPLORE, MODEL, and EVALUATE (§II-D). IMPORT cells primarily load external libraries and set environment variables, while WRANGLE cells load data and perform simple transformations. EXPLORE cells are used to visualize data or calculate simple statistics. MODEL cells define and fit statistical models to the data, and finally EVALUATE cells measure the explanatory power and/or significance of models. Additional details on these stages are available in Appendix Table VI.

B. Jupyter Notebook Corpus for Training
We curate a training set for this task by building upon the UCSD Jupyter notebook corpus, which contains all 1.23M publicly available Jupyter notebooks on GitHub [6]. Jupyter is the most popular IDE among data scientists, with more than 8M users [50], [51], at least in part because it enables users to combine code with informative natural language markdown documentation. As noted by the corpus' authors, the dataset contains many examples of the myriad uses for notebooks, including completing homework assignments, demonstrating concepts, training lab members, and more [6]. For the purposes of this paper we filtered the corpus to those notebooks that transform, model, or otherwise manipulate data by limiting our analysis to notebooks that import pandas, statsmodels, gensim, keras, scikit-learn, xgboost or scipy. This leaves us with a total of 118k Jupyter notebooks, which we randomly split into training (90%) and validation sets (10%). These notebooks are not annotated with any ground truth labels of data science stages. Thus, we propose a combination of unsupervised representation learning and weak supervision to study them at scale (§IV).
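The library-based filter described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `imported_modules` and `is_data_notebook` are hypothetical, the notebook is assumed to be in the standard `.ipynb` JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`), and imports are detected with Python's `ast` module, so cells that do not parse (for example, cells containing IPython magics) are skipped.

```python
import ast
import json

# Import names for the libraries listed in the text.
# Note that scikit-learn is imported under the module name `sklearn`.
DATA_LIBRARIES = {"pandas", "statsmodels", "gensim", "keras", "sklearn", "xgboost", "scipy"}


def imported_modules(source):
    """Return the set of top-level module names imported by a code string."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Cells with IPython magics or legacy syntax may not parse; skip them.
        return set()
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def is_data_notebook(notebook_json):
    """True if any code cell imports at least one of DATA_LIBRARIES."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a string or a list of lines
            source = "".join(source)
        if imported_modules(source) & DATA_LIBRARIES:
            return True
    return False
```

Matching on parsed imports rather than raw substrings avoids false positives from library names that only appear in comments or strings, at the cost of missing cells that fail to parse.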

C. Expert Annotated Notebooks (Only Used for Evaluation)
Annotation. We randomly sampled 100 notebooks containing 1,840 individual cells from the filtered dataset for hand-labeling. The first two authors, who have significant familiarity with the Python data science ecosystem, independently annotated the cells with one of the five data science stages. The annotators performed a preliminary round of coding, discussed their results, and produced a standardized rubric for qualitative coding, which is available in Appendix D (Table VI). The rubric clearly defines each data analysis stage and provides guidelines for when a label should and should not be used. Using this rubric, the annotators each made a second independent coding pass. We evaluate inter-rater reliability with Cohen's kappa statistic, which corrects for agreement by chance, and find the highest level of correspondence ("substantial agreement", κ = 0.803) [52]. Finally, the annotators resolved the remaining differences in their labels by discussing each disagreement, producing a final dataset of 1,840 cells for model evaluation (§V). Our annotation rubric along with all data and code are available at http://bdata.cs.washington.edu/coral/. Importantly, these expert annotations are never used in training or validation, including model selection, but only for the final evaluation (§V).
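For readers unfamiliar with Cohen's kappa, the following minimal sketch shows how the statistic corrects raw agreement for agreement expected by chance. The function name `cohens_kappa` and the toy label sequences are illustrative only; they are not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: the annotators disagree on one of six cells.
annotator_1 = ["IMPORT", "WRANGLE", "WRANGLE", "EXPLORE", "MODEL", "EVALUATE"]
annotator_2 = ["IMPORT", "WRANGLE", "EXPLORE", "EXPLORE", "MODEL", "EVALUATE"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In the toy example the observed agreement is 5/6 and the chance agreement is 7/36, so kappa is 23/29, somewhat lower than the raw agreement rate.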
Multi-Class vs. Multi-Label. Both annotators paid close attention to potentially ambiguous cells while labeling, observing that it was quite rare for a single cell to be used for multiple stages of the data science process (less than 5% of the time). Furthermore, the median cell in the dataset had two lines of code, making it difficult for a cell to sufficiently express more than one stage. Low label ambiguity at the cell level and high inter-rater reliability support the formulation of this task as multi-class (i.e., five mutually exclusive labels) rather than multi-label (i.e., a cell may have one or more labels), and the selection of cells as the unit of analysis.

Fig. 2. An overview of the architecture of our CORAL model, which combines weak supervision and unsupervised topic modeling into a multitask objective. For visual clarity, we only show edges from the AST here. In practice, we also use connections between [CLS] and all the other nodes, and between each AST node and markdown node (see Section IV-A).

IV. THE CORAL MODEL
COde RepresentAtion Learning with weakly-supervised transformers (CORAL) is a model for learning neural representations of data science code snippets and classifying them as stages in the data analysis process. CORAL leverages both source code abstract syntax trees (ASTs) and associated natural language annotations in markdown text (see Fig. 2).

Model Contributions. CORAL contributes the following:
• CORAL jointly learns from code and surrounding natural language (§IV-A), while preserving meaningful code structure through a graph-based masked attention mechanism (§IV-B). We show that adding natural language improves performance by 13% on snippets that do not have associated markdown comments (§V-B).
• We address the lack of high-quality training data through an easily extensible weakly supervised objective based on five simple heuristics (§IV-C).
• CORAL combines this weak supervision with an additional unsupervised training objective (again to avoid costly ground truth labels) based on topic modeling, which we combine with other objectives in a multi-task learning framework (§IV-E).

A. Input Representations
CORAL builds on graph neural networks [53] and masked-attention approaches [48] to encode the AST's graph structure by first serializing the tree and then using its adjacency matrix as an attention mask (§IV-B).
We add additional nodes to the AST to capture surrounding natural language. For each code cell, we concatenate its most recent prior markdown as a token sequence to the AST graph sequence (yellow in Figure 2), so long as the markdown is no more than three cells away. Concretely, we create a node for each markdown token and then connect each markdown node with each AST node. Finally, we add a virtual node [CLS] (for classification) at the head of every input sequence and connect all the other nodes to it. Similar to BERT, we take this node's embedding as the representation of the cell [54].
Notation. Formally, let $V = \{u, v, \ldots\}$ be the set of nodes in the input, where each node $v$ is either an AST node or markdown token. For any input sequence that has more than $M$ nodes, we truncate it and keep only the first $M$ nodes (a modeling choice which we evaluate in §V-C). We use $A$ to represent the graph adjacency matrix that encodes the relationship between nodes as described above. All input nodes are converted to embedding vectors of dimension $d_{model}$. We assemble these embeddings into a matrix $X$.
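To make the input construction concrete, the sketch below builds a node sequence and adjacency matrix for a single cell using Python's built-in `ast` module. It is a simplified illustration, not the authors' implementation: the function name `build_cell_graph` is hypothetical, each AST node is represented only by its node type name, the tree is serialized in depth-first pre-order, edges are treated as undirected, and the markdown is assumed to be tokenized by the caller.

```python
import ast


def build_cell_graph(code, markdown_tokens, max_nodes):
    """Return (labels, adjacency) for one cell: [CLS], AST nodes, then markdown tokens."""
    labels = ["[CLS]"]
    tree_edges = []  # (parent_index, child_index) pairs between AST nodes

    def visit(node, parent_index):
        index = len(labels)
        labels.append(type(node).__name__)  # assumption: node type name as the token
        if parent_index is not None:
            tree_edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(code), None)
    ast_indices = range(1, len(labels))
    markdown_start = len(labels)
    labels.extend(markdown_tokens)

    n = len(labels)
    adjacency = [[0] * n for _ in range(n)]

    def connect(i, j):
        adjacency[i][j] = adjacency[j][i] = 1

    for parent, child in tree_edges:       # AST parent-child edges
        connect(parent, child)
    for m in range(markdown_start, n):     # every markdown token to every AST node
        for a in ast_indices:
            connect(m, a)
    for i in range(1, n):                  # [CLS] to all other nodes
        connect(0, i)

    # Keep only the first max_nodes nodes (M in the text).
    labels = labels[:max_nodes]
    adjacency = [row[:max_nodes] for row in adjacency[:max_nodes]]
    return labels, adjacency


labels, adjacency = build_cell_graph("x = 1", ["load", "data"], max_nodes=160)
```

Because markdown tokens are appended after the serialized AST in this sketch, truncation to the first M nodes removes markdown before it removes code; whether the original implementation orders nodes the same way is an assumption.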

B. Encoding Code Cells with Attention
We extend the popular BERT model [54] by adding masked multi-head attention to capture the graphical structure of ASTs. We evaluate the impact of this addition in §V-A.
CORAL feeds the input code and natural language representations to an encoder, which is composed of a stack of N = 4 identical layers (Fig. 2). Similar to Transformers [17], we equip each layer with a multi-head self-attention sublayer and a feed-forward sublayer. The graph structure is captured through masked attention (Eq. 2 below).
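Masked Multi-Head Attention. We use $\mathrm{Aggregate}_i^k$ to represent the self-attention function of head $i$ in layer $k$. Let $(q, k, v)$ be the query, key, and value decomposition of the input to $\mathrm{Aggregate}_i^k$. Queries and keys are vectors of dimension $d_k$, and values are vectors of dimension $d_v$. For a given node $u$, let $(q_u, k_u, v_u)$ be the triple of query, key and value, and let $N(u)$ be the set of all its neighbours. Formally, the parameters $q_u, k_u, v_u$ vary across each head $i$ and layer $k$, but we drop additional notation for simplicity here. Then we compute aggregate results as:
$$\mathrm{Aggregate}(u) = \sum_{w \in N(u) \cup \{u\}} \frac{\exp\left(q_u \cdot k_w / \sqrt{d_k}\right)}{\sum_{w' \in N(u) \cup \{u\}} \exp\left(q_u \cdot k_{w'} / \sqrt{d_k}\right)} \, v_w$$
We adopt the scaling factor $1/\sqrt{d_k}$ from Vaswani et al. [17] to mitigate the dot product's growth in magnitude with $d_k$. In practice, the queries, keys, and values are assembled into matrices $Q$, $K$, $V$. We compute the output in matrix form as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\mathrm{mask}\left(\frac{QK^\top}{\sqrt{d_k}}, \tilde{A}\right)\right) V \tag{2}$$
where $\tilde{A} = A + I$ is the adjacency matrix with self-loops added to implement the masked attention approach, where each node only attends to its neighbours (described in §IV-A) and itself; $\mathrm{mask}(\cdot, \tilde{A})$ excludes from the row-wise softmax every entry whose corresponding entry in $\tilde{A}$ is zero.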
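The masked attention step for a single head can be illustrated with a small pure-Python sketch. It assumes the behaviour stated in the text (each node attends only to its neighbours and itself, with scores scaled by $1/\sqrt{d_k}$); the function name `masked_attention` and the toy matrices are hypothetical, and plain lists are used instead of a tensor library only to keep the example dependency-free.

```python
import math


def masked_attention(Q, K, V, A):
    """Single-head attention where node u attends only to itself and nodes v with A[u][v] != 0."""
    n, d_k = len(Q), len(Q[0])
    d_v = len(V[0])
    output = []
    for u in range(n):
        # Self-loop plus graph neighbours, i.e. the non-zero entries of A + I in row u.
        allowed = [v for v in range(n) if v == u or A[u][v]]
        scores = [sum(q * k for q, k in zip(Q[u], K[v])) / math.sqrt(d_k) for v in allowed]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append([
            sum(w / total * V[v][j] for w, v in zip(weights, allowed))
            for j in range(d_v)
        ])
    return output


# Toy graph: nodes 0 and 1 are connected, node 2 is isolated.
A = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
Q = [[0.0, 0.0]] * 3            # zero queries give uniform weights over allowed nodes
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [3.0, 0.0], [5.0, 7.0]]
out = masked_attention(Q, K, V, A)
```

With zero queries, nodes 0 and 1 each receive the average of their two value vectors, while the isolated node 2 simply returns its own value, which shows that the mask restricts information flow to graph edges.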
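Since we adopt multi-head attention, we concatenate $h$ heads within the same layer:
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) \, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$ are projection matrices that map the node embeddings $X$ to queries, keys, values, and multi-head output, respectively.
Feed Forward. In each layer, we additionally apply a fully connected feed-forward sublayer. This is composed of two linear transformations with ReLU activation in between:
$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1) \, W_2 + b_2$$
where $W_1, W_2$ are weight matrices and $b_1, b_2$ are bias vectors.
Add & Norm. Each sublayer is followed by layer normalization [55]. The output of each sublayer is:
$$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
where $\mathrm{Sublayer}(x)$ is multi-head attention or feed forward.
Output. The multi-head attention sublayer and feed-forward sublayer are stacked and make up one "layer". After stacking this layer four times, the encoder's output contains representations of all the nodes in the input sequence. We take the embedding of the [CLS] node as the representation of each notebook cell's graph (Section IV-A), denoted as $z \in \mathbb{R}^{d_{model}}$. We compress this cell representation $z$ into a lower-dimensional distribution over $K$ "topics" to capture information about the data analysis stages. Concretely:
$$p_{topic} = \mathrm{softmax}(W_{topic} \, z + b) \tag{7}$$
where $W_{topic} \in \mathbb{R}^{K \times d_{model}}$ is the weight matrix and $b$ is the bias vector.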

C. Weak Supervision
It is prohibitively expensive to obtain manual annotations of data analysis stages at scale, as doing so would require thousands of person-hours of work by domain experts. Therefore, we use five simple heuristics to tailor CORAL to the prediction task described in §III-A:
1) We collect a set of seed functions and assign each to a corresponding stage based on its usage. Any cell that uses a seed is weakly labeled as the corresponding stage. For example, any cell that calls "sklearn.linear_model.LinearRegression" is weakly labeled MODEL. The full set of 39 seed functions is available in Appendix A. We demonstrate CORAL's ability to correctly classify unseen code outside these functions in §V-D.
2) A cell with one line of code that does not create a new variable is weakly labeled EXPLORE. This rule leverages a common pattern in Jupyter notebooks where users often use single line expressions to examine a variable, such as a DataFrame.
3) A cell with more than 30% import statements is labeled IMPORT.
4) A cell whose corresponding markdown is less than four words and contains any of {logistic regression, machine learning, random forest} is weakly labeled MODEL.
5) A cell whose corresponding markdown is less than four words and contains cross validation is weakly labeled EVALUATE.
Note that there may be conflicts between these rules. We observe that less than one percent of cells in our corpus comply with more than one of these heuristics, further supporting our decision to formulate labels as mutually exclusive. We resolve any such conflicts by assigning priority in the following order: IMPORT, MODEL, EVALUATE, EXPLORE, WRANGLE. In this layer, we aim to compute $p_{stage}$, a probability distribution over these six stages, from the topic distribution computed in Eq. 7.
We implement this by mapping the topic distribution $p_{topic}$ to a probability distribution $p_{stage}$ over the $n_{stages} = 6$ stages. We compute the stage distribution $p_{stage}$ as follows, where $W_{stage} \in \mathbb{R}^{K \times n_{stages}}$:
$$p_{stage} = \mathrm{softmax}(W_{stage}^\top \, p_{topic})$$
We adopt cross entropy loss to minimize classification error on weak labels. For each $p_{topic}$, loss is computed as:
$$L_{weak\ supervision} = -\sum_{s=1}^{n_{stages}} y_{o,s} \log(p_s) \tag{9}$$
where $y_{o,s}$ is a binary indicator (0 or 1) of whether stage label $s$ is the correct classification for observation $o$, and $p_s$ is the predicted probability $p_{stage}[s]$ of stage $s$.
The five weak supervision heuristics cover about 20% of notebook cells in the training data. To minimize the model's ambiguity on the remaining 80% of unlabeled data, and encourage it to choose a stage for each topic, we add an additional loss function. Concretely, we add an entropy term to $p_{stage}$ to encourage uniqueness by forcing the topic distribution to map to as few stages as possible:
$$L_{unique\ stage} = -\sum_{s=1}^{n_{stages}} p_s \log(p_s) \tag{10}$$
where $p_s$ is the predicted probability $p_{stage}[s]$ for stage $s$. This entropy objective is minimized when $p_s = 1$ for some $s$ and $p_{s'} = 0$ for all other $s'$.
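The five heuristics and their priority order can be sketched as a single labeling function. This is an illustrative approximation rather than the authors' code: the name `weak_label` is hypothetical, `SEED_FUNCTIONS` contains only the one seed named in the text (the full list of 39 is in Appendix A), a seed is matched by the final component of the called name, "creates a new variable" is approximated as "contains an assignment", and the 30% import threshold is computed over top-level statements.

```python
import ast

PRIORITY = ["IMPORT", "MODEL", "EVALUATE", "EXPLORE", "WRANGLE"]
SEED_FUNCTIONS = {"LinearRegression": "MODEL"}  # only the example given in the text
MODEL_PHRASES = ("logistic regression", "machine learning", "random forest")
EVALUATE_PHRASES = ("cross validation",)


def weak_label(code, markdown=""):
    """Return a weak stage label for a cell, or None if no heuristic applies."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None  # assumption: unparseable cells stay unlabeled
    candidates = set()

    # 1) Calls to seed functions, matched on the last component of the callee name.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if name in SEED_FUNCTIONS:
                candidates.add(SEED_FUNCTIONS[name])

    # 2) A single line of code without any assignment.
    code_lines = [ln for ln in code.splitlines() if ln.strip() and not ln.strip().startswith("#")]
    has_assignment = any(
        isinstance(n, (ast.Assign, ast.AugAssign, ast.AnnAssign)) for n in ast.walk(tree)
    )
    if len(code_lines) == 1 and not has_assignment:
        candidates.add("EXPLORE")

    # 3) More than 30% of top-level statements are imports.
    if tree.body:
        n_imports = sum(isinstance(s, (ast.Import, ast.ImportFrom)) for s in tree.body)
        if n_imports / len(tree.body) > 0.30:
            candidates.add("IMPORT")

    # 4) and 5) Short markdown containing a key phrase.
    text = markdown.lower()
    if len(text.split()) < 4:
        if any(phrase in text for phrase in MODEL_PHRASES):
            candidates.add("MODEL")
        if any(phrase in text for phrase in EVALUATE_PHRASES):
            candidates.add("EVALUATE")

    # Resolve conflicts with the stated priority order.
    for stage in PRIORITY:
        if stage in candidates:
            return stage
    return None
```

For example, `weak_label("import pandas as pd")` satisfies both the single-line rule and the import rule, and the priority order resolves it to IMPORT. Cells for which the function returns `None` correspond to the roughly 80% of training cells that receive no weak label.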

D. Unsupervised Learning Through Reconstruction
As the weak supervision heuristics only cover about 20% of the cells, we enrich the model with additional training through an unsupervised topic model. Here, the goal is to optimize the topic representation $p_{topic}$ such that we can reconstruct the intermediate cell representation $z$. We reconstruct $z$ from a linear combination of its topic embeddings $p_{topic}$:
$$r = R \, p_{topic}$$
where $R \in \mathbb{R}^{d_{model} \times K}$ is the learned cell embedding reconstruction matrix. This unsupervised topic model is trained to minimize the reconstruction error. We adopt the contrastive max-margin objective function using a hinge loss formulation [56], [57], [58]. Thus, in the training process, for each cell, we randomly sample $m = 5$ cells from our dataset as negative samples:
$$L_{unsupervised} = \sum_{c \in D} \sum_{i=1}^{m} \max(0, 1 - r_c \cdot z_c + r_c \cdot n_i) \tag{12}$$
where $D$ is the training data set, $r_c$ is the reconstructed vector of cell $c$, $z_c$ is the intermediate representation of cell $c$, and $n_i$ is the reconstructed vector of each negative sample. This objective function seeks to minimize the inner product between $r_c$ and $n_i$, while simultaneously maximizing the inner product between $r_c$ and $z_c$. We also employ a regularization term from He et al. [59] to promote the uniqueness of each topic embedding in $R$:
$$L_{unique\ topic} = \lVert R_{norm} R_{norm}^\top - I \rVert \tag{13}$$
where $I$ is the identity matrix and $R_{norm}$ is the result of L2 row-normalization of $R$. This objective function reaches its minimum when the inner product of two topic embeddings is 0. We demonstrate in §V-B that this additional unsupervised training improves overall classification performance.
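The reconstruction, contrastive hinge loss, and topic-uniqueness regularizer can be illustrated for a single cell as follows. This is a sketch under assumptions that the text does not confirm: the margin is taken to be 1 (as in the max-margin formulation of the cited work), the norm in the regularizer is taken to be the Frobenius norm, and topic embeddings are passed as rows (one list per topic). All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruct(R, p_topic):
    """r = R p_topic, with R given as d_model rows of K entries each."""
    return [dot(row, p_topic) for row in R]


def hinge_loss(r_c, z_c, negatives, margin=1.0):
    """Contrastive max-margin loss for one cell against its negative samples."""
    return sum(max(0.0, margin - dot(r_c, z_c) + dot(r_c, n_i)) for n_i in negatives)


def topic_uniqueness_penalty(topic_embeddings):
    """Frobenius norm of (T T^T - I) after L2-normalizing each topic embedding (row of T)."""
    normed = [[x / math.sqrt(dot(t, t)) for x in t] for t in topic_embeddings]
    total = 0.0
    for i, t_i in enumerate(normed):
        for j, t_j in enumerate(normed):
            gap = dot(t_i, t_j) - (1.0 if i == j else 0.0)
            total += gap * gap
    return math.sqrt(total)
```

The hinge term is zero once the reconstruction is closer (by the margin) to the cell's own representation than to every negative sample, and the penalty is zero exactly when all topic embeddings are mutually orthogonal.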

E. Final Optimization Objective
We combine the loss functions of Equations (9), (10), (12), and (13) into the final optimization objective:
$$L = \lambda_1 L_{weak\ supervision} + \lambda_2 L_{unique\ stage} + \lambda_3 L_{unsupervised} + \lambda_4 L_{unique\ topic} \tag{14}$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are hyperparameters that control the weights of optimization objectives. We experiment with various training curricula and find that CORAL with the hyperparameters described in Appendix B achieves the best loss (Eq. 14) on the validation set. Importantly, this optimization and model training is based solely on the labels from weak supervision heuristics. We do not use expert annotations (§III-C), which we exclusively reserve for the final evaluation.
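As a rough illustration of how the supervised and entropy terms enter the combined objective, the sketch below computes the cross-entropy on a weak label, the entropy of a stage distribution, and a weighted sum of the four loss terms. The function names are hypothetical, and the default weights of 1.0 are placeholders: the actual hyperparameter values are given in Appendix B and are not reproduced here.

```python
import math


def cross_entropy(p_stage, weak_label_index):
    """Cross-entropy against a one-hot weak label; only defined for weakly labeled cells."""
    return -math.log(p_stage[weak_label_index])


def entropy(p_stage):
    """Entropy of the stage distribution; zero when all mass is on one stage."""
    return -sum(p * math.log(p) for p in p_stage if p > 0.0)


def combined_objective(l_weak, l_unique_stage, l_unsupervised, l_unique_topic,
                       lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms; the weights here are placeholder values."""
    terms = (l_weak, l_unique_stage, l_unsupervised, l_unique_topic)
    return sum(lam * term for lam, term in zip(lambdas, terms))
```

A uniform distribution over six stages has the maximum entropy, log 6, so the entropy term penalizes exactly the indecisive predictions that the unlabeled cells would otherwise permit.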

V. EVALUATION
CORAL achieves accuracy of over 72% on the stage classification task using an unseen test set (Section III-C), outperforming a range of baseline models and demonstrating that weak supervision, unsupervised topic modeling, and adding markdown information all strictly improve overall classification performance.

A. Baseline Comparison
In Figure 3(a) we compare CORAL's performance to eight baselines, which we describe below. Importantly, the lack of ground truth labels in our training set makes it impossible to evaluate a model that does not use some amount of weak supervision, as without these heuristics we cannot map between learned topics and data science stages.
Expert Heuristics (Weak Supervision Only). How well does a simple baseline perform that considers only library information? For example, pandas is commonly used to wrangle data, and scikit-learn is common in modeling. We compare against an improved version of this baseline, where we include all expert heuristics described in §IV-C. This set of heuristics considers function-level and markdown information in addition to library information. This is a natural comparison since this is the exact weak supervision used in CORAL. These heuristics cover only 20.38% of the test examples, so we choose one stage uniformly at random otherwise.
LDA Representation + Weak Supervision. How important is it to use a deep neural encoder for our task? To address this question, we replace CORAL's encoder with a Latent Dirichlet Allocation (LDA) [60] topic model, but use the same input data (§IV-A), and the same weak supervision (Section IV-C). Specifically, we optimize this model with $L_{weak\ supervision}$ (Eq. (9)) and $L_{unique\ stage}$ (Eq. (10)) on top of the unsupervised LDA representation. We first used the same number of LDA topics (50) as we use in CORAL (i.e., the size of the cell representation $p_{topic}$). However, this baseline only performed at the level of the Expert Heuristics Only baseline. In order to make this baseline stronger we doubled the number of LDA topics to 100, which did improve performance.
Neural Baselines. How well does a noncontextual neural model perform on our task? What are the benefits of using the graphical structure of ASTs instead of treating code cells as sequences of tokens in deep neural networks? How important is the multitask objective that combines weak supervision heuristics and an unsupervised topic model? To address these questions, we compare CORAL against the noncontextual Word2Vec [61] model and the state-of-the-art language model, BERT [54], which have both been previously applied to source code representation learning [62], [63]. We trained all neural baselines with both markdown and code using the same pretraining corpus as CORAL. To explore the sensitivity of these models to their input representations, we tried both treating the code as sequences of tokens and as serialized ASTs. For the BERT baselines we used the standard architecture with the same embedding size as CORAL and masked language model pretraining. Predictions are made with a single layer using the same weak supervision heuristics as CORAL. We evaluated BERT baselines both with and without ASTs and finetuning. When finetuning, we backpropagated the single layer's loss through the encoder. After pre-training, we optimized the model with $L_{weak\ supervision}$ (Eq. (9)) and $L_{unique\ stage}$ (Eq. (10)) on top of the learned representations of code cells.
(Table caption) Training with more weak supervision significantly improves performance.
Results. Results from these experiments are available in Figure 3(a). The Expert Heuristics baseline achieves 34.1% accuracy on the unseen expert annotations described in §III-C. Even though it uses the same amount of supervision, CORAL is 38% more accurate than this baseline, demonstrating that CORAL learns significantly more than simply memorizing the heuristic rules. CORAL also favorably compares to state-of-the-art neural language models, beating the highest performing BERT baseline by 4.3%. We observe that while popular deep learning techniques like finetuning produce only a marginal difference in model performance, CORAL significantly outperforms all other baselines (Wilcoxon signed rank, p < 0.001).

B. Ablation Study
We just demonstrated in §V-A that CORAL improves significantly over expert heuristics, representations that do not leverage graphical structure, and state-of-the-art neural models. Here we show that (1) adding markdown information, (2) weak supervision, and (3) additional unsupervised training all independently improve the performance of CORAL, as shown in Figure 3(b). Across all experiments we use a maximum sequence length of M = 160 and train on the maximum 1M code cells, based on the best performing model overall. Performance consistently increases with more training data but remains promising even with three orders of magnitude less training data.
CORAL without Markdown. For this ablation, we remove any markdown information from the input sequence, while keeping all other aspects of CORAL the same. We compare maximum sequence lengths of 80, 120 and 160 since the maximum sequence length M may interact with markdown information due to truncation (§IV-A). We find that including markdown information consistently and significantly improves performance, by 12% at M = 160, even though less than 9% of cells are directly preceded by markdown (Table I). Furthermore, these comparatively rare comments significantly improve performance even on cells that do not have corresponding markdown information, from 59.6% to 72.6%, suggesting that markdown cells help CORAL better represent source code independent of these comments.
CORAL with Less Weak Supervision. The weak supervision heuristics described in §IV-C cover about 20% of the training examples. We simulate lower coverage by randomly subsampling 50% and 25% of these weakly labeled examples (i.e., 10% and 5% of all examples). Higher weak supervision coverage dramatically increases performance, but even at 25% of examples CORAL still outperforms CORAL (No Masked Attention) by 10% and BERT by 15% (Table II).
CORAL without Unsupervised Topic Model. This baseline evaluates the marginal benefit of CORAL's unsupervised topic model. Specifically, we remove $L_{unsupervised}$ (Eq. 12) and $L_{unique\ topic}$ (Eq. 13) from CORAL but keep everything else the same. We show that the unsupervised training objective improves overall accuracy by 10% (Figure 3(b)). This demonstrates the significant potential of combining limited weak supervision with additional unsupervised training in a multitask framework.

C. Impact of Input Length & Training Set Size
Maximum Sequence Length. We investigate how model performance changes with the maximum input sequence length M (see Table I). For CORAL models with and without markdown, a larger maximum sequence length consistently  -C) we hypothesize that this confusion may be the result of the scikit-learn use pattern where users specify and evaluate their models in the same cell.
Example Predictions. We highlight three predictions in Figure 4 to demonstrate CORAL's ability to capture data analysis semantics and inherent ambiguity. In Figure 4(a), the user transforms a pandas DataFrame and calls pandas.DataFrame.groupby, a function typically used to aggregate data. While a naive method (e.g., the expert heuristic baseline in §V-A) might label the cell as WRANGLE, CORAL infers that the analyst's intention is to use this user-defined function to evaluate a classifier with a confusion matrix, likely making use of the information in the comment and function parameters, and appropriately labels the cell as EVALUATE.
In Figure 4(b), the analyst loads data, selects a subset, creates a plot, and fits a linear regression.CORAL correctly identifies this example as serving to both modify data and look for patterns, but assigns a higher probability to EXPLORE, demonstrating its ability to capture the significance of previously unseen statistical visualization methods like seaborn.regplot.
In Figure 4(c), the analyst calls a user-defined function. While CORAL has never seen this function or notebook, it still correctly identifies the intent of the cell as EXPLORE, likely by attending to tokens like "plot" and "breakdown".

VI. LARGE SCALE STUDIES OF SCIENTIFIC DATA ANALYSIS
Our model and datasets provide an opportunity to pose and answer previously unaddressable questions about the data analysis process, the role of scientific analysis in academic publishing, and differences between scientific domains. We note that our corpus (§III-B) is limited to the most recent (potentially partial) snapshot of the user's analysis and that the observational nature of this data prohibits any causal claims.
A. Are There Differences Between Academic Notebooks and Non-Academic Notebooks?
Differences between academic and non-academic notebooks could identify how practices vary across these communities.
Method. The Semantic Scholar Open Research Corpus (S2ORC) is a publicly available dataset containing 8.1M full-text academic articles [64]. In order to relate these papers to relevant source code, we performed a regular expression search across the corpus for any reference to a GitHub repository, yielding associations between 2.0k papers and 7.1k notebooks from the UCSD corpus. We use this dataset to resolve previously unanswerable questions about the role of analysis code in the scientific process. Although there is no strict guarantee that a linked notebook contains the data analysis that was used to create the paper, the median notebook is linked to exactly one paper, indicating some degree of injectivity from notebooks to papers. Furthermore, manual inspection of our dataset and prior work indicate that researchers often break their analysis up across many notebooks, which may explain why papers link to multiple notebooks. So as not to bias our analysis against how a scientist decides to structure their code, we compute statistics for each paper by concatenating all associated notebooks. We compute the fraction of code devoted to each data analysis stage and the fraction of cells that are followed by a cell of a different stage, and examine differences between academic and non-academic notebooks.
Results. Academic notebooks devote 56% more code to exploring data and 26% less code to developing models than non-academic notebooks (Figure 5(a)). Furthermore, we note that analysts on average use only 23% of their code for the traditionally boring and laborious process of wrangling data. While the relative size of the stage likely does not accurately reflect the relative effort of data wrangling, it is perhaps surprising that such a maligned stage of the process [11] is represented by a comparatively low fraction of all code. We also find significant differences in the fraction of cells that are followed by a cell of a different stage (Figure 5(b)).
Most interestingly, cells are in general more likely than not to transition to a different stage. This result supports the hypothesis that analysts move back and forth between stages of the data science process to complete an analysis, rather than dwelling on any particular stage.
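A regular expression search of the kind described above might look like the following sketch. The exact pattern used for the study is not given in the text, so the pattern, the helper name `find_github_repos`, and the example strings are all hypothetical.

```python
import re

# Assumed pattern: "github.com/<owner>/<repo>", with or without a URL scheme.
GITHUB_REPO = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9_.-]+)", re.IGNORECASE
)


def find_github_repos(text):
    """Return the set of (owner, repo) pairs referenced in `text`."""
    repos = set()
    for owner, name in GITHUB_REPO.findall(text):
        name = name.rstrip(".")       # drop a sentence-ending period
        if name.endswith(".git"):     # drop a clone-URL suffix
            name = name[:-4]
        if name:
            repos.add((owner.lower(), name.lower()))
    return repos


# Hypothetical article text, for illustration only.
article = (
    "Code is available at https://github.com/example-lab/demo-analysis. "
    "A mirror is hosted at github.com/Example-Lab/demo-analysis.git"
)
print(find_github_repos(article))
```

Normalizing case and stripping trailing punctuation lets several mentions of the same repository in one article collapse to a single association.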
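The two per-paper statistics used in this comparison can be sketched as below. Several details are assumptions: code volume is measured in lines per cell, the transition fraction is taken over cells that have a successor in the concatenated sequence, and the stage names IMPORT and MODEL and the toy notebooks are illustrative.

```python
from collections import Counter


def stage_fractions(cells):
    """Fraction of code lines devoted to each stage.

    `cells` is a list of (stage, n_lines) pairs in notebook order.
    """
    total = sum(n for _, n in cells)
    lines_per_stage = Counter()
    for stage, n in cells:
        lines_per_stage[stage] += n
    return {stage: count / total for stage, count in lines_per_stage.items()}


def transition_fraction(cells):
    """Fraction of cells followed by a cell of a different stage."""
    stages = [stage for stage, _ in cells]
    pairs = list(zip(stages, stages[1:]))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Two toy notebooks linked to the same paper, concatenated before analysis.
notebooks = [
    [("IMPORT", 2), ("WRANGLE", 6), ("EXPLORE", 4)],
    [("EXPLORE", 4), ("MODEL", 3), ("EVALUATE", 1)],
]
paper_cells = [cell for notebook in notebooks for cell in notebook]

print(stage_fractions(paper_cells))
print(transition_fraction(paper_cells))
```

Concatenation means that a paper whose analysis is split across many small notebooks is treated the same as one that keeps everything in a single notebook.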

B. Is the Content of Notebooks Related to the Impact of Associated Publications?
Evidence of a relationship between scientific notebooks and publication impact may encourage researchers to publish their code, and could reveal differences between the priorities placed on scientific data analysis by different domains.
Method. We employ a negative binomial regression to estimate the impact of notebook stage distribution on the number of citations their associated papers receive. We hypothesize that notebooks which evenly and comprehensively document their analysis (rather than focusing on just one part) may receive more citations. In our first regression R1, we therefore regress citation count on the Stage Entropy $= -\sum_k p_k \log p_k$, where $p_k$ is the fraction of the notebook that is devoted to stage $k$. This captures the uniformity of the distribution of stages across a paper's associated notebooks. Here, we normalized this quantity across all publications by taking the Z-score. We controlled for a paper's year of publication and domain. To reveal differences between disciplines, we build upon this experiment with a second regression R2, which includes all terms from R1 except for the entropy term, but adds interaction variables between the Z-scores of the fraction of each paper's notebook devoted to each data analysis stage and paper domains. Additional details for these regression models are available in Appendix F.
Results. We find that papers that link to notebooks have $10^{\beta_{\text{hasNotebook}}} = 10^{1.34} \approx 21.88$ times more citations than papers that do not reference a notebook (95% CI: [1.29, 1.41], p < 0.001). From R1 we note that Stage Entropy is strongly related to the number of citations a publication receives: publications can expect a $10^{\beta_{\text{stageEntropyZ}}} = 10^{0.33} \approx 2.11$ times increase in citations for each standard deviation of entropy above the mean (95% CI: [0.26, 0.39], p < 0.001). This result suggests that researchers may value notebooks which evenly document the whole data science process, rather than highlighting just one part of analysis. These results also indicate that a notebook with one standard deviation more than the average EXPLORE code would expect $10^{\beta_{\text{EXPLORE}}} = 10^{-0.4325} \approx 0.35$ times the citations in its associated paper compared with a notebook with an average quantity of all stages (95% CI: [-0.64, -0.22], p < 0.001). One possible explanation for this effect is that notebooks which feature a high volume of code for exploring data are associated with generating hypotheses, and may therefore be associated with incomplete or exploratory publications that are less likely to attract references. The results from R2 (Figure 6) indicate significant differences between domains. Most notably, we find that in computer science and mathematics an increase in the portion of code devoted to wrangling data is associated with a lower citation count in expectation, while no such interaction is present for papers from biological sciences. We hypothesize that the most popular cited notebooks in computer science and mathematics may cleanly demonstrate new techniques and models, rather than documenting an extensive data wrangling pipeline.
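The Stage Entropy feature and its normalization can be sketched as follows. The logarithm base and the choice between population and sample standard deviation are not stated in the text; this sketch assumes the natural logarithm and the population standard deviation, and the per-paper stage distributions are made up for illustration.

```python
import math
from statistics import mean, pstdev


def stage_entropy(fractions):
    """Shannon entropy of a stage distribution; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in fractions if p > 0)


def z_scores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up stage distributions over five stages for three papers.
papers = [
    [0.2, 0.2, 0.2, 0.2, 0.2],  # evenly documented analysis
    [0.6, 0.1, 0.1, 0.1, 0.1],  # dominated by one stage
    [1.0, 0.0, 0.0, 0.0, 0.0],  # a single stage only
]
entropies = [stage_entropy(p) for p in papers]
print(entropies)
print(z_scores(entropies))
```

Entropy is maximal (log 5 for five stages) when code is spread evenly across stages and zero when all code belongs to one stage, which is why it serves as a measure of how uniformly a paper's notebooks cover the analysis process.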
We note that although these effect sizes may seem large, we need to consider that the median citation count for papers is only two.This implies that even with a high citation multiplier, papers with just a few citations would expect a rather moderate increase in citations.
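A short worked example makes this point concrete. The sketch below applies the multipliers reported above to the median citation count of two; the helper names are hypothetical, and the base-10 conversion simply mirrors the form $10^{\beta}$ used in the text.

```python
def multiplier_from_coefficient(beta):
    """Citation multiplier implied by a coefficient, in the 10**beta form used above."""
    return 10 ** beta


def expected_citations(baseline, multiplier):
    """Expected citation count after applying a multiplicative effect."""
    return baseline * multiplier


MEDIAN_CITATIONS = 2  # median citation count reported in the text

# Multipliers as reported in the results (one standard deviation above the mean).
reported = [("Stage Entropy", 2.11), ("EXPLORE", 0.35)]
for name, multiplier in reported:
    expected = expected_citations(MEDIAN_CITATIONS, multiplier)
    change = expected - MEDIAN_CITATIONS
    print(f"{name}: {MEDIAN_CITATIONS} -> {expected:.2f} citations ({change:+.2f})")
```

For a paper at the median, a 2.11 times multiplier corresponds to roughly two additional citations, and a 0.35 times multiplier to a loss of roughly one, which is modest in absolute terms despite the large relative effect.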

VII. CONCLUSION
We presented CORAL, a novel weakly supervised neural architecture for generating representations of code snippets and classifying them as stages in the analysis pipeline. We showed that this model outperforms a suite of baselines on this new classification task. Further, we introduced and made public the largest dataset of code with associated publications for scientific data analysis, and employed CORAL to answer open questions about the data analysis process.

Fig. 3. Accuracy on expert-annotated test set for all baselines (a) and ablation studies (b). Performance improves with neural topic models and weak supervision. CORAL significantly outperforms all baselines (Wilcoxon signed rank, p < 0.001) and ablation studies (p < 0.05).

Fig. 4. Example predictions. Probability distributions over stages from CORAL's SoftMax output (Eq. (7)) are listed on the right side. In (a), CORAL correctly identifies the cell as EVALUATE rather than WRANGLE, likely by interpreting "confusion matrix", perhaps based on previously seen markdown. In (b), the model identifies the use of sns.regplot, an unseen statistical visualization function, as an example of EXPLORE. In (c), CORAL correctly interprets a user-defined function.

Fig. 6. Results from (R2), indicating differences in how paper impact in different domains is related to the content of associated notebooks.
Training Dataset Size. We evaluate the accuracy of CORAL and two other high-performing models with different training dataset sizes to gauge how sensitive our model is to training data size. We fix M to 160 and train with a maximum of 1M notebook cells. In all other experiments, we use the maximum 1M notebook cells for training. While performance consistently decreases with smaller training data (Table III), CORAL achieves an accuracy of 61.85% with only 1k examples and outperforms baselines by a large margin. This demonstrates that the CORAL architecture is effective at learning useful code representations even in smaller-data scenarios, such as on the order of magnitude of a typical GitHub repository.