2.1 Data collection
We began by identifying open source software packages that can serve as self-hosted alternatives to GitHub, including GitLab Community Edition (CE), Gitea, Gogs, cgit, RhodeCode, and SourceHut. We limited ourselves to platforms with a web-based git interface similar to mainstream centralized platforms like GitHub and GitLab, and so excluded command-line-only source code management tools like Gitolite, as well as more general project management software like Jira and Phabricator. For each software package, we identified a snippet of HTML from the package's web interface that uniquely identifies that software, often a version string or header such as <meta content="GitLab" property="og:site_name">.
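As an illustrative sketch (not the actual tooling used), a fingerprint check against a host's front page could look like the following; the check_host helper and the Gitea and Gogs marker strings are hypothetical placeholders, while the GitLab snippet is the one quoted above.

```python
# Hypothetical sketch of fingerprinting a host's front page.
import requests

FINGERPRINTS = {
    "gitlab": '<meta content="GitLab" property="og:site_name">',  # snippet quoted above
    "gitea": 'content="Gitea',   # placeholder marker, for illustration only
    "gogs": 'content="Gogs',     # placeholder marker, for illustration only
}

def check_host(url: str) -> str | None:
    """Return the package whose fingerprint appears in the host's front page, if any."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return None
    for package, snippet in FINGERPRINTS.items():
        if snippet in html:
            return package
    return None
```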
We then turned to Shodan [30] to find hosts running instances of each software package. Shodan maintains a verbose port scan of the entire IPv4 and some of the IPv6 Internet, including response information from each port, such as the HTML returned by each web server. This port scan is searchable, allowing us to list all web servers open to the public Internet that responded with our unique identifying HTML snippets. Notably, Shodan scans only include the default web page from each host, so if a web server hosts multiple websites and returns different content depending on the Host header of the HTTP request, then we will miss all but the front page of the default website. Therefore, Shodan results should be considered a strict undercount of public instances of these software packages. However, we have no reason to believe this produces a biased sample, as organizations of many sizes and purposes face similar trade-offs between dedicated and shared web hosting.
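A minimal sketch of such a search, assuming the official shodan Python client; the API key is a placeholder and the query string is a simplified stand-in for the full HTML fingerprint.

```python
import shodan

api = shodan.Shodan("SHODAN_API_KEY")   # placeholder API key
query = 'http.html:"GitLab"'            # simplified stand-in for the full HTML snippet

hosts = set()
for banner in api.search_cursor(query):  # pages through all matching banners
    hosts.add((banner["ip_str"], banner["port"]))
print(f"{len(hosts)} candidate hosts")
```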
We narrowed our study to the three software packages with the largest number of public instances: GitLab CE, Gogs, and Gitea. Searching Shodan, we found 59,596 unique hosts. We wrote a web crawler for each software package that attempted to list every repository on each host and reported when instances were unreachable (11,677), had no public repositories (44,863), or required login information to view repositories (2,101). We then attempted to clone all public repositories, again logging when a repository failed to clone, redirected our clone request, or required login information to clone. For each successfully cloned repository, we checked the first commit hash against the GitHub API and set aside repositories that matched GitHub content (see Sect. 2.4). We discarded all empty (zero-commit) repositories. This left us with 45,349 repositories from 1,558 distinct hosts.
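The clone-and-check step can be sketched as below; the first_commit_hash helper is hypothetical and simply uses git to recover a root commit hash of the kind checked against the GitHub API.

```python
import subprocess

def first_commit_hash(clone_url: str, dest: str) -> str | None:
    """Clone a repository and return a root (parentless) commit hash, or None on failure."""
    try:
        subprocess.run(["git", "clone", "--quiet", clone_url, dest],
                       check=True, timeout=600)
        out = subprocess.run(["git", "-C", dest, "rev-list", "--max-parents=0", "HEAD"],
                             check=True, capture_output=True, text=True)
    except subprocess.SubprocessError:
        return None  # unreachable, login-required, or otherwise failed clone
    hashes = out.stdout.split()
    # rev-list output is newest-first; a repository can have multiple roots,
    # in which case the last line is the oldest.
    return hashes[-1] if hashes else None  # empty output => zero-commit repository
```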
Next, we collected a sample of GitHub repositories against which to compare development practices. To account for trends in software development and other variation over time, we sought a similar number of repositories from a similar date range. We chose not to control for other repository attributes, such as predominant programming language, size of codebase or contributorship, or repository purpose: developers may weigh these attributes when choosing where to host their code, so controlling for them would inappropriately constrain our analysis. To gather this comparison sample, we drew from GitHub Archive [17] via their BigQuery interface to find an equal number of "repository creation" events for each month in which a Penumbra repository was created. We attempted to clone each repository, but found that some repositories had since been deleted, renamed, or made private. To compensate, we oversampled from GitHub Archive by a factor of 1.5 for each month. After data collection and filtering we were left with a time-matched sample of 57,914 GitHub repositories.
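A sketch of this sampling step, assuming the public githubarchive BigQuery dataset; the example month and sample size are placeholders rather than the exact query used.

```python
from google.cloud import bigquery

client = bigquery.Client()
month = "201801"        # placeholder: one month in which Penumbra repositories were created
sample_size = 1500      # placeholder: 1.5x the Penumbra repository count for that month

query = f"""
    SELECT repo.name AS repo_name
    FROM `githubarchive.month.{month}`
    WHERE type = 'CreateEvent'
      AND JSON_EXTRACT_SCALAR(payload, '$.ref_type') = 'repository'
    ORDER BY RAND()
    LIMIT {sample_size}
"""
repos = [row.repo_name for row in client.query(query).result()]
```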
Lastly, to help identify academic hosts, we used a publicly available list of university domains (Footnote 1). This is a community-curated list, and so may contain geographic bias, but it was the most complete set of university domains we located.
2.2 Host analysis
We used GeoIP lookups (Footnote 2) to estimate the geographic distribution of hosts found in our Penumbra scan. We also created a simple labeling process to ascertain how many hosts were universities or research labs: we extracted all unique email addresses from commits in each repository, labeled an email as academic if its hostname appears in our university domain list, and labeled a host as academic if over 50% of its unique email addresses are academic. This cutoff was established experimentally after viewing the distribution of academic email percentages per host, shown in the inset of Fig. 1(c). Under this cutoff, 15% of Penumbra hosts (130) were tagged as academic.
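The labeling rule reduces to a few lines; in the sketch below, matching subdomains of listed university domains is an assumption about the hostname check, and data loading is omitted.

```python
def is_academic_host(unique_emails: set[str], university_domains: set[str]) -> bool:
    """Label a host academic if over 50% of its unique commit-author emails are academic."""
    def academic(email: str) -> bool:
        domain = email.rsplit("@", 1)[-1].lower()
        # exact match, or a subdomain of a listed university domain (assumed behavior)
        return any(domain == d or domain.endswith("." + d) for d in university_domains)

    if not unique_emails:
        return False
    return sum(academic(e) for e in unique_emails) / len(unique_emails) > 0.5
```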
2.3 Repository analysis
We are interested in diverging software development practices between GitHub and the Penumbra, and so we measured a variety of attributes for each repository. To analyze the large number of commits in our dataset, we modified git2net [15] and PyDriller [38] to extract only commit metadata, ignoring the contents of binary “diff” blobs for performance. We measured the number of git branches per repository (later, in Fig. 2, we count only remote branches, and ignore origin/HEAD, which is an alias to the default branch), but otherwise concerned ourselves only with content in the main branch, so as to disambiguate measurements like “number of commits.”
From the full commit history of the main branch we gather the total number of commits, the hash and time of each commit, the length in characters of each commit message, and the number of repository contributors, identified by unique author email addresses. (Email addresses are not an ideal proxy for contributors; a single contributor may use multiple email addresses, for example if they have two computers that are configured differently. Unfortunately, git commit data does not disambiguate usernames. Past work [16, 42] has attempted to disambiguate authors based on a combination of their commit names and commit email addresses, but we considered this out of scope for our work. By not applying identity disambiguation to either the Penumbra or GitHub repositories, the use of emails-as-proxy is consistent across both samples. Indeed, if identity disambiguation were to introduce bias, for example by being more successful on the formulaic university email addresses found on academic Penumbra hosts than on GitHub data, then using raw emails as identifiers provides the more consistent view.) From the current state (head commit of the main branch) of the repository we measure the number of files per repository. This avoids ambiguity where files may have been renamed, split, or deleted in the commit history. We apply cloc (Footnote 3), the "Count Lines of Code" utility, to identify the top programming language per repository by file count and by lines of code.
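As a rough sketch of this metadata pass, off-the-shelf PyDriller exposes the needed fields as long as per-commit diff objects are never touched; the field names returned below are illustrative, and the sketch does not reproduce the performance modifications described above.

```python
from pydriller import Repository

def repo_metadata(path: str, branch: str = "main") -> dict:
    """Collect commit metadata from one branch; avoiding commit.modified_files skips diff parsing."""
    hashes, msg_lens, emails = [], [], set()
    for commit in Repository(path, only_in_branch=branch).traverse_commits():
        hashes.append(commit.hash)
        msg_lens.append(len(commit.msg))
        emails.add(commit.author.email)   # unique author emails as a contributor proxy
    return {
        "n_commits": len(hashes),
        "n_contributors": len(emails),
        "avg_msg_len": sum(msg_lens) / len(msg_lens) if msg_lens else 0,
        "first_commit": hashes[0] if hashes else None,
        "last_commit": hashes[-1] if hashes else None,
    }
```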
We also calculate several derived statistics. The average interevent time, the average number of seconds between commits per repository, serves as a crude indicator of how regularly contributions are made. We refine this metric as burstiness, a measure of the index of dispersion (or Fano Factor) of commit times in a repository [6]. The index of dispersion is defined as \(\sigma ^{2}_{w}/\mu _{w}\), or the variance over the mean of events over some time window w. Previous work defines “events” broadly to encompass all GitHub activity, such as commits, issues, and pull requests. To consistently compare between platforms, we define “events” more narrowly as “commits per day”. Note that while interevent time is only defined for repositories with at least two commits, burstiness is defined as 0 for single-commit repositories.
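A minimal sketch of these two statistics follows; treating the window as daily commit counts over the repository's active span (zero-commit days included) is an assumption about the exact windowing.

```python
from collections import Counter
from datetime import timedelta
from statistics import mean, pvariance

def interevent_and_burstiness(commit_times):
    """commit_times: commit datetimes sorted ascending; returns (mean interevent seconds, burstiness)."""
    if len(commit_times) < 2:
        return None, 0.0  # interevent time undefined; burstiness defined as 0
    gaps = [(b - a).total_seconds() for a, b in zip(commit_times, commit_times[1:])]
    per_day = Counter(t.date() for t in commit_times)
    start = commit_times[0].date()
    span = (commit_times[-1].date() - start).days + 1
    counts = [per_day.get(start + timedelta(days=i), 0) for i in range(span)]
    return mean(gaps), pvariance(counts) / mean(counts)  # index of dispersion sigma^2 / mu
```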
We infer the age of each repository as the amount of time between the first and most recent commit. One could compare the start or end dates of repositories using the first and last commit as well, but because we sampled GitHub by finding repositories with the same starting months as our Penumbra repositories, these measurements are less meaningful within the context of our study.
Following Klug and Bagrow [24], we compute three measures for how work is distributed across members of a team. The first, lead workload, is the fraction of commits performed by the "lead" or heaviest contributor to the repository. Next, a repository is dominated if the lead makes more commits than all other contributors combined (over 50% of commits). Note that all single-contributor repositories are implicitly dominated by that single user, and all two-contributor repositories are dominated unless both contributors have exactly equal numbers of commits, so dominance is most meaningful with three or more contributors. Lastly, we calculate an effective team size, estimating what the number of team members would be if all members contributed equally. Effective team size m for a repository with M contributors is defined as \(m = 2^{h}\), where \(h = - \sum_{i=1}^{M}{f_{i} \log _{2} f_{i}}\), and \(f_{i} = w_{i} / W\) is the fraction of work conducted by contributor i. For example, a team with \(M=2\) members who contribute equally (\(f_{1}=f_{2}\)) would also have an effective team size of \(m=2\), whereas a duo where one team member contributes 10 times more than the other would have an "effective" team size of \(m=1.356\). Effective team size is functionally equivalent to the Shannon entropy h, a popular index of diversity, but is exponentiated so that values are reported in numbers of team members rather than the units of h, which are typically bits or nats. Since we only consider commits as work (lacking access to more holistic data on bug tracking, project management, and other non-code contributions [3]), \(f_{i}\) is equal to the fraction of commits in a repository made by a particular contributor. Interpreting the contents of commits to determine the magnitude of each contribution (as in expertise-detection studies like [9]) would add nuance, but it would require building parsers for each programming language in our dataset and assigning subjective values to different kinds of contributions, and so is out of scope for our study. The effective team size metric therefore improves on a naive count of contributors, which would treat all contributors as equal even when their numbers of contributions differ greatly.
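Given per-commit author emails, these three measures reduce to a few lines; the input format in the sketch below is an assumption.

```python
import math
from collections import Counter

def team_measures(author_emails):
    """author_emails: one entry per commit. Returns (lead workload, dominated, effective team size)."""
    counts = Counter(author_emails)
    total = sum(counts.values())
    fractions = [c / total for c in counts.values()]   # f_i = w_i / W
    lead = max(fractions)                              # lead workload
    dominated = lead > 0.5                             # lead outnumbers all others combined
    h = -sum(f * math.log2(f) for f in fractions)      # Shannon entropy
    return lead, dominated, 2 ** h                     # effective team size m = 2^h

# e.g. team_measures(["a"] * 10 + ["b"]) -> lead 10/11, dominated, m ~ 1.356
```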
2.4 Duplication and divergence of repositories
It is possible for a repository to be an exact copy or "mirror" of another repository, and this mirroring may happen across datasets: a Penumbra repository could be mirrored on GitHub, for example. Quantifying the extent of mirroring is important for determining whether the Penumbra is a novel collection of open source code or whether it is mostly already captured within, for instance, GitHub. Likewise, a repository may have been a mirror at one point in the past, with subsequent edits causing one copy to diverge from the other.
Searching for git commit hashes provides a reliable way to detect duplicate repositories, as hashes are derived from the cumulative repository contents (Footnote 4) and, barring intentional collision attacks [39] against older hash algorithms, hash collisions are rare. To determine the novelty of Penumbra repositories, we searched for their commit hashes on GitHub, on Software Heritage (SH), a large-scale archive of open source code [1], and within the Penumbra sample itself (to determine the extent of mirroring within the Penumbra). Search APIs were used for GitHub and SH, while the Penumbra sample was searched locally. For each Penumbra repository, we searched for the first hash and, if the repository had more than one commit, the latest hash. If both hashes are found at least once on GitHub or SH, then we have a complete copy (at the time of data collection). If the first hash is found but not the second, then we know a mirror exists but has since diverged. If nothing is found, it is reasonable to conclude that the Penumbra project is novel (i.e., independent of GitHub and SH).
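A sketch of the per-hash lookups against the two public archives, using GitHub's commit-search API and Software Heritage's revision endpoint; authentication, rate limiting, and retries are omitted for brevity.

```python
import requests

def on_github(commit_hash: str) -> bool:
    """True if GitHub's commit search returns at least one repository containing this hash."""
    r = requests.get("https://api.github.com/search/commits",
                     params={"q": f"hash:{commit_hash}"},
                     headers={"Accept": "application/vnd.github+json"})
    return r.ok and r.json().get("total_count", 0) > 0

def on_software_heritage(commit_hash: str) -> bool:
    """True if Software Heritage has archived a revision with this hash."""
    r = requests.get(f"https://archive.softwareheritage.org/api/1/revision/{commit_hash}/")
    return r.status_code == 200
```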
To ensure a clean separation when comparing the Penumbra and GitHub samples, we excluded from our analysis (Sect. 2.3) any Penumbra repositories that were duplicated on GitHub, even if those duplicates had since diverged.
2.5 Statistical models
To understand better what features most delineate Penumbra and GitHub projects, we employ two statistical models: logistic regression and a random forest ensemble classifier. While both can in principle be used to predict whether a project belongs to the Penumbra or not, our goal here is inference: we wish to understand what features are most distinct between the two groups.
For logistic regression, we fitted two models. Exogenous variables were the numbers of files, contributors, commits, and branches; average commit message length; average editors per file; average interevent time, in hours; lead workload, the proportion of commits made by the heaviest contributor; effective team size; burstiness, as measured by the index of dispersion; and, for model 1 only, the top programming language as a categorical variable. Given differences in programming language choice between academia and industry [37], we wish to investigate any such differences when comparing Penumbra and GitHub projects (see also Sects. 2.1 and 3.3). A long tail of uncommon languages prevented convergence when fitting model 1, so we processed the categorical variable by combining the Bourne and Bourne Again shell languages and grouping languages that appeared in fewer than 1000 unique repositories into an "other" category before dummy coding. JavaScript, the most common language, was chosen as the baseline category. Missing values were present, due primarily to a missing top-language categorization and/or an undefined average interevent time. Empty or mostly empty repositories, as well as repositories with a single commit, cause these issues, so we performed listwise deletion on the original data, removing repositories from our analysis when any fields were missing. After processing, we were left with 67,893 repositories (47.26% Penumbra). Logistic models were fitted using Newton-Raphson, and odds \(e^{\beta}\) with 95% confidence intervals on the odds were reported.
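A sketch of model 2 (the model without language dummies) using statsmodels, assuming a pandas DataFrame df with one row per repository, illustrative column names, and an is_penumbra indicator; the dummy coding for model 1 is omitted.

```python
import numpy as np
import statsmodels.api as sm

features = ["n_files", "n_contributors", "n_commits", "n_branches",
            "avg_msg_len", "avg_editors_per_file", "avg_interevent_hours",
            "lead_workload", "effective_team_size", "burstiness"]  # assumed column names

df = df.dropna(subset=features)                    # listwise deletion
X = sm.add_constant(df[features].astype(float))
model = sm.Logit(df["is_penumbra"], X).fit(method="newton")

odds = np.exp(model.params)                        # e^beta
odds_ci = np.exp(model.conf_int())                 # 95% CI on the odds
```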
For the random forest model, feature importances were used to infer which features the model relied on most to distinguish between the two groups. We used the same data as logistic regression model 2, randomly divided into 90% training and 10% validation subsets. We fit an ensemble of 1000 trees to the training data using default hyperparameters; random forests were fit using scikit-learn v0.24.2. Model performance was assessed using an ROC curve on the validation set. Feature importances were measured with permutation importance, a computationally expensive measure of importance but one that is not biased in favor of features with many unique values [40]. Permutation importance was computed by measuring the fitted model's accuracy on the validation set; then, the values of a given feature were permuted uniformly at random across validation observations and validation accuracy was recomputed. The more the accuracy drops, the more important that feature is. Permutations were repeated 100 times per feature and the average drop in accuracy was reported. Note that permutation importance may be negative for marginally important features and that importance is only useful as a relative quantity for ranking features within a single trained (ensemble) model.
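A sketch of this procedure with scikit-learn, reusing df and features from the logistic-regression sketch above; variable names are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# df and features come from the logistic-regression sketch above
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["is_penumbra"], test_size=0.1, random_state=0)

forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1).fit(X_train, y_train)
auc = roc_auc_score(y_val, forest.predict_proba(X_val)[:, 1])   # validation ROC AUC

# Shuffle one feature at a time in the validation set; the average drop in accuracy
# over 100 repeats is that feature's permutation importance.
result = permutation_importance(forest, X_val, y_val, scoring="accuracy",
                                n_repeats=100, n_jobs=-1)
ranking = sorted(zip(features, result.importances_mean), key=lambda t: -t[1])
```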