
Code review regression analysis of open source GitHub projects


Thompson, Christopher; Wagner, David (2017), Code review regression analysis of open source GitHub projects, Dataset, https://doi.org/10.6078/D14X0T


This dataset contains the repository data used for our study "A Large-Scale Study of Modern Code Review and Security in Open Source Projects". This dataset was collected from GitHub, and includes 3,126 projects in 143 languages, with 489,038 issues and 382,771 pull requests. We also include the regression analysis notebooks for reproducing our results from this data.


We drew from the sub-population of GitHub repositories that had at least 10 pushes, 5 issues, and 4 contributors from 2012 to 2014. We used the GitHub Archive, a collection of all public GitHub events, to generate a list of all such repositories. This gave us 48,612 candidate repositories in total. From this candidate set, we randomly sampled 5,000 repositories.

We wrote a scraper to pull all non-commit data (such as descriptions, and issue and pull request text and metadata) for a GitHub repository through the GitHub API, and used it to gather data for each repository in our sample. After scraping, we had 4,937 repositories; the remainder were lost to churn in GitHub repositories. For each language used by each repository, we manually labeled it on two independent axes: whether it is a programming language, and whether it is memory-safe. We used two quantification models (as explained in our paper) to estimate the number of issues in each repository that were security bugs. The results of each model are in separate dataset files (`repos_data_nn.csv` and `repos_data_rfcc.csv`).
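The selection step described above (threshold filter, then a uniform random sample) can be sketched as follows. This is a minimal illustration and not the code used in the study: the `RepoActivity` record, its field names, the fixed seed, and the synthetic data are all assumptions introduced here, and the aggregation of GitHub Archive events into per-repository counts is not shown.

```python
import random
from typing import NamedTuple

# Thresholds stated in the text (activity counted from 2012 to 2014).
MIN_PUSHES = 10
MIN_ISSUES = 5
MIN_CONTRIBUTORS = 4
SAMPLE_SIZE = 5000


class RepoActivity(NamedTuple):
    """Hypothetical per-repository summary aggregated from event data."""
    name: str
    pushes: int
    issues: int
    contributors: int


def is_candidate(repo):
    """Return True if the repository meets all three activity thresholds."""
    return (repo.pushes >= MIN_PUSHES
            and repo.issues >= MIN_ISSUES
            and repo.contributors >= MIN_CONTRIBUTORS)


def sample_candidates(repos, k=SAMPLE_SIZE, seed=0):
    """Filter to candidates, then draw a uniform sample without replacement."""
    candidates = [r for r in repos if is_candidate(r)]
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return rng.sample(candidates, min(k, len(candidates)))


if __name__ == "__main__":
    # Synthetic data, for illustration only.
    repos = [
        RepoActivity("a/one", pushes=12, issues=7, contributors=4),
        RepoActivity("b/two", pushes=9, issues=50, contributors=10),
        RepoActivity("c/three", pushes=100, issues=5, contributors=3),
        RepoActivity("d/four", pushes=10, issues=5, contributors=4),
    ]
    print(sorted(r.name for r in sample_candidates(repos, k=2)))
```

Only repositories that satisfy all three thresholds enter the candidate pool, so `b/two` (too few pushes) and `c/three` (too few contributors) are excluded before sampling.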

Usage Notes

Our main analysis (as reported in our paper) is contained in the Jupyter notebook `Regression.ipynb`. To run it, you need a running Jupyter instance with the R kernel installed. We ran these analyses using R version 3.3.1 and the `ggplot2`, `reshape2`, `plyr`, `car`, `tibble`, and `ggfortify` packages. Full system details are available at the bottom of each notebook. Additionally, we include the extracted pure R version of each notebook, as well as pre-rendered static HTML versions. Both can be viewed without installing Jupyter or R.
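Outside the notebooks, a quick look at the two dataset files can confirm that they downloaded intact. The sketch below uses only the Python standard library and reports each file's column names and row count. It assumes the files are UTF-8 encoded, have a header row, and sit in the current directory. None of that is stated in the dataset description, and the column names themselves are not assumed.

```python
import csv

# File names as given in the dataset description.
DATASET_FILES = ["repos_data_nn.csv", "repos_data_rfcc.csv"]


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # assumption: first row holds column names
        row_count = sum(1 for _ in reader)
    return header, row_count


if __name__ == "__main__":
    for path in DATASET_FILES:
        try:
            columns, n_rows = summarize_csv(path)
        except FileNotFoundError:
            print(f"{path}: not found in the current directory")
            continue
        print(f"{path}: {n_rows} rows, {len(columns)} columns")
        print("  columns:", ", ".join(columns))
```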