Analyzing bugs.python.org
I was looking to contribute to CPython and I made my first PR which consisted of running aspell through the code looking for typos. I was quite amazed at how smooth the process is. CPython moved to GitHub last year and this has opened up to a lot of integrations like GitHub bots, PR being linked to the issue tracker and so on. This also helps people who want to contribute to the project to be up and running with less friction. CPython was using subversion before moving to mercurial in 2011. They again moved to git and GitHub last year. You can read the full history in the post by Brett Cannon. From time to time there were also easy issues that are triaged by the core developers for beginners to pick along with providing mentorship. This also helps in growing the team of contributors who can help in fixing and triaging the issues.
Similar posts I wrote on Clojure and Rust ecosystem.
- Analyzing around 10k modules in Clojure ecosystem to find JDK 9 issues
- Analyzing crates.io to find top rust dependencies and insecure dependencies
Obtaining bugs.cpython.org data
I wrote a little scraper that downloads each issue and since it’s a static page we need to parse the page to obtain relevant information. Since this is a quick hack I thought MongoDB will be a good fit where you can just dump the data as JSON and query easily. Hence a issue along with the title, python version against which it’s filed and comments looks as below. The code and JSON files are available on GitHub
{
"_id" : ObjectId("5b335855e63827190a6c1f75"),
"component" : [
"Tests"
],
"version" : [
"Python 3.8"
],
"title" : "Issue 33853: test_multiprocessing_spawn is leaking memory - Python tracker",
"content" : [
{
"author" : "Author: Pablo Galindo Salgado (pablogsal) *",
"date" : "Date: 2018-06-13 15:10",
"comment" : "The test `test_multiprocessing_spawn` is leaking memory according to the x86 Gentoo Refleaks 3.x buildbot:\n\n\nx86 Gentoo Refleaks 3.x\nhttp://buildbot.python.org/all/#/builders/1/builds/253\n\ntest_multiprocessing_spawn leaked [1, 2, 1] memory blocks, sum=4\n1 test failed again:\n test_multiprocessing_spawn\n\n\nx86 Gentoo Refleaks 3.7\nhttp://buildbot.python.org/all/#/builders/114/builds/135"
},
{
"author" : "Author: STINNER Victor (vstinner) *",
"date" : "Date: 2018-06-13 15:30",
"comment" : "Duplicate of bpo-33735."
}
]
}
Top issues by number of comments
I was interested to study issues that involved a lot of discussions that could provide a context of how a decision is made or a tough issue is fixed. Since we have content which has an array of comments I group by the length of the comments and then obtain the top issues by number of comments. The gives us the top issues by number of comments made along with the state of the issue. Adding a new regex module compatible with re is the most commented issue and it’s also still open. The issue was created on 2008-04-15 by timehorse.
Top authors by number of comments
Since CPython is driven by volunteers along with companies who use Python a lot employing core contributors I was also interested in the authors who made a lot of comments that is proportional to their activity in the bug tracker. This was a case of unwinding the content array that had the comments and then grouping by the author. This gives us the top authors by number of comments made. Victor Stinner commented more on the bug tracker followed by roundup robot which adds comments when a merge is made along with the changeset information and so on.
author | count |
---|---|
Author: STINNER Victor (vstinner) * | 15556 |
Author: Roundup Robot (python-dev) | 13359 |
Author: Serhiy Storchaka (serhiy.storchaka) * | 13147 |
Author: Antoine Pitrou (pitrou) * | 12480 |
Author: R. David Murray (r.david.murray) * | 8177 |
Author: Terry J. Reedy (terry.reedy) * | 6231 |
Author: Nick Coghlan (ncoghlan) * | 4939 |
Author: Éric Araujo (eric.araujo) * | 4776 |
Author: Mark Dickinson (mark.dickinson) * | 4251 |
Author: Raymond Hettinger (rhettinger) * | 4183 |
Top issues by Python version
Bug tracker has a mechanism where you can specify the Python version against which you want to file an issue. Often times there are multiple versions affected where you need to fix in the latest version and then backport them to supported versions. The issue distribution with respect to Python version shows Python 2.7 to be the most popular version with around 9.5k issues filed both open and closed. This is followed by Python 3.4 and other Python 3 versions.
version | count |
---|---|
Python 2.7 | 9516 |
Python 3.4 | 6387 |
Python 3.5 | 6295 |
Python 3.6 | 6037 |
Python 3.3 | 5480 |
Python 3.2 | 5012 |
Python 3.7 | 4269 |
Python 2.6 | 2997 |
Python 3.1 | 2650 |
Comment distribution
I also made a query to return the distribution of number of comments per issue.
msgs | count |
---|---|
2 | 5197 |
3 | 4437 |
4 | 3574 |
5 | 2843 |
6 | 2302 |
7 | 1818 |
8 | 1498 |
1 | 1272 |
9 | 1222 |
10 | 1060 |
Future development
Data enables us to visualize interesting points with the bug tracker being a central place with a lot of historical information and going through the bug tracker often gives us an idea of the efforts and details with respect to an issue providing more context. There was also a discussion on using GitHub issues for future development which could help in more integrations and smooth workflow. The devguide is also available on GitHub where you can suggest improvements to the workflow. CPython development has come a long way so far with GitHub and looking forward to more improvements on CPython development.
If you like the above post then you might like similar posts below :
- Analyzing around 10k modules in Clojure ecosystem to find JDK 9 issues
- Analyzing crates.io to find top rust dependencies and insecure dependencies
Closes emacs and pushes to git