The ecosystem and patterns around AI evolve at a rapid pace every week, with new model releases and shifts in paradigms such as MCP, agents and skills. There has been a recurring question about productivity improvement, and more specifically about the return on investment from AI, given the large amounts companies are investing. Large corporations have also announced hundreds of billions of dollars of investment for 2026 in their latest earnings calls. One key focus in measuring productivity improvement is the velocity of development. As an analogy, it has been a staple question in management entrance exams that if 5 people take 10 hours to build something, then 10 people can do it in 5 hours. Deterministic work can be measured this way, but software engineering involves edge cases, business decisions, legacy code and other dynamic factors that make this harder. One bug might not be the same as another, yet fixing the same bug again from scratch takes less time since you acquired the knowledge by doing it the first time. Books like The Mythical Man-Month go into detail about this.

I thought of analysing AI usage along similar lines to see if it has helped productivity in open source projects, with the disclaimer that open source is different from a software engineering day job and should not be taken as a 1:1 projection. Open source projects typically have a standard set of conventions for code and contributions. Most of the development happens in the open on GitHub in the form of issues and pull requests. So the question is whether there is an actual increase in the number of features aided by AI and in code contributed using AI now that it is easier. In this post I will outline some popular open source projects where AI is used for contributions, along with their AI policies.

Apache Spark

Apache Spark is one of the most prominent projects in the data engineering space. Since August 19, 2023 the PR template has included guidelines around the usage of AI in code contributions. Each PR needs to disclose whether any AI tool was used and how. With around 2.5 years of active development since the change and a major release in Apache Spark 4.0, I thought of cloning the repository and parsing all the commits to look for the answer to the question “Was this patch authored or co-authored using generative AI tooling?” in the commit messages. Sample commits with and without AI usage are shown below.

Analysing the commits surfaced the following key points:

  • Only 130 commits used AI while 8411 commits reported no AI usage. That is around 1-2% of all commits over two years using AI.
  • 2024 had only 9 such commits, 2025 had 23, and 2026, less than 45 days in, already has 35. Usage has been increasing year over year as models get better.
  • claude/sonnet/opus had 36 mentions, followed by 17 mentions of copilot and 16 mentions of cursor, indicating heavier usage of Anthropic models (see the tally sketch after this list).
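
The per-year and per-tool counts above can be tallied with a small extension of the commit walk. Below is a minimal sketch, assuming a local clone path and an illustrative keyword list; it is not the exact script used for the figures, so counts may differ slightly depending on how an answer line is classified.

import re
from collections import Counter
from datetime import datetime

from git import Repo

# Assumptions: local clone path and the tool keywords to look for.
repo = Repo("/home/karthikeyan/stuff/spark")
key = "Was this patch authored or co-authored using generative AI tooling"
tools = ("claude", "sonnet", "opus", "copilot", "cursor")

per_year = Counter()
per_tool = Counter()

for commit in repo.iter_commits():
    # The first non-empty line after the disclosure question holds the answer.
    lines = commit.message[commit.message.find(key) :].splitlines()[1:]
    answer = next((line for line in lines if line.strip()), "").lower()
    if re.search(r"\byes\b", answer):
        per_year[datetime.fromtimestamp(commit.authored_date).year] += 1
        for tool in tools:
            if tool in answer:
                per_tool[tool] += 1

print(per_year.most_common())
print(per_tool.most_common())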

Commit not using generative AI

commit 5593f92a95033eaa960ea8f0d676e04fa6471ea7 (HEAD -> master, origin/master, origin/HEAD)
Author: Kousuke Saruta <sarutak@amazon.co.jp>
Date:   Fri Feb 13 16:08:47 2026 -0800

    [SPARK-55526][WEBUI][TESTS] Add `glob` package to `ui-test`

    ### What changes were proposed in this pull request?
    This PR proposes to add `glob` package to `ui-test`.

    ### Why are the changes needed?
    This package is necessary for E2E test to be added in #54315

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    GA.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #54320 from sarutak/add-glob-package-ui-test.

    Authored-by: Kousuke Saruta <sarutak@amazon.co.jp>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Commit using generative AI

commit 53e759ec8598d4641c1ca9b018f1f3401251d24a
Author: Uros Bojanic <uros.bojanic@databricks.com>
Date:   Fri Feb 13 23:26:41 2026 +0800

    [SPARK-55339][GEO][SQL] Implement WKT writer support for Geo objects

    ### What changes were proposed in this pull request?
    Enable `toWKT` for Well-known text (WKT) representation of geospatial objects.

    Same as WKB (Well Known Binary) support, our custom implementation avoids third-party dependencies such as JTS.

    ### Why are the changes needed?
    Text-based geospatial objects are necessary for result set support & thrift server enablement.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Added new unit tests.

    ### Was this patch authored or co-authored using generative AI tooling?
    Yes, Claude 4.5 Opus.

    Closes #54114 from uros-db/geo-wkt-write.

    Authored-by: Uros Bojanic <uros.bojanic@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Here is the Python script that checks for the question in the commit message and looks at whether the next non-empty line is a yes, optionally followed by additional comments about how AI was used.

import re
from collections import defaultdict
from datetime import datetime

from git import Repo

repo = Repo("/home/karthikeyan/stuff/spark")
key = "Was this patch authored or co-authored using generative AI tooling"
using_ai = defaultdict(int)

for commit in repo.iter_commits():
    # Look at the first non-empty line after the disclosure question.
    for line in commit.message[commit.message.find(key) :].splitlines()[1:]:
        if line.strip():
            if match := re.search(r"(yes[.\s,]?)(.*)", line.lower(), re.I):
                # Print the extra details about how AI was used, if any.
                if match.groups()[1]:
                    print(
                        f"{datetime.fromtimestamp(commit.authored_date)} {match.groups()[1]}"
                    )
                using_ai[True] += 1
            elif match := re.search(r"^\s*no(.*)", line.lower(), re.I):
                using_ai[False] += 1
            break

print(f"Used AI: {using_ai[True]}")
print(f"Not Used AI: {using_ai[False]}")

$ git clone --shallow-since="2023-08-01" https://github.com/apache/spark.git
$ python process_repo_spark.py
2026-02-13 20:56:41  claude 4.5 opus.
2026-02-05 09:38:32  cursor with claude-4.5-opus-high was used to assist with coding and generate some of documentation and tests.
2026-02-05 06:16:06  generated-by: cursor ai assistant
2026-02-04 01:08:39  sonnet 4.5
2026-02-03 11:47:13 (`opus 4.5` on `claude code v2.1.5`)
2026-02-03 10:59:36  code assistance with claude opus 4.5 in combination with manual editing by the author.
2026-02-02 16:30:27  github copilot cli assisted with analysis and implementation.
2026-02-02 01:46:03  sonnet 4.5
2026-01-30 23:52:21  github copilot was used to assist with this change.
2026-01-30 20:03:51  github copilot was used to assist with this implementation.
2026-01-30 19:54:07  cursor 2.3
2026-01-30 19:25:38  github copilot was used to assist with this change.
2026-01-30 12:26:21  github copilot was used to assist with this change.
2026-01-30 09:04:00  code assistance with claude opus 4.5 in combination with manual editing by the author.
2026-01-29 08:24:26 - claude opus 4.5
2026-01-29 08:17:03 claude 4.5 sonnet
2026-01-29 01:21:45  code assistance with claude opus 4.5 in combination with manual editing by the author.
2026-01-28 20:07:17  cursor 2.4
2026-01-27 01:50:57  used claude-sonnet-4.5 for testing and refactoring
2026-01-23 15:20:06  cursor 2.3.41
2026-01-23 04:17:42  github copilot was used to assist with this change.
2026-01-22 06:04:19  claude 4.5 opus
2026-01-20 15:37:33  github copilot was used to assist with reviewing spark-54373 and applying the same pattern to spark-sql-viz.js.
2026-01-20 12:39:22  cursor 2.3.41
2026-01-20 09:15:49  cursor (claude-4.5-opus-high)
2026-01-20 09:04:23  cursor 2.3.41
2026-01-19 21:59:41  github copilot was used to assist with code development.
2026-01-19 06:25:19  cursor (claude-4.5-opus-high)
2026-01-16 23:50:15  `claude-4.5-opus-high` plus manual review and editing.
2026-01-16 17:04:44  cursor (claude-4.5-opus-high)
2026-01-16 07:08:16  sonnet 4.5
2026-01-10 01:39:08  claude-4.5-opus
2026-01-09 08:20:01  cursor with claude-4.5-sonnet was used to assist with coding and generate some of documentation and tests.
2026-01-08 02:22:24  claude-4.5-opus
2026-01-07 13:33:47  code assistance with claude opus 4.5 in combination with manual editing by the author.
2025-12-18 06:56:02  sonnet-4.5 and opus-4.5
2025-12-15 11:21:10  with assistance from `claude-4.5-opus-high` with manual review and adjustment.
2025-12-11 01:18:59  sonnet 4.5
2025-12-09 05:57:01  partly generated-by: claude code.
2025-12-04 05:23:03  code assistance with `claude-4.5-opus-high` in combination with manual editing by the author.
2025-12-03 23:10:22  haiku, sonnet.
2025-11-15 04:59:10  `claude-4.5-sonnet` with manual editing and approval.
2025-11-15 00:47:18  `claude-4.5-sonnet` with manual review and editing.
2025-11-13 12:48:06  co-generated-by cursor
2025-11-13 09:17:21  clause sonnet 4.5
2025-11-12 12:04:01  co-genreated-by cursor
2025-11-08 10:12:39  asked cursor for the draft.
2025-11-08 02:54:55  claude gave me suggestions to improve documentation.
2025-11-07 04:18:31  ide assistance used `claude-4.5-sonnet` with manual validation and integration.
2025-11-05 01:52:29  code assistance with `claude-4.5-sonnet` in combination with manual editing by the author.
2025-10-31 12:03:05  tests generated by cursor.
2025-10-21 04:12:19  with the help of claude code.
2025-10-08 23:20:31  generated-by cursor and 'claude-4-sonnet'
2025-08-29 04:47:22  used claude to generate the table printing utils. generated-by: claude-sonnet-4
2025-07-22 18:20:37  a little bit of copilot
2025-06-23 08:24:22  test data was generated using ai agent.
2025-02-05 02:29:00  copilot.
2025-01-20 05:28:01  to generate protobuf messages.
2024-11-15 06:47:13  copilot was used.
2024-11-13 21:00:16  copilot used.
2024-09-27 19:18:35  some code suggestions
2024-07-27 11:01:48  i used perplexity.ai to get guidance on converting some scala code to java code and java code to python code.
2024-05-25 09:08:07  generated-by: github copilot
2024-04-04 05:16:21  generated-by: github copilot 1.2.17.2887
2024-04-01 12:45:24  github copilot
2024-03-29 11:28:51  some of the code comments are from github copilot
2024-03-21 03:47:23  there are some doc suggestion from copilot in docs/sql-migration-guide.md
Used AI: 130
Not Used AI: 8411

Apache Airflow

Apache Airflow is a platform created by the community to programmatically author, schedule and monitor workflows. I have been a committer to Apache Airflow since 2024. Airflow 3 was a major release where AI was used to translate the UI into different languages. Recently the project has also received a lot of PRs that don't work, with auto-generated PR descriptions, taking a toll on the maintainers. After an extensive discussion the contributing docs have been updated to require a disclosure similar to Spark's.

CPython

CPython is the reference implementation of the Python programming language. I have been a committer to CPython since 2020, though with fewer contributions in recent years. CPython doesn't seem to have any official policy on AI usage attribution. Python is a crucial component of the AI ecosystem. Checking commit messages since 2023-01-01 for any mention of the models turned up only a few commits.

import re
from collections import defaultdict
from datetime import datetime

from git import Repo

repo = Repo("/home/karthikeyan/stuff/python/cpython")
using_ai = defaultdict(int)

for commit in repo.iter_commits(since=datetime(2023, 1, 1)):
    ai_used = False
    for line in commit.message.splitlines():
        # Flag the commit if any line mentions a model or AI coding tool.
        if line.strip() and (
            match := re.search(
                r"(.*(claude|opus|sonnet|copilot|gpt).*)", line.lower(), re.I
            )
        ):
            print(f"{datetime.fromtimestamp(commit.authored_date)} {match.groups()[0]}")
            ai_used = True
            break

    using_ai[ai_used] += 1

print(f"Used AI: {using_ai[True]}")
print(f"Not Used AI: {using_ai[False]}")

$ python process_repo_spark.py
2026-02-10 18:38:33 co-authored-by: claude opus 4.5 <noreply@anthropic.com>
2026-02-10 15:43:40 co-authored-by: claude opus 4.5 <noreply@anthropic.com>
2026-02-04 14:15:15 co-authored-by: claude opus 4.5 <noreply@anthropic.com>
2026-01-02 11:33:05 based on my exploratory work done in https://github.com/python/cpython/compare/main...gpshead:cpython:claude/vectorize-base64-c-s7hku
2025-12-05 22:47:01 co-authored-by: claude sonnet 4.5 <noreply@anthropic.com>
2025-11-29 11:37:03 co-authored-by: claude opus 4.5 <noreply@anthropic.com>
2025-11-29 09:55:06 🤖 generated with [claude code](https://claude.ai/code)
2025-08-09 10:59:51 * ungendered octopus
2025-08-06 02:20:51 co-authored-by: claude <noreply@anthropic.com>
2025-05-30 23:16:16 .gitignore personal claude code configs (#134942)
2025-03-06 04:01:42 commit-message-mostly-authored-by: claude sonnet 3.7 (because why not -greg)
2025-03-03 07:31:45 commit-message-mostly-authored-by: claude sonnet 3.7 (because why not -greg)
Used AI: 12
Not Used AI: 14537

.NET

.NET developers actively use Copilot. It received a lot of attention on Reddit when it was first introduced. .NET core developers acknowledged this was intentional, and it continues to be used across the repos. Filtering by Copilot shows hundreds of PRs created across repositories under the dotnet organization.
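
As a rough illustration of how such a count can be pulled, the snippet below queries the GitHub search API for PRs in the dotnet organization that mention Copilot in the title. The query string is an assumption (the exact qualifier identifying Copilot-authored PRs may differ), and unauthenticated requests are rate limited, so treat this as a sketch rather than the filter behind the numbers above.

import requests

# Assumed query: PRs under the dotnet org mentioning Copilot in the title.
# Adjust the qualifiers (and add an Authorization header) as needed.
query = "org:dotnet is:pr copilot in:title"
resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": query, "per_page": 1},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()
print(f"Matching PRs: {resp.json()['total_count']}")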

cURL

cURL is a tool for transferring data from or to a server using URLs. As per the website, curl has over 20 billion installations and is used daily by virtually every Internet-using human on the globe. Due to the increased number of invalid AI-generated security reports taking up a lot of maintainer time, the project closed its bug bounty program last month.

AI policy

Open source project AI policies and discussions

While writing this post I also found the following policies and discussions on AI across open source projects.

Projects with an explicit ban on AI code

  • NetBSD - Code generated by a large language model or similar technology, such as GitHub/Microsoft’s Copilot, OpenAI’s ChatGPT, or Facebook/Meta’s Code Llama, is presumed to be tainted code, and must not be committed without prior written approval by core.
  • Gentoo

Conclusion

I have recently started using AI to explain a lot of things to me. I took its help in understanding the SQLAlchemy session lifecycle, which helped in improving performance. AI is able to reason through problems and guide developers or validate their assumptions so that the developer can implement things better. As with other tools, the underlying fundamentals of good software cannot be outsourced or acquired without experience. So it is a good practice to use the tool as necessary while still understanding the tradeoffs of the implementation to arrive at the best solution. This increases productivity and also builds personal knowledge and experience, which helps a contributor become more proficient with maintaining the project once they become a maintainer. It helps in developing an acquired taste for good code and an aversion to bad code, which is important for long term maintenance, especially in open source projects.

As the usage of AI grows, open source projects are also taking a stronger stance with reduced tolerance for AI slop. Since a lot of open source maintainers already work within the very limited free time left after work, family and hobbies, it has become hard to dedicate that limited personal time towards reviewing the ever growing amount of code generated by AI backed by a virtually infinite pool of investment. AI should be used with adequate discretion and discipline. AI output should be reviewed by the author and manually tested as required to adhere to the project's guidelines, just like in the pre-AI era.

Open source communities are built upon a lot of trust. Careful usage of AI increases trust, while careless AI slop takes it in the other direction. An increase in AI usage without a proper understanding of the implications raises the cognitive load on maintainers, who need to apply more scrutiny than usual due to the lack of trust. There is also a lot of push from vested interests in AI predicting the end of software engineering every day. Criticism from open source maintainers backed by practical evidence should be seen as healthy, as each project is entitled to its own policies. The matplotlib incident in particular, where AI was unleashed upon maintainers, has created growing displeasure in recent weeks about not only the contributions themselves but also the social implications of maintaining and accepting or rejecting a contribution. This should not be perceived as anti-AI sentiment or as insecurity about planned obsolescence, but as taking into account the limited resource of maintainers' personal time.