GitHub-Microsoft’s Other Subsidiary Is an Underrated AI Giant

OpenAI is the poster child for artificial intelligence mainly because of its high-profile lobbying of global governments and the popularity of its ChatGPT large language model. But GitHub is arguably just as pivotal in the industry, hosting many open-source AI and machine learning models, apps, code, patches, and discussions. It’s where AI developers can engage with the community and even provide customer service.

GitHub is also a defendant in the first class-action lawsuit against AI companies. The suit, filed on behalf of a consortium of open-source programmers, alleges GitHub Copilot directly plagiarizes their code without obtaining consent or giving proper compensation and credit. GitHub also hosts several NSFW forks of Stable Diffusion often used for illicit purposes like creating undress apps and nonconsensual deepfake porn.

Legal action against GitHub, Microsoft, and OpenAI (which will be heavily influenced by the US Copyright Office’s public call for comment) will significantly impact what happens moving forward. Although GitHub says it operates independently, it’s clear the platform is essential to Microsoft and will likely have the strength of its parent’s corporate legal department behind it.

Let’s dive into what GitHub is doing and what the alternatives are.

TL;DR

  • GitHub is among the most popular communities for sharing AI code, datasets, and tutorials.
  • Although it encourages open source, its status as a Microsoft subsidiary concerns many developers.
  • GitHub Copilot is a coding-specific generative AI model developed by GitHub, OpenAI, and Microsoft. It’s integrated into a variety of applications and platforms.
  • Copilot can regurgitate exact code from the repositories it was trained on, and lawsuits allege this infringes upon the IP rights of the license holders.
  • Although alternatives from GitLab to SourceForge exist, they all face the fundamental problem of being scraped for AI training.
  • Advocates worry that open source in the age of AI may allow enterprises to monopolize data, defeating the original intent of democratizing resources for all.

Background

About Microsoft

Microsoft Corporation, an American multinational technology giant headquartered in Redmond, Washington, dominates the tech landscape with its vast software and hardware products. Founded by Bill Gates and Paul Allen in 1975, the company initially focused on developing BASIC interpreters for the Altair 8800.

Microsoft quickly rose to prominence in the 1980s by dominating the personal computer operating system market with MS-DOS, followed by Windows. Its 1986 IPO, and the rise in its share price that followed, turned three employees into billionaires and roughly 12,000 into millionaires.

Although highly successful, Microsoft hasn’t been without its challenges. Criticized for monopolistic practices, the company also faced issues regarding the ease of use, robustness, and security of its software. Because of its deep pockets and litigious nature, it remains one of the Big Five in the American tech industry, alongside Alphabet, Amazon, Apple, and Meta Platforms.

It’s also a leader in AI: its subsidiary GitHub and OpenAI, in which it is a major investor, lead the pack in AI development and innovation. Its Azure platform also competes against the likes of Amazon Web Services (AWS) and Google Cloud for hosting these workloads.

About GitHub

GitHub is the world’s largest platform for software development and version control, offering cloud-based services that use Git to enable developers to store, manage, and collaborate on code. Founded in 2007 by Tom Preston-Werner, Chris Wanstrath, P. J. Hyett, and Scott Chacon, GitHub grew exponentially to host over 372 million repositories and serve more than 100 million developers as of January 2023. Headquartered in California, it became a subsidiary of Microsoft in 2018.

Initially a bootstrapped startup, GitHub received significant venture capital funding over the years, including a $100 million investment from Andreessen Horowitz in 2012 and another $250 million in 2015 from a group of investors. The company was valued at around $2 billion before its acquisition by Microsoft for $7.5 billion. Under Microsoft’s ownership, GitHub says it continues to operate independently, focusing on community engagement and open-source development.

GitHub offers a range of features beyond code hosting, including access control, bug tracking, task management, and continuous integration. It has played a pivotal role in the open-source movement, hosting millions of public repositories that contribute to the global software ecosystem. The platform also engages with the educational sector, providing free resources to schools through GitHub Education.

Despite its success, GitHub has faced controversies, including harassment allegations in 2014 that led to organizational changes. It has also faced skepticism from the developer community concerning its acquisition by Microsoft, although the platform has maintained its core focus and expanded its services since then, including through acquisitions like Semmle and npm. Today, GitHub remains a cornerstone of the coding world, shaping how developers collaborate and build software, and this includes hosting a variety of AI models and code.

About OpenAI

OpenAI is a U.S.-based AI enterprise disguised as a research lab, founded in 2015 with a focus on developing “safe and beneficial” artificial general intelligence (AGI). Initially a non-profit, OpenAI transitioned to a “capped” for-profit model in 2019 to attract investments and top talent. Microsoft has been a key investor, infusing $1 billion in 2019 and bringing the total to $13 billion in 2023.

OpenAI introduced generative AI products like GPT-3 and DALL-E to the market, and has commercial partnerships with Microsoft. Its mission, governance, and for-profit transition have been subjects of public discussion. As of 2023, OpenAI is valued at $29 billion and has made strategic acquisitions like the New York-based start-up Global Illumination. It is on track to earn $1 billion in revenue in 2023.

The company faces criticism for outsourcing the annotation of toxic content to Sama, a company that paid its Kenyan workers between $1.32 and $2.00 per hour post-tax, while OpenAI paid Sama $12.50 per hour. The work allegedly left some employees mentally scarred.

OpenAI has also been criticized for a lack of transparency around technical details of products like ChatGPT, DALL-E 2, and GPT-4, which makes independent research difficult and goes against its initial commitment to openness. Legal issues have also emerged, including copyright infringement lawsuits from authors and potential action from The New York Times, as well as a lawsuit for violating the EU’s General Data Protection Regulation (GDPR). In response, the European Data Protection Board (EDPB) set up a dedicated ChatGPT task force in April 2023 for better oversight.

Examining GitHub Copilot

GitHub Copilot is an AI-powered code completion tool developed by GitHub in collaboration with OpenAI and Microsoft. It is based on OpenAI’s Codex programming model (although it also leverages GPT-4) and can output functional code in a wide range of programming languages. It was initially available only as a Visual Studio Code extension but has since expanded into other apps and platforms.

Copilot assists developers by suggesting whole lines or blocks of code as they type. It draws from a large dataset of public code repositories (including those hosted on its own platform) to offer its suggestions, making it theoretically useful for various programming languages and tasks.

It’s marketed as a tool to increase productivity by reducing the need to write boilerplate code, search for code snippets, or read documentation for simple tasks. It can generate code for loops, functions, and even more complex structures like web scraping tasks or API integrations.

However, Copilot’s suggestions should be reviewed carefully for accuracy, security, and compliance with best practices. It’s also important to understand where the code comes from, as it can contain glitches, exploits, and other problems.

Growing Concerns of GitHub Copilot

Although marketed as a miracle tool, GitHub Copilot is not highly regarded among the developer community. Approximately one million developers use the tool, which accounts for a small percentage of the 27.7 million developers worldwide in 2023.
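To put that share in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the two figures quoted above, both of which are approximate, so treat the result as a rough estimate rather than a precise adoption rate.

```python
# Approximate figures quoted in the article, not exact counts.
copilot_users = 1_000_000
developers_worldwide = 27_700_000

# Fraction of developers worldwide who use Copilot.
share = copilot_users / developers_worldwide

print(f"Copilot users as a share of all developers: {share:.1%}")
```

By these numbers, fewer than four in every hundred developers use the tool.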

The small group using Copilot often complains about problems like “crappy code,” and AI-assisted code can be less secure than code written by hand, according to a study from Stanford University titled “Do Users Write More Insecure Code with AI Assistants?”

Even if the code does not contain malware, it can still reintroduce known attack vectors that were patched years ago, let alone account for zero-day exploits. And Copilot itself can act like malware, leaking secrets like keys, credentials, and passwords. Major corporations like Samsung have banned employees from using ChatGPT for fear of leaking proprietary company secrets.

Hilariously, Microsoft itself accidentally exposed 38TB of its own internal secrets through GitHub this week. That data was potentially picked up by data scrapers and could be used in cyber crimes committed with competing models like HackGPT, which are focused primarily on creating AI-generated malware.
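One way to reduce the risk of leaking secrets is to scan code for obvious credentials before sharing it with any AI assistant or public repository. The following is a minimal sketch in Python, not a real security tool: the pattern names and the `find_secrets` helper are hypothetical, and the three regular expressions are illustrative assumptions covering only a few well-known shapes (an AWS-style access key ID, a PEM private key header, and a quoted password assignment).

```python
import re

# Illustrative patterns only; dedicated scanners ship far larger rule sets.
SECRET_PATTERNS = {
    # AWS-style access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Header line of a PEM-encoded private key.
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # A quoted value assigned to something containing the word "password".
    "hardcoded_password": re.compile(
        r"password\s*[:=]\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that looks suspicious."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


# Made-up sample input; none of these values are real credentials.
sample = (
    'db_password = "hunter2"\n'
    'print("hello")\n'
    'key = "AKIAABCDEFGHIJKLMNOP"\n'
)

for line_number, kind in find_secrets(sample):
    print(f"line {line_number}: possible {kind}")
```

A check like this catches only the most obvious mistakes. It will miss secrets that do not match a known pattern, so it complements careful review and does not replace it.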

It’s unclear how much its small userbase relies on it, but a GitHub report claims the tool is particularly effective for less experienced developers, boosting their productivity. On average, users accepted Copilot’s code suggestions about 30 percent of the time, with the rate of acceptance increasing as developers became more familiar with the tool. However efficient it becomes, the outputs still carry inherent legal dangers.

The US Copyright Office currently does not consider AI-generated outputs eligible for copyright protection, which is a growing concern. This means that companies using Copilot in their workflows could struggle to protect their work and face increased legal risk.

Copilot also sparked discussions around the ethics of code generation, specifically concerning code licensing and the use of open-source code to train the model.

Software developers argue that Copilot and Codex were created using their code without explicit permission, and sometimes even reproduce it, disregarding the terms under which they licensed their work.

U.S. District Judge Jon Tigar so far found it plausible to consider an amended complaint that provides more clarity on the supposed injuries arising from Copilot and Codex. If Copilot or Codex can generate code specifically attributable to one of the plaintiffs, it would be difficult for GitHub, Microsoft, and OpenAI to reconcile with the law.

The case elicited strong opinions, with some developers fearing harm from having their code used without permission. GitHub and Microsoft argued that the plaintiffs’ fears were unreasonable, but the judge disagreed, allowing the plaintiffs to continue their claim pseudonymously based on threats received.

Both companies remain optimistic about the case’s prospects, reiterating their commitment to responsible innovation with Copilot, but many are seeking alternatives.

Viable GitHub Alternatives

If you’re considering migrating away from GitHub, the process is generally straightforward but requires careful planning. Ensure you have a local copy of all your repositories, issues, and other data you wish to transfer.

Many platforms offer import tools that can help you move your repositories with minimal hassle, preserving commit histories, branches, and tags. Before making the move, update your README files and documentation to inform your user community about the transition and where they can find the new repositories.
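If you would rather move the Git history yourself, a mirror clone followed by a mirror push copies every branch and tag. Here is a minimal Python sketch that wraps those two standard git commands. The function names, the `repo-mirror.git` directory name, and the example URLs are all hypothetical placeholders, and the script defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path


def build_mirror_commands(old_url: str, new_url: str, workdir: Path) -> list[list[str]]:
    """Return the git invocations that copy all refs from old_url to new_url."""
    mirror_dir = str(workdir / "repo-mirror.git")
    return [
        # --mirror clones every ref (branches, tags) into a bare repository.
        ["git", "clone", "--mirror", old_url, mirror_dir],
        # Pushing with --mirror recreates those refs on the new host.
        ["git", "-C", mirror_dir, "push", "--mirror", new_url],
    ]


def migrate(old_url: str, new_url: str, workdir: Path, dry_run: bool = True) -> list[list[str]]:
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    commands = build_mirror_commands(old_url, new_url, workdir)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


# Placeholder URLs; substitute your own repositories before a real run.
planned = migrate(
    "https://github.com/example-org/widgets.git",
    "https://gitlab.example.com/example-org/widgets.git",
    Path("migration-work"),
)
```

Note that this moves only what lives in Git. Issues, pull requests, and wikis are stored by the hosting platform, so those still need the destination platform’s import tool.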

After migrating your code, don’t forget to adjust any dependencies or links that point to your old GitHub repositories. Update the CI/CD pipelines, webhooks, and any other integrations you might have in place to reflect the new repository URLs.
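Updating those links by hand is error-prone, so a small script can help. The sketch below, in Python, rewrites both HTTPS and SSH GitHub URLs for a single owner inside a block of text such as a CI config or README. The `rewrite_repo_urls` helper, the `example-org` owner, and the `gitlab.example.com` host are hypothetical, and the sketch assumes you want HTTPS URLs on the new host.

```python
import re


def rewrite_repo_urls(text: str, old_owner: str, new_base: str) -> str:
    """Point GitHub URLs for old_owner at new_base, leaving other owners alone."""
    pattern = re.compile(
        # Match the HTTPS form or the SSH form, then the owner and a slash.
        r"(?:https://github\.com/|git@github\.com:)" + re.escape(old_owner) + r"/"
    )
    replacement = new_base.rstrip("/") + "/"
    # A function is used as the replacement so it is inserted literally.
    return pattern.sub(lambda _match: replacement, text)


# Made-up configuration text for illustration.
config = (
    "repo: https://github.com/example-org/widgets.git\n"
    "hook: git@github.com:example-org/widgets.git\n"
    "other: https://github.com/someone-else/lib\n"
)

updated = rewrite_repo_urls(config, "example-org", "https://gitlab.example.com/example-org")
print(updated)
```

Review the result before committing it. SSH URLs are converted to HTTPS here, which may not suit pipelines that authenticate with deploy keys.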

You may also want to consider submitting a pull request to any projects that depend on your libraries to update the repository URLs. This ensures a smooth transition and minimizes disruptions to your development workflow.

Here are some GitHub alternatives to consider:

GitLab

GitLab is a robust web-based DevOps platform that provides a wide range of features for software development, including source code management, continuous integration, and continuous delivery. GitLab offers the flexibility of self-hosting through a free, open-source edition, meaning organizations can run it on their own servers. Being open-source at its core also allows for extensive customization. GitLab offers built-in CI/CD tools, enabling developers to build, test, and deploy code from within the platform itself. Features like issue tracking, code reviews, and a variety of pre-built project templates make it a comprehensive solution for software development teams.

SourceForge

One of the earliest platforms for version control, SourceForge offers Git, Mercurial, and Subversion repository support. It also provides a set of tools like issue tracking, project wikis, and forums. Unlike GitHub, SourceForge allows for the distribution of software directly through the platform, making it popular for open-source projects that need to distribute binary files. It provides some unique features like an integrated backup system and a vast library of open-source software available for download.

Bitbucket

Owned by Atlassian, Bitbucket tightly integrates with other Atlassian products like Jira, Confluence, and Trello. It offers Git repository management; it once supported Mercurial as well, but that support was discontinued in 2020. Bitbucket is often considered a go-to option for teams already invested in the Atlassian ecosystem. It provides private repositories for free and offers robust features like inline commenting to streamline code reviews. Bitbucket Pipelines, a CI/CD feature, allows for easy automation of the development workflow.

Moving Forward

GitHub, owned by Microsoft, is a key player in the AI and open-source communities. It is facing a class-action lawsuit alleging that its Copilot tool plagiarizes code from open-source programmers without consent or proper credit.

Legal action against GitHub, Microsoft, and OpenAI could have industry-wide implications. Copilot is touted as a productivity tool, but instead raises concerns about code security, compliance, and intellectual property rights. While some developers find Copilot useful, many have concerns about its implications and are seeking alternatives.

Just remember that leaving GitHub doesn’t make you any safer from having your code used to train its Copilot AI.
