Open Source Codeium Challenges GitHub Copilot, Strips Out Non-Permissive GPL Code
Free and open source Codeium has launched an assault on the front-running, for-pay GitHub Copilot tool in the coding assistant space.
Along with being free of OpenAI hegemony, a key selling point in that assault is that Codeium, while providing similar code-completion capabilities, does not emit code with non-permissive licensing such as GPL (General Public License). Even though the GPL license guarantees end users the four freedoms to run, study, share and modify software, it's described as a non-permissive license.
All that is explained in last Thursday's (April 20) blog post titled "GitHub Copilot Emits GPL. Codeium Does Not."
Basically, Codeium says permissive licenses (for example MIT, BSD and Apache) let people use code for commerce or any other reason, but non-permissive licenses such as GPL prohibit such usage without consent. Codeium, developed by the deep learning specialist company Exafunction, uses the MIT license. Exafunction's GitHub repos include code for using Codeium in Vim and Neovim, the Chrome browser, Emacs and more.
Last week's post discusses the legal ramifications of violating GPL licenses, regardless of intent, which is an area of software licensing that the Codeium team said has become muddled in the wake of startling new advancements in generative AI and large language models (LLMs). Those LLMs are the "secret sauce" behind the machine learning tech that powers generative AI constructs like ChatGPT and GPT-4 from Microsoft partner OpenAI, the clear leader in advanced AI.
The post states:
Clearly a developer copy-pasting GPL code without consent is bad and grounds for legal action, but what about a generative code model? Is it wrong for such a model to "learn" from this data? The argument to do so is clear -- GPL-licensed OSS is some of the highest quality code that is publicly available, and just like any machine learning model, better quality training data almost always means better quality LLMs. The argument to not do so is perhaps less clear -- researchers say LLMs rarely spit out training data verbatim unless interacted with adversarially, but theoretically, they could. In which case, who is responsible for this clear legal infringement? The developer of the LLM or the user who unknowingly ends up accepting the LLM's suggestions and committing the code to their team's codebase? Honestly, there is no clear answer, but that's the scary part -- no user or company should be subject to legal action, even potentially, just for using an AI code assistant tool.
GitHub Copilot is trained on GPL-licensed code, and GitHub uses filters to screen out potentially problematic non-permissive code. Codeium claims those filters don't work, noting that "we at Codeium have removed GPL licensed code from our training data, guaranteeing peace of mind to our users."
With the licensing angle fleshed out, a comparison of GitHub Copilot and Codeium turns to features and functionality. Here, Codeium rounded up salient points for its comparison and boiled them down into the graphic below.
As can be seen, besides being free, Codeium reportedly works in more IDEs and with more programming languages, while sporting similar code-generation functionality. The relative quality of that generated code, though, is measured subjectively. A comparison conducted by Codeium awarded both a 9/10 score, saying, "it appears that Github Copilot and Codeium had roughly similar consistency in addressing the goals across the tasks, with similar rates of manual intervention necessary."
That latter observation comes in a comparison among Codeium and three similar tools: GitHub Copilot, Replit and Tabnine. Unsurprisingly, Codeium comes out on top, with the team providing the following graphic:
In addition to code completion and related capabilities to explain, refactor and translate code, Codeium comes with search and chat functionality. Chat is the newest capability and is only available on the Codeium extension for Visual Studio Code.
With more than 66,000 installs, the tool promises:
- Unlimited single and multi-line code completions forever
- IDE-integrated chat: no need to leave VSCode to ChatGPT, and use convenient suggestions such as Refactor and Explain
- Support through our Discord Community
Codeium also comes in an enterprise offering, which is fully self-hosted and comes with additional features including local personalization on private repositories, with the team noting that enterprises often have stricter requirements for data handling and security than do individual developers. However, the enterprise offering only includes code completion, not the newer search and chat functionality. The enterprise offering is priced per seat, with exact pricing dependent on the size of an organization and any custom needs.
"We are committed to keep improving our data sanitization and filtering processes as well as maintaining a fresh training dataset (with up-to-date license metadata)," Codeium said last week. "We're also going to be taking this approach to remove potentially insecure code practices from our training data. This is possible because we are one of the very few companies that are building AI applications in a fully integrated manner independent of OpenAI -- the training, the models, the serving, the integrations, and the product."
About the Author
David Ramel is an editor and writer for Converge360.