
GitHub Copilot (and Codex) Explained - OpenAI's English to Code Generator Model

Science & Technology

You've probably heard of GitHub's recent Copilot tool, which generates code for you. You can think of it as an autocomplete++ for code: you give it the name of a function along with some additional context, and it generates the code for you quite accurately. However, it won't just autocomplete your function the way classical tools do; it tries to understand what you are trying to do, and it can generate much bigger and more complete functions than classical autocomplete tools. It can do this because it uses a model similar to GPT-3, an extremely powerful natural language model.
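
To make this more concrete, here is a rough, hand-written sketch of what a Copilot-style prompt and completion look like in Python: you write the signature and a short docstring, and the model fills in the body. The function below and its body are illustrative, not actual Copilot output.

```python
# The "prompt" a Copilot-style model sees: a signature plus a docstring.
def days_between(date_a: str, date_b: str) -> int:
    """Return the absolute number of days between two ISO dates (YYYY-MM-DD)."""
    # Everything below is the kind of completion the model is asked to produce.
    from datetime import date
    a = date.fromisoformat(date_a)
    b = date.fromisoformat(date_b)
    return abs((a - b).days)

print(days_between("2021-06-29", "2021-07-04"))  # 5
```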

GPT-3 Foundation

GPT-3 is a language model, so it was trained on natural human language rather than code. If you try to generate code with the vanilla GPT-3 model from OpenAI's API, it won't work. In fact, in the paper released alongside GitHub Copilot, OpenAI tested GPT-3 without any further training on code, and it solved exactly zero of their Python code-writing problems.
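
What counts as "solving" a code-writing problem here? In that paper, a generated solution counts only if it passes a set of unit tests. Below is a minimal sketch of that functional-correctness check; the candidate solution and tests are made up for illustration.

```python
# A generated candidate is counted as correct only if it passes every test.
candidate = '''
def is_palindrome(text):
    text = text.lower()
    return text == text[::-1]
'''

tests = [("Racecar", True), ("hello", False), ("", True)]

namespace = {}
exec(candidate, namespace)  # load the generated function into a fresh namespace
solved = all(namespace["is_palindrome"](arg) == expected for arg, expected in tests)
print("solved" if solved else "failed")  # solved
```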

Adaptation for Code Generation

So how did they take such a powerful language generation model, one completely useless for code generation, and transform it to fit this new task? The first part is easy: the model has to understand what the user wants, which GPT-3 is already pretty good at. The second part is harder, since GPT-3 had never seen code before, or at least not much of it: it was trained on text from pretty much the whole internet, not on source code.

So OpenAI and GitHub set out to build a similar model, but for code generation. They started from GPT-3, using a very similar architecture, and then tackled the second part of the problem, generating code, by training this GPT model on billions of lines of publicly available code on GitHub instead of random text from the internet. The power of GPT-3 comes largely from the sheer amount of information it can learn from, so doing the same thing but specializing in code would certainly yield some amazing results.
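
As a rough illustration of that idea, here is a minimal sketch of continuing to train a pretrained GPT-style model on source code instead of prose. It uses GPT-2 from the Hugging Face transformers library as a tiny stand-in; OpenAI's actual model, data pipeline, and scale are of course far beyond this toy loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small pretrained language model standing in for GPT-3.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A tiny stand-in for "billions of lines of public code".
code_corpus = [
    "def add(a, b):\n    return a + b\n",
    "for i in range(10):\n    print(i * i)\n",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for snippet in code_corpus:
    batch = tokenizer(snippet, return_tensors="pt")
    # Causal language modeling: predict each token from the tokens before it.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```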

Training on Public Repositories

More precisely, they trained this adapted GPT model on code from 54 million public software repositories hosted on GitHub. The result is a huge model trained on a lot of code examples. The problem is that you can't really know for sure this code works and is well written, since it is sampled more or less indiscriminately from GitHub, and that can cause a lot of issues.
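
The Codex paper describes mitigating this with simple heuristics at data-collection time, filtering out files that were likely auto-generated or badly formatted. The sketch below shows that kind of filter; the exact thresholds are illustrative, not OpenAI's actual values.

```python
# Drop source files that are huge, have extremely long lines, or look
# auto-generated. All thresholds here are assumptions for illustration.
def keep_source_file(text: str, max_bytes: int = 1_000_000) -> bool:
    if len(text.encode("utf-8")) > max_bytes:
        return False
    lines = text.splitlines() or [""]
    if max(len(line) for line in lines) > 1000:
        return False
    if sum(len(line) for line in lines) / len(lines) > 100:
        return False
    if "auto-generated" in text.lower():
        return False
    return True

print(keep_source_file("def hello():\n    print('hi')\n"))  # True
```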

Fine-Tuning for Enhanced Performance

A great way they found to further improve the model's coding skills was to fine-tune it on code from competitive programming websites and from repositories with continuous integration. This code is most likely correct and well written, albeit available in much smaller quantities. They fine-tuned the model on this new dataset in a supervised way, training the model a second time on a smaller and more specific dataset of curated examples.

Fine-tuning is a powerful technique often used to adapt a model to a specific need. Instead of training a new model from scratch on a small amount of curated data, you start from a model trained on far more data, even if that data isn't directly useful for your task, and then adapt it to the task at hand. When it comes to data and deep learning, more is usually better.

Limitations and Licensing Concerns

The best versions of this model are what power GitHub Copilot and the Codex models in the OpenAI API. Of course, Copilot is not perfect yet and has many limitations. It won't replace programmers anytime soon, but it has shown amazing results and can speed up the work of many programmers by writing simple but tedious functions and classes.
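
For reference, the Codex side of this was exposed through OpenAI's API as a text-completion endpoint. The sketch below uses the older, pre-1.0 openai Python client and the engine name from the original Codex release; the client interface and available models have changed since, so treat it purely as an illustration.

```python
import openai  # legacy (pre-1.0) client interface

openai.api_key = "YOUR_API_KEY"  # placeholder, not a real key

# Ask a Codex-style engine to complete a Python function from its docstring.
response = openai.Completion.create(
    engine="davinci-codex",  # engine name from the original Codex release
    prompt='def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n',
    max_tokens=64,
    temperature=0,
)
print(response["choices"][0]["text"])
```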

They trained the Copilot model on billions of lines of public code, regardless of license. Since it was made in collaboration with OpenAI, they will of course sell this product. It's perfectly fine that they want to make money from a powerful tool they built, but it may cause complications when that tool was built using your code released under restrictive licenses.


Keywords

  • GitHub Copilot
  • Code Generation
  • OpenAI
  • GPT-3
  • Machine Learning
  • Public Repositories
  • Fine-Tuning
  • Licensing Concerns
  • Competitive Programming

FAQ

  1. What is GitHub Copilot?

    • GitHub Copilot is a tool built by GitHub and OpenAI that generates code for you: you give it a function name and some additional context, and a GPT-style model trained on public code completes the function.
  2. How does Copilot differ from classical autocomplete tools?

    • Unlike traditional autocomplete tools, Copilot tries to understand what you're attempting to do and can generate much bigger and more complete functions.
  3. What is GPT-3?

    • GPT-3 is a powerful natural language model developed by OpenAI that was initially trained on a vast amount of text data from the internet.
  4. How was GPT-3 adapted for code generation?

    • GPT-3 was adapted for code generation by training it on billions of lines of publicly available code from GitHub instead of random text.
  5. What data was used to train the adapted GPT model?

    • The adapted GPT model was trained on 54 million public software repositories hosted on GitHub.
  6. What is fine-tuning and how was it used in Copilot's development?

    • Fine-tuning involves training a model a second time on a smaller, more specific dataset of curated examples. This technique was used to enhance the coding skills of the Copilot model by training it on code from competitive programming websites and repositories with continuous integration.
  7. What are the limitations of GitHub Copilot?

    • Copilot has limitations and is not perfect yet. It won't replace programmers anytime soon but can speed up the coding process for simpler functions.
  8. What are the licensing concerns associated with Copilot?

    • Copilot was trained on billions of lines of public code regardless of license, which may cause complications because code released under restrictive licenses was used to build a commercial product.