IBM’s Artificial Intelligence Research Division has presented a dataset of 14 million code samples for developing machine learning models that can aid in programming tasks. The dataset, dubbed Project CodeNet, takes its name from ImageNet, the famous repository of labeled photos that revolutionized computer vision and deep learning. VentureBeat reports.
Programmers discover new problems and explore different solutions using many mechanisms of conscious and subconscious thinking. In contrast, most machine learning algorithms require well-defined tasks and large amounts of annotated data to develop models that can solve the same problems.
The expert community has put much effort into building datasets and benchmarks for developing and evaluating AI for Code systems. But given the creative and open-ended nature of software development, it is very difficult to create the perfect dataset for programming.
Using Project CodeNet, IBM researchers tried to create a multipurpose dataset that can be used to train machine learning models on a variety of tasks. The creators of CodeNet describe it as “a very large-scale, diverse and high-quality dataset to accelerate algorithmic advances in artificial intelligence for code.”
The dataset contains 14 million code examples with 500 million lines of code, written in 55 different programming languages. The samples were drawn from submissions to nearly 4,000 problems hosted on the online coding platforms AIZU and AtCoder. They include both correct and incorrect answers to the given tasks.
One of the key features of CodeNet is the amount of annotation added to the examples. Each of the coding tasks included in the dataset has a textual description as well as CPU time and memory limits. Each code submission carries a dozen pieces of information, including language, submission date, size, execution time, acceptance status, and error type.
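To make the shape of this metadata concrete, here is a minimal Python sketch of how one submission record might be modeled and tallied. The class name, field names, and units are illustrative assumptions based only on the fields listed above; they are not CodeNet's actual schema or file layout.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Submission:
    """One code submission plus its metadata (hypothetical field names)."""
    problem_id: str
    language: str
    submission_date: str
    code_size_bytes: int
    cpu_time_ms: int
    accepted: bool
    error_type: Optional[str] = None  # e.g. "Wrong Answer"; None if accepted


def count_by_language_and_status(submissions: Iterable[Submission]) -> Counter:
    """Count submissions per (language, status) pair."""
    counts = Counter()
    for sub in submissions:
        # Rejected submissions are grouped by their error type when one is given.
        status = "accepted" if sub.accepted else (sub.error_type or "rejected")
        counts[(sub.language, status)] += 1
    return counts
```

A tally like this is the kind of summary one would inspect when checking how evenly a sample is spread across languages and outcomes.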
Researchers at IBM also went to great lengths to balance the dataset along several dimensions, including programming language, acceptance status, and error type.
CodeNet isn’t the only dataset for training machine learning models on programming problems. But several characteristics make it stand out. The first is the sheer size of the dataset, both in the number of samples and in the variety of languages.
But perhaps more important is the metadata that comes with the code samples. The rich annotations added to CodeNet make it suitable for a diverse set of tasks, unlike other coding datasets that specialize in specific programming tasks.
There are several ways to use CodeNet to develop machine learning models for programming tasks. One of them is language translation. Because each coding task in the dataset includes submissions in several programming languages, data scientists can use it to create machine learning models that translate code from one language to another. This can be useful for organizations looking to port old code to new languages and make it accessible to new generations of programmers.
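As a rough illustration of the translation use case, the Python sketch below pairs accepted submissions to the same problem across two languages, which is one plausible way to assemble training pairs. The function name and the tuple layout of the input are assumptions made for this example, not part of CodeNet's tooling. Note also that two accepted solutions to one problem are only functionally equivalent; they are not line-by-line translations of each other.

```python
from collections import defaultdict


def build_translation_pairs(submissions, source_lang, target_lang):
    """Pair accepted solutions to the same problem across two languages.

    `submissions` is an iterable of (problem_id, language, accepted, code)
    tuples; this layout is a hypothetical simplification.
    Returns a list of (problem_id, source_code, target_code) tuples.
    """
    # Group accepted code by problem, then by language.
    by_problem = defaultdict(lambda: defaultdict(list))
    for problem_id, language, accepted, code in submissions:
        if accepted:
            by_problem[problem_id][language].append(code)

    pairs = []
    for problem_id in sorted(by_problem):
        by_language = by_problem[problem_id]
        # .get avoids creating empty entries for languages with no solutions.
        for source_code in by_language.get(source_lang, []):
            for target_code in by_language.get(target_lang, []):
                pairs.append((problem_id, source_code, target_code))
    return pairs
```

Restricting the pairs to accepted submissions keeps known-incorrect code out of the training set; the rejected submissions could instead serve other tasks, such as learning to spot or repair errors.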