GitHub Copilot Chat Tackles Java 'One Billion Row Challenge'
Microsoft's Antonio Goncalves put the advanced GitHub Copilot Chat AI tool to work in a coding challenge, and he was impressed with the results.
The One Billion Row Challenge (1BRC) is a Java programming challenge announced early this year by Gunnar Morling that involves processing a text file with 1 billion rows, calculating the min, mean and max temperature value per weather station, and displaying the results sorted alphabetically by station. The goal was to create the fastest implementation. Goncalves took a stab at the challenge to see how Copilot Chat could help. While the Copilot and Copilot Chat tools work with Visual Studio 2022 and Visual Studio Code, he chose a non-Microsoft IDE that is also supported: IntelliJ IDEA from JetBrains.
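For readers unfamiliar with the task, it can be sketched in a few lines of Java. This is a minimal, unoptimized illustration, not Morling's baseline or Goncalves' entry. It assumes each row has the form station;temperature, and the names BaselineSketch, Stats and aggregate are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class BaselineSketch {
    // Running statistics for one station (hypothetical helper type).
    record Stats(double min, double max, double sum, long count) {
        Stats add(double v) {
            return new Stats(Math.min(min, v), Math.max(max, v), sum + v, count + 1);
        }
        double mean() { return sum / count; }
    }

    // Aggregates rows into a TreeMap, so stations come out sorted alphabetically.
    // Assumes every row looks like "station;temperature".
    static Map<String, Stats> aggregate(Stream<String> lines) {
        Map<String, Stats> result = new TreeMap<>();
        lines.forEach(line -> {
            int sep = line.indexOf(';');
            String station = line.substring(0, sep);
            double value = Double.parseDouble(line.substring(sep + 1));
            // First reading creates the entry; later readings update it.
            result.merge(station, new Stats(value, value, value, 1),
                         (old, fresh) -> old.add(value));
        });
        return result;
    }

    public static void main(String[] args) {
        Map<String, Stats> stats = aggregate(Stream.of(
            "Hamburg;12.0", "Bulawayo;8.9", "Hamburg;34.2", "Bulawayo;9.1"));
        stats.forEach((station, s) -> System.out.printf(
            "%s=%.1f/%.1f/%.1f%n", station, s.min(), s.mean(), s.max()));
    }
}
```

A single-threaded pass like this is easy to read but leaves most of the machine idle, which is where the optimizations discussed in the rest of the article come in.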
The base algorithm that challenge creator Morling provided to start with took 4 minutes and 50 seconds to run. Some developers actually managed to process 1 billion rows of data in less than 2 seconds, with the top entry being 00:01.535 (in the minutes:seconds.milliseconds format).
Goncalves, a principal software engineer in Microsoft's Developer Division, didn't come close to two seconds, but he was nevertheless impressed that his code took less than 60 seconds to run on his M1 Mac running macOS Sonoma with 8 cores and 64 GB of RAM. So he created a pull request on the 1BRC repository, and Morling merged it. His code ran slower on the target platform (a Hetzner AX161 server with eight cores): 1 minute and 9 seconds. He was disappointed that it ran slower than it did on his machine, but pleased with the overall experience and performance.
"My algorithm is indeed slower than the top ones listed on the leader board," he said in a March 7 blog post. "But it only took me a couple of hours to write, and the code produced by GitHub Copilot is easy to read and to understand ... and still 4 times faster than the baseline."
He summarized his multi-step process that included optimizing the algorithm and the JVM (Java Virtual Machine) itself. And he provided feedback on using GitHub Copilot Chat with a dialogue over several hours that effectively maintained the context of the conversation, providing relevant suggestions and solutions based on the current coding challenge context:
During all the process I was the one in charge of the code. I was the one who decided to accept or reject the suggestions made by GitHub Copilot Chat, or to use a profiler or not. Sometimes GitHub Copilot would give me a suggestion that I would reject because I knew it would not improve the code. Sometimes I would just take control of the code and change it directly in the IDE. Sometimes I would impose my choices to GitHub Copilot (e.g. Being written in Java 21, please use records instead of classes). Sometimes GitHub Copilot gave me a suggestion that I knew wouldn't improve the code, so I rejected it with a thumbs down (which helps Copilot provide better responses in the future).
He also noted the fastest algorithms used different low-level techniques:
- Partitioning the file into ranges equal to the number of available processors
- Extracting and storing the weather station names using sun.misc.Unsafe as sequences of integers
- Using parallelism, branchless code and implementing SWAR (SIMD Within A Register)
- Implementing their own “very simple” HashMap backed by an array
- Creating code without branches and instead performing a few complex arithmetic and bit operations
- Compiling Java into native code using GraalVM
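To give a flavor of these techniques, here is a minimal sketch of the SWAR idea: testing eight bytes at once for a delimiter using plain arithmetic on one long, with no per-byte branch. It is not taken from any leaderboard entry. The ';' delimiter is an assumption about the row format, and the names SwarSketch, loadWord and firstSemicolon are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SwarSketch {
    private static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' repeated in all 8 bytes
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Reads 8 bytes as one little-endian long, so the first byte of text is the lowest byte.
    static long loadWord(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }

    // Returns the index (0-7) of the first ';' in the word, or -1 if there is none.
    static int firstSemicolon(long word) {
        long x = word ^ SEMICOLONS;               // bytes equal to ';' become 0x00
        long match = (x - ONES) & ~x & HIGH_BITS; // lowest set bit sits in the first zero byte
        return match == 0 ? -1 : Long.numberOfTrailingZeros(match) >>> 3;
    }

    public static void main(String[] args) {
        byte[] row = "Oslo;1.5".getBytes(StandardCharsets.US_ASCII);
        System.out.println(firstSemicolon(loadWord(row, 0))); // prints 4
    }
}
```

The same pattern extends to scanning a whole buffer eight bytes at a time. A real implementation would also have to handle the final partial word, which this sketch leaves out.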
More about the challenge can be found in its 1brc GitHub repo.
About the Author
David Ramel is an editor and writer at Converge 360.