AI-TDD: you write tests, AI generates the code
In Test-Driven Development (TDD), you begin by writing tests for the code you wish you had. Only then do you start writing the code itself. The tests keep you in check, verifying whether you did what you set out to do.
GitHub Copilot has been available for some time, but it has not resonated with me so far. It just generates code that I then have to read, comprehend, and evaluate, and reviewing code is usually the more challenging part for me.
But what if I could write tests and have an AI generate code that passes them? There’s a dirty TDD approach that involves writing tests, locating suitable code on Stack Overflow, integrating it, fixing errors, and moving on. The focus is on well-crafted tests rather than the code itself. While I’m not entirely fond of this approach, it’s suitable for quick, disposable projects.
I experimented with ChatGPT and similar tools, but the results were underwhelming. However, with the recent surge in GPT-4’s use for recursive autonomous processes and indications that it outperforms GPT-3.5 in code generation quality, I was keen to revisit the concept. This in particular looked intriguing.
Just imagine. You write down tests. You ask the AI to generate code that passes them. You run the tests. You pass the errors back to the AI and leave it be for the night. The tests keep the AI in check. Then you ask it to refine and restyle the code until it’s the code you wish you had written yourself :D
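Sketched as code, the loop I have in mind looks roughly like this. It’s a hypothetical sketch, not something I actually ran: `askModel` is a stand-in for whatever chat API you use, and only the file write and the test run are real Node.

```typescript
// Hypothetical sketch of the AI-TDD loop. `askModel` is a placeholder, not a real API;
// only writeFileSync and the `npm test` run are real Node.js calls.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

declare function askModel(prompt: string): Promise<string>; // placeholder for your chat API

async function aiTdd(testSource: string, maxAttempts = 10): Promise<boolean> {
  let prompt = `Write a TypeScript module that makes these tests pass:\n${testSource}`;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    writeFileSync("src/solution.ts", await askModel(prompt));
    try {
      execSync("npm test", { stdio: "pipe" }); // the tests keep the AI in check
      return true;                             // green: we are done
    } catch (err: any) {
      // red: feed the failure output back and let it try again
      prompt = `The tests failed with:\n${err.stdout?.toString()}\nPlease fix the code.`;
    }
  }
  return false;
}
```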
Today I decided to upgrade to ChatGPT Plus to get access to GPT-4 and make it pass the tests. Here is the whole conversation. And read on for how it looked for me as I was playing around. It was a quick and lazy test, but it surprised me multiple times.
Part one: AI was right, my tests were wrong
I began with a simple, easily testable task: I wrote tests for the roots of a quadratic function and asked GPT-4 to generate a function that satisfies them. I mostly avoided mentioning that it was a quadratic function, but designed the tests to give GPT-4 hints upon failure. Surprisingly, GPT-4 produced correct code and a clear explanation on the first try. However, the tests failed when executed, revealing errors in my own tests, including typos, incorrect arguments, and floating-point imprecision. I asked GPT-4 to rectify these issues, and it successfully fixed my tests while providing insightful explanations of the errors.
What surprised me the most was how quickly it understood that one of the problems was not handling floating-point imprecision: the roots often did not evaluate to exactly 0 but to very small numbers.
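For illustration, this is roughly the shape my tests ended up in after GPT-4’s corrections (reconstructed from memory, not the exact code from the conversation): a Jest-style assertion against a tolerance instead of expecting the polynomial to be exactly 0 at the root. `findRoots` is the function GPT-4 was asked to write.

```typescript
import { test, expect } from "@jest/globals"; // Jest-style test runner assumed

// Placeholder for the function GPT-4 was asked to generate.
declare function findRoots(a: number, b: number, c: number): number[];

function quadratic(a: number, b: number, c: number, x: number): number {
  return a * x * x + b * x + c;
}

test("returned roots make the function (almost) zero", () => {
  const [a, b, c] = [1, -3, 2]; // roots at 1 and 2
  for (const root of findRoots(a, b, c)) {
    // Floating-point imprecision: the value is a very small number, not exactly 0.
    expect(Math.abs(quadratic(a, b, c, root))).toBeLessThan(1e-9);
  }
});
```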
Part two: let’s try cubic polynomial roots
I rewrote my code to deal with a 3rd-order polynomial. From what I knew, this becomes more complex (accidental pun). I did not mention anywhere in the code that it was now cubic. I even forgot, by accident, to change the generated messages, so they continued to claim it was a quadratic function.
Even so, it understood on the first try that it was now a cubic polynomial. It explained why this case is more complex and proposed a numerical way to solve it. That surprised me in a bad way, considering there is an analytical solution. It did warn that this approach may fail, may not find all the roots, and so on.
Running the code, the tests passed, but they often showed what looked like the same root repeated, like this: [0.3921288946782216, 0.39212889467818457, 0.39212889467818457].
Knowing how numerical methods work, I realised that they were probably converging to the same root; later, though, I remembered that roots can sometimes genuinely coincide, so I’m not sure what was happening here. At the time I asked GPT-4 about it, again with zero hints at a solution, just asking as if I knew nothing and was confused by the results.
At first it improved the numerical method and fixed my incorrect error messages. Then it proposed a rewrite using Cardano’s analytical formula, the solution I had been expecting from the start. It was still interesting that it began with a numerical method, though in my view that was not the best solution for this problem.
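For reference, here is a minimal sketch of the kind of analytical solution it ended up proposing, assuming Cardano’s method for the real roots (my own reconstruction, not GPT-4’s exact output):

```typescript
// Sketch of Cardano's method for the real roots of a*x^3 + b*x^2 + c*x + d = 0.
function cubicRealRoots(a: number, b: number, c: number, d: number): number[] {
  // Reduce to a depressed cubic t^3 + p*t + q = 0 via the substitution x = t - b / (3a).
  const p = (3 * a * c - b * b) / (3 * a * a);
  const q = (2 * b ** 3 - 9 * a * b * c + 27 * a * a * d) / (27 * a ** 3);
  const shift = -b / (3 * a);
  const discriminant = -(4 * p ** 3 + 27 * q * q);

  if (discriminant > 0) {
    // Three distinct real roots: the trigonometric form of Cardano's solution (p < 0 here).
    const m = 2 * Math.sqrt(-p / 3);
    const theta = Math.acos((3 * q) / (p * m)) / 3;
    return [0, 1, 2].map(k => m * Math.cos(theta - (2 * Math.PI * k) / 3) + shift);
  }

  // One real root (the other two are complex): the classic Cardano formula.
  const r = Math.sqrt((q * q) / 4 + p ** 3 / 27);
  return [Math.cbrt(-q / 2 + r) + Math.cbrt(-q / 2 - r) + shift];
}
```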
Part three: evaluate it as if it’s a code interview, improve it, and rewrite it in TypeScript
Then I asked GPT-4 to judge the result as if it were a code interview task and GPT-4 was evaluating a senior engineer. The response was informative, but also generic.
I then asked it to rewrite the code. Unfortunately, in the rewrite it lost my original error messages that had prompted its improvements. However, I appreciated that it added a loop to test the code multiple times and explained its reasoning.
When I asked the AI to rewrite the code in TypeScript, it began to struggle. It could not finish its response in one go, requiring prompts to continue, and the formatting of its answers started to break. This is where non-programmers may struggle with ChatGPT-like tools, as it takes some knowledge to merge the smaller pieces it changes back into a larger code base. Nevertheless, it did generate correct TypeScript code. There was only one error, and after consulting the AI, I received a clear explanation and a solution. After implementing the fix, everything worked smoothly. Feel free to try the code in the TypeScript playground yourself.
Part four: it really blew my mind with a complex situation
When you run the TypeScript code, you may notice that it often returns a single root like this: “Roots: 1.3893702877926135,,”.
I asked ChatGPT about this issue, pretending to be unaware of the cause. It explained that some of the roots are sometimes complex, and that both plain JavaScript numbers and this implementation of Cardano’s method only handle real roots. That’s good to know. Next, I asked it to modify the code to handle complex roots as well.
To my surprise, ChatGPT suggested installing the complex.js library, which helps with handling complex numbers in JavaScript. To test whether the code it produced actually worked, I moved from the TypeScript playground to CodeSandbox, where I could install npm modules and run the changes.
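To give an idea of what the complex-root handling boils down to, here is a minimal sketch assuming the complex.js constructor `new Complex(re, im)` (my reconstruction, not ChatGPT’s exact code). When the discriminant of the depressed cubic t³ + pt + q = 0 is negative, Cardano gives one real root u + v and the conjugate pair -(u + v)/2 ± i·(√3/2)(u - v).

```typescript
import Complex from "complex.js";

// Sketch: roots of the depressed cubic t^3 + p*t + q = 0 when only one root is real.
// u and v are the real cube roots from Cardano's formula.
function complexCubicRoots(p: number, q: number): Complex[] {
  const r = Math.sqrt((q * q) / 4 + p ** 3 / 27); // assumes 4p^3 + 27q^2 > 0
  const u = Math.cbrt(-q / 2 + r);
  const v = Math.cbrt(-q / 2 - r);
  const re = -(u + v) / 2;
  const im = (Math.sqrt(3) / 2) * (u - v);
  return [
    new Complex(u + v, 0), // the single real root
    new Complex(re, im),   // complex conjugate pair
    new Complex(re, -im),
  ];
}

// Example: t^3 + t + 1 = 0 has one real root and two complex ones.
console.log(complexCubicRoots(1, 1).map(z => z.toString()).join(", "));
```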
Admittedly, it became slightly more challenging to implement GPT-4’s recommendations, as the code had grown too large to work with in a single attempt. This meant that ChatGPT provided only segments of code for modification. However, it continued to perform its task efficiently and quickly. And what is more surprising, it worked: it was outputting the roots. But I was starting to lose track of what I was even doing here :D It went beyond what I was expecting.
Part five: going deeper into the rabbit hole by asking it to make a UI
I opened a new CodeSandbox with React, copied an empty React component code into the chat, and asked ChatGPT to create a user interface. It did just that.
I then requested it to improve the styling, and it promptly generated some CSS for me as well.
While integrating ChatGPT’s suggestions does require some knowledge to place the code in the correct locations, it has been remarkably efficient. The solutions work on the first or second attempt, which is impressive considering I had anticipated more iterations for it to get things right.
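For a rough idea of the result, the component ended up looking something like this. It’s a simplified reconstruction rather than ChatGPT’s exact output, with the solver imported from the earlier sketch via a hypothetical module path.

```tsx
import React, { useState } from "react";
import { cubicRealRoots } from "./cubic"; // the solver sketched earlier (hypothetical path)

// Simplified reconstruction of the UI: one input per coefficient, roots rendered below.
export default function CubicSolver() {
  const [coeffs, setCoeffs] = useState({ a: 1, b: 0, c: 1, d: 1 });
  const roots = cubicRealRoots(coeffs.a, coeffs.b, coeffs.c, coeffs.d);

  return (
    <div>
      {(["a", "b", "c", "d"] as const).map(k => (
        <label key={k}>
          {k}:
          <input
            type="number"
            value={coeffs[k]}
            onChange={e => setCoeffs({ ...coeffs, [k]: Number(e.target.value) })}
          />
        </label>
      ))}
      <p>Roots: {roots.map(r => r.toFixed(4)).join(", ")}</p>
    </div>
  );
}
```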
Part six: where it draws plots and starts to fail
I decided to push the boundaries even further by asking ChatGPT to create a plot of the function. At this point, it suggested adding the plotly.js library along with a React component to display the plot. It provided only the segments that needed modification.
After implementing the changes, I encountered some errors. Upon examining the code, I realised that ChatGPT was losing awareness of the code’s structure due to its size. The code was too large to paste back into the chat, and ChatGPT’s output consisted solely of segments. One issue I encountered was that ChatGPT lost track of the scope in which certain referenced elements existed. It proved challenging to explain the problem to ChatGPT, and after a few attempts, I decided it would be more efficient to fix the issue myself.
Interestingly, this situation gives you an additional reason to keep code small, modular, and with limited access to scope, as it allows ChatGPT to comprehend the entire code structure more easily. The catch is that you need to do that structuring yourself, without ChatGPT!
It took me several iterations to correctly display the plot and the roots, despite having no prior knowledge of the plotly.js library. Here’s how the final result looks, and you can try it out online.
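In case you want to reproduce it, the plotting part boils down to something like the following, assuming the react-plotly.js wrapper around plotly.js (again a reconstruction, not ChatGPT’s literal output):

```tsx
import React from "react";
import Plot from "react-plotly.js"; // assumes the react-plotly.js wrapper around plotly.js

// Sketch: plot the cubic curve and mark its real roots on the x-axis.
export function CubicPlot(props: { a: number; b: number; c: number; d: number; roots: number[] }) {
  const { a, b, c, d, roots } = props;
  const xs = Array.from({ length: 201 }, (_, i) => -5 + i * 0.05);
  const f = (x: number) => a * x ** 3 + b * x ** 2 + c * x + d;

  return (
    <Plot
      data={[
        { x: xs, y: xs.map(f), type: "scatter", mode: "lines", name: "f(x)" },
        { x: roots, y: roots.map(() => 0), type: "scatter", mode: "markers", name: "roots" },
      ]}
      layout={{ title: "Cubic function and its real roots" }}
    />
  );
}
```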
Takeaways
- If implemented correctly, ChatGPT can undoubtedly be used to develop functions and iterate alongside a unit test system. We can expect to see GitHub bots that autonomously iterate on Pull Requests later this year!
- I experienced no hallucinations while using GPT-4. In comparison, GPT-3.5 tends to hallucinate when used in a similar manner.
- ChatGPT was able to quickly understand and explain what was happening, three times faster than I initially anticipated.
- It introduced me to new libraries and allowed me to complete an app in just an hour, which otherwise would have taken me around four hours. Without ChatGPT, I might not have even attempted complex root handling.
Overall, ChatGPT’s capabilities are incredibly impressive, and it’s bound to bring significant changes to the way we develop software. It could also serve as an excellent mentor and guide for junior developers learning to write code.
However, ChatGPT does struggle with larger problems. I suspect that a workflow could be developed where it first creates a high-level architecture, including APIs and their expected outcomes. It could then write tests for them before generating implementations in isolation. It may still struggle with integrating the pieces into a large-scale system, but it’s unclear whether fixing issues at that level would be prohibitively costly.
Considering the global salary range for software developers, which spans from $50 to $1000 per hour, one could argue that it is more cost-effective to run ChatGPT for an hour than to employ a software developer. We may already have systems capable of writing software; they just haven’t been widely distributed and made accessible to everyone yet.
What’s next for me
I will definitely be experimenting more with all of this. One idea I’d like to try involves having my wife, who works as a school assistant, transform children’s homework into small games that generate personalized challenges. We’ve always wanted to try something like this but lacked the time. Now, ChatGPT can help.
Furthermore, I plan to explore more ways to use ChatGPT productively at work. I have a feeling that, with the right workflows, it could significantly accelerate development in certain situations, potentially by multiple orders of magnitude.
What are your thoughts on this?