CrosswordGPT: Pair Programming with GPT-4


Now that GPT-4 has been released and a lot of incredible demos are coming out, I wanted to give it a try with a complex project. I settled on a React-based crossword puzzle because there are a few moving parts and some tricky state to manage. Ultimately, I (we?) built this:

The experience itself was fascinating. There were times when GPT-4 felt magical, with limitless potential, and times when it was just frustrating to work with.

In the end, it took me 3 attempts to get it to generate this application. With each attempt, I changed how I was interacting with it, and the best results came when I worked alongside it instead of expecting it to do everything for me.

End to end, this took about 3-4 hours to build, including my failed attempts. Let’s look at how it went!

Generating the Clues/Answers

I mostly wanted to focus on having GPT-4 build the web application, but a nice perk of having a language model is that it can generate the clues and answers too. I gave it some example crossword puzzle answers, asked it if it understood, then asked it to generate some more:

I’m the founder of an authentication company, which is why I chose that topic, but GPT-4 was pretty great at most topics I gave it. After that, I asked it for another 30 clues.

Were all the clues and answers great?

…no, but I was impressed with how well it did given how short of a prompt I gave it.

I took the answers, passed them into a Python script to place them onto a 13x13 grid, converted it to JSON, and ended up with something that looks like this:

{
    "cols": 13,
    "rows": 13,
    "answers": [{
        "col": 1,
        "row": 1,
        "down_or_across": "across",
        "clue": "The process of ensuring that the person is who they claim to be.",
        "answer": "verification",
        "number": 1
    }, {
        "col": 5,
        "row": 1,
        "down_or_across": "down",
        "clue": "A method of authentication that involves scanning an individual's unique physical feature.",
        "answer": "fingerprint",
        "number": 3
    }, ...]
}

And then I had everything I needed to start building my application.

Generating a crossword puzzle application with GPT-4

Attempt 1

You are a developer building a crossword puzzle web application. You should use React. The application itself should include a grid that is hard coded to a specific puzzle. An example puzzle looks like this:

… puzzle JSON …

The grid should display blank cells for spaces where an answer is located and should display a black square if there is no answer.

Could you generate the React code for this web application?

As you’ll see, this prompt could’ve been better, but GPT-4 ran with it.

Including CSS:

And it looked good!
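The component it generated isn’t reproduced here, but the core of it was a grid renderer along these lines. This is a simplified sketch based on the puzzle JSON above, not GPT-4’s exact output; names like buildGrid are mine and the CSS is omitted:

// Build a rows x cols matrix of letters (null for black squares) from the
// puzzle JSON, then render a white cell where there's a letter and a black
// cell where there isn't.
function buildGrid({ rows, cols, answers }) {
  const grid = Array.from({ length: rows }, () => Array(cols).fill(null));
  for (const { row, col, down_or_across, answer } of answers) {
    [...answer].forEach((letter, i) => {
      const r = down_or_across === 'down' ? row - 1 + i : row - 1;
      const c = down_or_across === 'across' ? col - 1 + i : col - 1;
      grid[r][c] = letter;
    });
  }
  return grid;
}

function Crossword({ puzzle }) {
  const grid = buildGrid(puzzle);
  return (
    <div className="grid">
      {grid.map((rowCells, r) => (
        <div className="row" key={r}>
          {rowCells.map((letter, c) => (
            <div key={c} className={letter ? 'cell' : 'cell black'} />
          ))}
        </div>
      ))}
    </div>
  );
}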

I started asking for more and more features:

Could you add a list of clues split into sections for Across / Down on the left of the puzzle?

Could you change the CSS so the clues and crossword take up equal space?

Could you also make it so the crossword is always a square?

Great! Now let's make it interactive. I want the user to be able to click on the cells and type their answers. When they click a cell once, it'll default to typing across. When they click a second time, it should switch to typing down. Then, the clue for the answer they are actively filling in should be highlighted.

And this quickly led to a lot of complexity in the one single component it was updating:
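The full component isn’t worth pasting here, but to give a sense of the sprawl, it was juggling roughly this kind of state. This is a sketch with illustrative names, not the generated code itself:

import { useState } from 'react';

// Everything lived in one component: grid contents, selection,
// typing direction, and the currently active clue.
function Crossword({ puzzle }) {
  const [gridValues, setGridValues] = useState({});        // user-typed letters, keyed by "row-col"
  const [selectedCell, setSelectedCell] = useState(null);  // { row, col } or null
  const [typingDirection, setTypingDirection] = useState('across');
  const [activeClue, setActiveClue] = useState(null);

  // ...plus click handlers, keyboard handlers, clue highlighting,
  // and all of the rendering for both the grid and the clue list.
  return null; // rendering omitted in this sketch
}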

Which led to bugs:

I was never able to get it to fix all the bugs, but at the same time, I found myself feeling responsible for them. If I could just ask it better questions or give it more context up front, maybe it would’ve done better.

Refactoring

You know that feeling you get as a developer when a class is just getting really long and unwieldy? When any attempt to patch it just makes it worse? That’s how I felt. As a last-ditch effort, I tried to get it to refactor the code:

but I just piled more complexity onto an already complex application.

And I ultimately decided to start over.

Are you sure?

The most intriguing interaction, for me, was this:

Admittedly, this question was born out of frustration more than anything else. I didn’t tell it it was wrong (it was, unfortunately); I just asked if it was sure, and it corrected itself.

Attempt 2

I decided that the issue with my first attempt was that my prompt wasn’t good enough. GPT-4 is great, but it’s not a mind reader. This time I gave it all the details up front:


- The application should have two columns. On the left is the clues, on the right is a square grid for the crossword puzzle.
- For the grid, if there's a letter in a square, the square should be white and should be something the user can select.
- If there's no letter in a square, it should be black and the user should not be able to select it.
- When a user selects a square, it should be highlighted. The user can then type and it will automatically go to the next cell in the grid. By default, the user will be typing across, but can hit space to type down.
- When a user is on a given cell, the clue corresponding to that cell should be highlighted. This includes not just the initial cell, but every cell that has a letter for that clue.
- You should use Next.js / React.
- You should provide any components and CSS.

Given this prompt, it decided to create components for the different parts of the application.
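It split things up into a clue list and a grid, composed from a top-level page, roughly along these lines. The component names and file paths here are my reconstruction of the structure it chose, not its actual output:

// pages/index.js (Next.js): the top-level page composing the pieces
import ClueList from '../components/ClueList';
import CrosswordGrid from '../components/CrosswordGrid';
import puzzle from '../data/puzzle.json';

export default function Home() {
  return (
    <div className="container">
      <ClueList answers={puzzle.answers} />
      <CrosswordGrid puzzle={puzzle} />
    </div>
  );
}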

And unfortunately... this just never worked.

I tried to give it stronger hints about what was wrong, which helped:

But when it did load, the CSS was wrong. I tried a few times but ultimately gave up and decided to take a different approach.

Attempt 3

This time I felt way more prepared. I decided that I shouldn’t expect it to dump out a full application; instead, I could ask it for iterative updates. That way I could course-correct faster.

I gave it the same prompt as last time, but with this line at the end:

We're going to build this out iteratively. Could you start by telling me the different components you'd like to build?

We started by just building the Layout:
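The Layout was essentially the two-column shell from the spec, something like this (a sketch; the class names are illustrative and the CSS is omitted):

// components/Layout.js: two columns, clues on the left, puzzle on the right
export default function Layout({ clues, grid }) {
  return (
    <div className="layout">
      <div className="clues-column">{clues}</div>
      <div className="grid-column">{grid}</div>
    </div>
  );
}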

I did a little bit of guidance and testing, and we were making progress:

If it made a mistake, I was able to correct it before the application got too complex:

And things were looking good again!

Uncorrectable mistakes

As the conversation went on, it got harder for GPT-4 to fix issues. I’m guessing this was a combination of the application just getting more complex, the challenges associated with a large context window, and me being bad at prompting it.

For one specific example, in this block of code:

it forgot to pass in setTypingDirection as a prop. I mentioned setTypingDirection was undefined and it went down the wrong path.

At this point, I realized that it was significantly more challenging to come up with a prompt that would get GPT-4 to fix the code than to just fix it myself. I hooked up the prop myself and moved on.
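The fix itself was a one-liner in the parent component: pass the setter down alongside the value. Component names here are illustrative, but the shape of the change was roughly:

import { useState } from 'react';
import CrosswordGrid from './CrosswordGrid';

export default function Crossword({ puzzle }) {
  const [typingDirection, setTypingDirection] = useState('across');

  return (
    <CrosswordGrid
      puzzle={puzzle}
      typingDirection={typingDirection}
      // The missing piece: the setter has to be passed down too.
      setTypingDirection={setTypingDirection}
    />
  );
}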

Back on track

After that, it continued adding features. It hooked up the arrow keys and backspace. It made a clue list that highlighted the currently active clue. It fixed bugs when prompted with the error message.
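A handler along these lines covers that keyboard behavior. This is a simplified sketch rather than the code GPT-4 produced, and it ignores typing direction and grid bounds:

// Simplified keydown handler: arrow keys move the selection,
// Backspace clears the current cell and steps back.
// selectedCell, setSelectedCell, and clearCell came from component state.
function handleKeyDown(event, selectedCell, setSelectedCell, clearCell) {
  const { row, col } = selectedCell;
  switch (event.key) {
    case 'ArrowLeft':  setSelectedCell({ row, col: col - 1 }); break;
    case 'ArrowRight': setSelectedCell({ row, col: col + 1 }); break;
    case 'ArrowUp':    setSelectedCell({ row: row - 1, col }); break;
    case 'ArrowDown':  setSelectedCell({ row: row + 1, col }); break;
    case 'Backspace':
      clearCell(row, col);
      setSelectedCell({ row, col: col - 1 });
      break;
    default:
      return; // let letter keys fall through to the typing handler
  }
  event.preventDefault();
}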

I did have to step in a few more times, especially for any updates that crossed multiple components. Once it had gone down a bad path, it was really difficult to fully course-correct it. In hindsight, I could’ve reset the prompt and given it all the code it had written so far.

But in the end, with a little help from me, the application is done. You can test it out here.

What did I learn?

Working together beats trusting it fully

When I first started, I wanted GPT-4 to just do all my work. My workflow was honestly:

- Ask it to add a feature
- Switch tabs to go do something else
- Come back, test the new code
- If it worked, ask for the next thing. Otherwise, report the bug and hope it fixes it

Which was okay, but there were times when I got stuck trying to figure out what prompt would get it to fix a bug I could plainly see myself.

Near the end, I was treating it more like a pair programming session. There’s no harm in just fixing issues on its behalf. I was still more effective with it by my side than I would’ve been by myself.

It’s not perfect, but it’s still very, very good

The fact that I was able to get a mostly working React crossword puzzle by just… asking for one is still incredibly impressive. No, it’s not perfect. I wouldn’t want it committing code unsupervised, especially in a larger codebase, but it still saved me a good amount of time here.

What am I to GPT-4?

In an optimistic light, I was a super-charged developer living in the future with an AI that can take my ideas and turn them into real working code. In a pessimistic light, I was an annoying middle manager for an AI developer and mostly existed to test the product and update the requirements underneath them.

Which was it? Maybe a little bit of both.