Yes, that helps. But if you iterate on this a few times (as I did last year with Code Interpreter), it reveals how much LLMs "like" to imitate patterns. Sure, often it will pattern-match on a useful fix and that's pretty neat. But after I told it "that fix didn't work" a couple times (with details about the error), it started assuming the fix wouldn't work and immediately trying again without my input. It learned the pattern! So, I learned to instead edit the question and resubmit.
LLMs are pattern-imitating machines with a random number generator added to try to keep them from repeating the same pattern, which is what they really "want" to do. It's a brilliant hack because repeating the same pattern when it's not appropriate is a dead giveaway of machine-like behavior. (And adding a random number generator also makes it that much harder to evaluate LLMs, since you need to repeat your queries and do statistics.)
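The "random number generator" here is the sampling step: rather than always emitting the single most likely next token (greedy decoding), the model draws from the probability distribution over tokens, often flattened or sharpened by a temperature parameter. A minimal sketch of that idea, with made-up logits for illustration:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Draw one token index from a softmax over temperature-scaled logits.

    Lower temperature sharpens the distribution (approaching greedy,
    i.e. always repeating the top pattern); higher temperature flattens
    it, giving other tokens a real chance.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a 3-token vocabulary.
logits = [2.0, 1.0, 0.1]
```

At a very low temperature the highest-logit token is picked essentially every time; at temperature 1.0 you get a spread of outcomes, which is also why evaluating an LLM fairly means repeating queries and doing statistics.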
Although zero-shot question-answering often works, a more reliable way to get useful results out of an LLM is to "lean into it" by giving it a pattern and asking it to repeat it. (Or if you don't want it to follow a pattern, make sure you don't give it one that will confuse it.)
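"Giving it a pattern and asking it to repeat it" is usually called few-shot prompting. A minimal sketch of what that looks like as a prompt string (the task and examples are invented for illustration):

```python
# A few-shot prompt: show the model the pattern you want, then leave
# the final slot open for it to complete. Contrast with zero-shot,
# where you'd just ask "convert this date" with no examples.
few_shot_prompt = """Convert each date to ISO 8601.

Input: March 5, 2021
Output: 2021-03-05

Input: July 19, 1999
Output: 1999-07-19

Input: December 1, 2023
Output:"""
```

The model's strong drive to complete the established pattern is exactly what makes this reliable; the flip side, as described above, is that an unintended pattern in the conversation (several failed fixes in a row) gets imitated just as readily.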
It did look that way and it's a fun way to interpret it, but pattern-matching on a pretty obvious pattern in the text (several failed fixes in a row) seems more likely. LLMs will repeat patterns in other circumstances too.
I mean, humans do this too... If I tell an interviewee that they've done something wrong a few times, they'll have less confidence going forward (unless they're a sociopath), and typically start checking their work more closely to preempt problems. This particular instance of in-context pattern matching doesn't seem obviously unintelligent to me.
This was code that finished successfully (no stack trace) and rendered an image, but the output didn't match what I asked it to do, so I told it what it actually looked like. Code Interpreter couldn't check its work in that case, because it couldn't see it. It had to rely on me to tell it.
So it was definitely writing "here's the answer... that failed, let's try again" without checking its work, because it never prompted me. You could call that "hallucinating" a failure.
I also found that it "hallucinated" other test results - I'd ask it to write some code that printed a number to the console and tell it what the number was supposed to be, and then it would say it "worked," reporting the expected value instead of the actual number.
I also asked it to write a test and run it; it would say the test passed, but when I looked at the actual output, it had failed.
So, asking it to write tests didn't work as well as I'd hoped; it often "sees" things based on what would complete the pattern instead of the actual result.