That's what's covered by the "assuming you have formalized the statement correctly" parenthetical.
Given a formal statement of what you want, Lean can validate that the steps in a (tedious) machine-readable purported proof are valid and imply the result from accepted axioms. This is not AI, but a tiny, well reviewed kernel that only accepts correct formal logic arguments.
So, if you have a formal statement that you've verified to represent what you are interested in by some other means, Lean can tell you whether the proof created by genAI is correct. Basically, there is a nigh infallible checker that won't accept incorrect hallucinations.
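As a toy illustration (not the actual problem from the article): everything between the colon and `:=` is the statement a human must vet by other means; the term after `:=` is the proof the kernel checks.

```lean
-- The statement (before `:=`) is what a human validates;
-- the proof term (after `:=`) is what Lean's kernel mechanically checks.
theorem add_comm_toy (a b : Nat) : a + b = b + a := Nat.add_comm a b
```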
Yeah, people dramatically overestimate the difficulty of getting one's definitions correct for most problems, especially when you are doing an end-to-end proof rather than just axiomatizing some system. The definitions are still worth looking at carefully, especially for AI-generated proofs, where you don't get the immediate feedback you get as a human when something you expected to be hard goes through easily. But contrary to what seems to be popular belief here, they are generally much easier to verify than the corresponding proof. (In the case of formally verified software, the analogy is verifying that the spec is what you want vs. verifying that the program matches the spec; the former is generally much easier.)
Everyone has a different perspective, based on their math background. From the OP's perspective, the formalization of this problem statement was apparently worth talking about. On the other hand, for you it's just a homework problem that belongs in an intro class.
Let's just be generous and try to accept these differences.
My comment was exactly about the required math background though. Anyone who's completed an intro to proof class would find that kind of statement easy to formalize because they would have had to write similar statements for homework. That provides some context: everyone who's interested in computer theorem provers probably has some experience with proofs, so formalizing that statement should be easy for them. i.e. (for this kind of problem) it's not really "the hard part" for people who are seriously working on this stuff.
Are you an expert? Not gatekeeping here but I have no intuition for what is easy or hard to formalise. A lot of very simply stated graph theoretical results are apparently extremely hard to formalise.
I can't speak for ndriscoll, but I am a university math professor with extensive experience teaching these sorts of topics, and I agree with their comment in full.
You are right that some (other) statements are harder to formalize than they look. The Four Color Theorem from graph theory is an example. Generally speaking, discrete math, inequalities, for all/there exists, etc. are all easy to formalize. Anything involving geometry or topology is liable to be harder. For example, the Jordan curve theorem states that "any plane simple closed curve divides the plane into two regions, the interior and the exterior". As anyone who has slogged through an intro topology book knows, statements like this take more work to make precise (and still more to prove).
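For instance, a pure for-all/there-exists statement like "there are arbitrarily large primes" is a one-liner to state in Lean (a Mathlib-style sketch; `sorry` stands in for the omitted proof):

```lean
-- Stating the theorem is easy: just quantifiers over ℕ.
-- Proving it is a separate matter; `sorry` leaves that hole open.
theorem arbitrarily_large_primes : ∀ n : ℕ, ∃ p, n ≤ p ∧ Nat.Prime p := by
  sorry
```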
When someone takes the time to explain undergrad-level concepts in a comment, responding with "are you an expert?" is a level of skepticism that's bordering on hostile. The person you're responding to is correct, it's rare that the theorem statement itself is particularly hard to formalize. Whatever you read likely refers to the difficulty of formalizing a proof.
To be fair, the comment did not explain any concept that I can see, or why this statement is simple. It gave the statement and said it was simple to formalise. It does seem simple enough to me (basic arithmetic statement with a few variables and a bunch of quantifiers) but if somebody has no expertise/intuition, I think it is a fair question, without any hostile intent assumed.
> it's rare that the theorem statement itself is particularly hard to formalize
That's very dependent on the problem area. For example, there's a gap between the high-school explanation of the central limit theorem and its actual formalization. And when dealing with Turing machines, you'll sometimes say that something grows, e.g., Omega(n), when what actually happens is that it grows like that only along some subsequence of inputs. Generally, for complexity theory, plain-language explanations can be very vague, because of how insensitive the theory is to small changes; you need to operate at a higher level of abstraction to have a chance of explaining a proof in reasonable time.
Yes, if the theorem statement itself is "hard to formalize" even given our current tools, formal foundations etc. for this task, this suggests that the underlying math itself is still half-baked in some sense, and could be improved to better capture the concepts we're interested in. Much of analysis-heavy math is in that boat at present, compared to algebra.
Lol it’s weird seeing high school redditors saying gatekeeping and are you an expert in the same thread as university professors, all talking about the same topic. But I guess that’s HN for you.
As the sibling comment says, in my experience, the difficulty of formalizing the problem varies greatly between different areas. Probability theory is another notorious field in which modelling a problem correctly can be very subtle and difficult.
On the other hand, many problems in number theory and discrete structures tend to be rather simple to formalize. If you want to take your own look at that, I'd recommend checking out the Lean games[1]. I'd say that after the Natural Numbers Game, you most likely know enough Lean to write that problem down, with the "sufficiently small" part being the trickiest.
Voters: please reconsider your ups and downs. I think the “Are you an expert” question triggered a lot of downvotes when it was in fact asked in good faith to judge the person’s perspective of easy and hard.
And I would say there is no way to ask that question in good faith. (Tedious proof by cases left as an exercise for readers.)
The correct question would have been: does anyone else agree with the statement?
In this particular case, the amount of knowledge needed (of, e.g., the Lean language, math, and the Erdős problems) means any credible statement about the difficulty requires an expert.
It doesn't require an expert though. That was kind of my point. If you've taken an intro proof class (so the intro class for math majors/the topic), and if you've fiddled with Lean a bit (e.g. played some of the Natural Numbers Game another commenter linked), you'll know it's easy (source: I did math in my undergrad, and have fiddled with Lean a bit). Honestly I expect intro proof classes will start to be centered around something like Lean soonish if some aren't already incorporating it, and we'll see math majors more explicitly making the connection between program and proof.
Like if someone were incredulous that we could reasonably analyze running time and memory usage of something like merge sort and I said that's a standard example in an intro algorithms course, presumably people would be like "yeah it is".
They can read the statement, and the definitions that the statement references. If everything it references is in a well-tread part of the Lean library, you can have pretty high confidence in a few minutes of going over the syntax.
Isn't that kind of a general problem with proofs, even when they're written by humans? There's nothing stopping someone from accidentally writing their own Lean proof that has slightly different semantics than an English version of the same proof, or even for their English proof to subtly miss something important or make an incorrect logical leap. This seems like a bit of a double standard, although maybe there's nuance here I'm missing.
One nuance you are missing is that the discussion is about formalizing the statement (the theorem), not the proof. The latter is what the article is about, but that doesn't suffice if you can't trust that the statement is also correctly formalized.
These are the two main problems:
1. Formalizing a theorem.
2. Finding a formal proof.
Part 2 is where AI could help as proof search is full of heuristics. That's also how humans find proofs and is one of the main skills of a mathematician. The formal proof can then be machine checked with well known and mature techniques not involving AI.
Part 1 is the part that's missing and will always be hard. It's also the issue with formal verification of programs for which correctness criteria are often very complex and it's easy to mess up the formalization, so that even if you trust the proof, you can't trust that it proves the right thing.
To borrow some definitions from systems engineering, this question is one of validation rather than verification. Verification is performed by Lean and its enforcement of spec syntax and logic. But validation is the question of whether the Lean spec encodes a true representation of the problem statement (was the right thing specced?). Validation at the highest levels is probably an irreplaceable human activity.
Also, on the verification side, there's a window of failure: Lean itself could have a hidden bug. And with automated systems that seek correctness, the risk is slightly elevated that some missed crack of a bug gets exploited in the dev-check-dev loop run by the AI.
I can read and understand e.g. Python, but I have seen subtle bugs that were hard to spot in code generated by AI. At least the last time I tried coding agents (mid 2025), it was often easier to write the code myself than to play "spot the bug" with whatever was generated. I don't know anything about Lean, so I was wondering if there were similar pitfalls here.
In this case the human-written statement of the theorem is small. Can there be bugs? Absolutely! But it's essentially a few lines of code's worth of thinking.
The lean proof checker then checks to make sure the proof actually proves the statement.
In this case an AI is generating the proof, but if it "compiles" it's correct. The only thing humans need to check is the statement to be proven.
(I don't know anything about this project, but I've played around with Lean and used other proof checkers more seriously.)
This is also an issue with non-AI and non-Lean proofs. Andrew Wiles' initial Fermat's Last Theorem proof initially had an error in it. That was spotted by peer review, fixed, and an updated proof was submitted.
I don't disagree with you, but on the other hand, I feel the same way about code written by other humans. That's not because they're necessarily worse at writing code than me, but because for code I've written myself, I've already spent time thinking about it, so I don't have to start from scratch when re-reading it. It's also not like I think I write fewer bugs than my coworkers; it's just easier for me to catch my own up front as I'm coding than it is to catch theirs in code review.

The two main differences are that I can have a more meaningful conversation with my coworkers about their approach, what bugs they might think are worth looking out for, etc., compared to an LLM (which in my experience will claim completely incorrect things about the code it wrote far more often than any engineer I've worked with, even junior ones; the humans I've worked with have uniformly been able to report how confident they are that what they've produced matches what they were tasked with, without insane exaggerations), and that an LLM can produce a much higher volume of plausible-enough-looking code in a given unit of time than most humans I've worked with.

It's not obvious to me that these would be particularly severe problems in generating proofs though; unless the proof is so large that it would be infeasible to read through in a reasonable amount of time, I would expect mathematicians to be able to make up for the lower-quality conversations with the "author" by devoting more time to reading and thinking, or by having someone else also read through the proof and talk it through with them. If anything, the timeline for writing up a paper about the results might be better for some mathematicians than the equivalent amount of time most engineers have to spend reviewing code before the pressure to get it merged and move on to the next thing.
(I'm aware that there is certainly pressure to get things published in academia, but I don't have firsthand experience, so I've tried to be intentional in how I've worded this to clarify that I want to avoid any assumptions about what the norms would be, but given the relatively wide range of time pressure that engineers might experience across the industry as a whole, I'd expect that at least some mathematicians might have some flexibility to spend extra time reading through an LLM-written proof, especially if it might be time they'd otherwise have to spend trying to come up with the proof themselves).
As I understand it, Lean is not a general-purpose programming language; it's a DSL focused on formal logic verification. Bugs in a DSL are generally easier to identify and fix.
It seems one side of this argument desperately needs AI to have failed, and the other side is just saying that it probably worked but it is not as important as presented, that it is actually just a very cool working methodology going forward.
Compiling isn’t sufficient because it doesn’t tell you if the program matches the specification. A program that always says the temperature is 80 F will compile but is a terrible solution to what is the temperature outside at this location right now.
A bit of nitpicking, but according to books like AIMA, that is indeed AI. The first chapter classifies even any control system as AI.
For the reasons stated in that first chapter, I totally agree with the authors.
The whole system is AI. That part is a verifier in a "suggestion/instinct -> verifier" chain, like those used in neurosymbolic systems for automated driving, for example.
1. you write a proof in English that there is an infinite number of primes.
2. the llm writes 2+2=4 in lean.
3. lean confirms that this is correct and it's impossible that this proof is wrong.
You missed a whole section - a person creates a Lean formalization of #1 and Lean promptly says the AI proof is wrong because it doesn’t prove that formal problem statement.
The question is in the person (or AI) creating the formal problem statement - how do you know it represents the problem the proof is supposed to be for? And the answer is that for people in the field, in this case, formalizing the problem and verifying the formalization is easy. It is like generating a public key versus factoring it.
I thought that's what I was trying to express between steps 1 and 2 above, but I may have failed to get it across. My understanding is that the danger is that the LLM will create a proof that is correct but isn't about what the person thinks they're proving?
Soo, it can definitively tell you that 42 is correct Answer to the Ultimate Question of Life, The Universe, and Everything. It just can't tell you if you're asking the right question.
Calling Lean "AI" is quite a stretch. Though I'm also in the camp that dislikes the inflationary use of "AI" for LLMs, so I have sympathies for your viewpoint.
So basically anything we don’t know how to write an algorithm for? I see where you’re coming from - but at the same time it’s actually an AI meme and smells of permanently moving goalposts.
XOR is not a great choice here. Consider that 2 copies of a row give the same result as 0 copies (or 4, 6, etc.). And even without multiple copies of rows, you can force any hash you'd like by observing what happens when you insert more random rows and finding a subcollection that flips exactly the bits you want.
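To make the duplicate-row failure concrete, a quick sketch (hypothetical `xor_table_hash`; per-row hashes are SHA-256, folded together with XOR):

```python
import hashlib

def row_hash(row: bytes) -> int:
    return int.from_bytes(hashlib.sha256(row).digest(), "big")

def xor_table_hash(rows) -> int:
    # Fold the per-row hashes together with XOR (the scheme being critiqued).
    h = 0
    for r in rows:
        h ^= row_hash(r)
    return h

# Two copies of a row cancel out, so the table hash can't see them:
assert xor_table_hash([b"a", b"b"]) == xor_table_hash([b"a", b"b", b"x", b"x"])
```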
What you probably want to look at is homomorphic hashing. This is usually implemented by hashing each row to an element of an appropriate abelian group and then using the group operation to combine them.
With suitable choice of group, this hash can have cryptographic strength. Some interesting choices here are lattices (LtHash), elliptic curves (ECMH), multiplicative groups (MuHash).
It depends on whether you are doing something security critical with the result.
Maybe you have a trusted table hash but only a user-supplied version of the table. Before you use that data for security sensitive queries, you should verify it hasn't been modified.
Basically, if you ever have to contend with a malicious adversary, things are more interesting, as usual. If not, addition is likely fine (though 2^k copies of a row now leave the k lowest bits unchanged).
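A sketch of that last caveat (same toy setup as above, but folding with addition mod 2^64): inserting 2^k copies of a row changes the sum by 2^k times the row's hash, so the low k bits of the table hash never move.

```python
import hashlib

M = 1 << 64  # accumulator width

def row_hash(row: bytes) -> int:
    return int.from_bytes(hashlib.sha256(row).digest()[:8], "big")

def add_table_hash(rows) -> int:
    # Fold per-row hashes with addition mod 2^64 instead of XOR.
    return sum(row_hash(r) for r in rows) % M

base = add_table_hash([b"a", b"b"])
padded = add_table_hash([b"a", b"b"] + [b"evil"] * (1 << 8))
# 2^8 copies shift the change up by 8 bits: the low 8 bits are unchanged.
assert base % (1 << 8) == padded % (1 << 8)
```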
Sometimes these are officially sanctioned and even required! Most notably, logos and other embellishments are a form of "trap street", where the idea is to have something to point to that trivially proves someone is using your mask design if it comes to it.
It was funny to notice my mental knee-jerk reaction of "ought to be enough for anyone" (referring to 640K RAM). I've never used args with env until last week in the first place and even then it was a (mental) debate of solving it in code (wrapping print() in a function that also flushes stdout) or adding the argument instead, so I guess that's why that was my initial thought. But of course that doesn't mean there can't be legitimate uses of long argument lists. Java code comes to mind (although I do question the sanity of many things Java).
My understanding is that this was the original purpose, but it's probably unnecessary now, given that all the relevant process steps are encased in tools that are not exposed to the general clean room environment in operation (they have to maintain a higher grade of clean room internally).
However, I suspect it's probably going to continue being propagated forward due to Copy Exactly which has historically meant that Intel documents and copies the smallest details in all their fabs just in case they are necessary. In the end, if it ain't broke...
Chesterton's Fence [0] should overrule fear of cargo culting, especially when the fence (or in this case color) was clearly put up by people who really knew what they were doing at the time.
Unless there is a significant cost to it, and unless the benefit is understood, there is no reason that they should be faulted as ignorant cargo cultists simply for keeping it in place.
> For decades, Intel’s cleanrooms have been lit like darkrooms, bathed in a deep, low yellow. “That’s an anachronism,” says Mark Bohr, a small, serious man who has spent his entire 38-year career making chips, and who’s now Intel’s top manufacturing scientist. “Nobody’s had the courage to change it.”
It's possible that some level of unresolved empirical "magic" is involved in the process, which doesn't diminish the value of the end product. Why deprive yourself of hard-learned lessons even if you don't understand them all?
If I wanted to keep the blur aesthetic, I'd probably do a full removal, run an inpainting algorithm to replace the removed region with something less jarring in context and then blur the result. The inpainting algorithm can be fairly low quality and still get acceptable results since it won't be seen directly.
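A pure-Python sketch of that pipeline, on a toy grayscale image represented as lists (real code would use an image library; the crude neighbor-mean inpaint here is exactly the kind of low-quality fill that's acceptable because the result gets blurred):

```python
def crude_inpaint(img, mask):
    # Fill masked pixels with the mean of already-known neighbors,
    # sweeping inward until everything is filled. Quality can be low
    # since the region will be blurred afterwards anyway.
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    todo = {(y, x) for y in range(h) for x in range(w) if mask[y][x]}
    while todo:
        filled = set()
        for y, x in todo:
            nbrs = [out[y + dy][x + dx]
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < h and 0 <= x + dx < w
                    and (y + dy, x + dx) not in todo]
            if nbrs:
                out[y][x] = sum(nbrs) / len(nbrs)
                filled.add((y, x))
        if not filled:  # fully-masked image: nothing to propagate from
            break
        todo -= filled
    return out

def box_blur(img):
    # 3x3 mean filter applied to the whole image.
    h, w = len(img), len(img[0])
    def mean_at(y, x):
        nbrs = [img[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if 0 <= y + dy < h and 0 <= x + dx < w]
        return sum(nbrs) / len(nbrs)
    return [[mean_at(y, x) for x in range(w)] for y in range(h)]
```

Usage would be `box_blur(crude_inpaint(img, mask))`: inpaint the removed region first, then blur the whole frame so the patch blends in.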
Basically, it's what (I think) submodules should have been. It creates a vendored copy with metadata about what commit it came from. Normal operations like clone and commit (even touching a mix of main-repo and subrepo files, which subtree struggles with) are unaffected; they're normal files from git's point of view.
Pull, push, branching all work as expected on main project (most devs don't even need to know there's a subrepo). If you want to pull/push vendored changes from subrepo, there's a new command, but that's it.
As Izkata mentions, the superproject has _all_ the files for a given commit without any need for users to have access to other repos, additional submodule init commands, etc.
Basically, after a subrepo clone, you've copied the file tree for the subrepo and can make commits on it in the superproject to your heart's content (branching, etc). This is basically a fork/mirror, but adds a single metadata file to track the last upstream commit you pushed/pulled from to allow reconciling later. So with git subrepo, you make commits in the superproject first and can choose (at some later point, if at all) to merge with the subrepo upstream. This is arguably consistent with the git model writ large (make local commits, later choose if/how to integrate those with upstream). Importantly, people that clone your superproject repo don't have to know anything about subrepo or special commands to send changes back to you.
For submodules, changes flow in the other direction. If you want to make a change to the child repo, you must 1) commit in the submodule, 2) push the submodule commit to its upstream, 3) make a commit in the super project that changes the commit the submodule is pointing to. When someone pulls, switches branches, etc in the superproject, they need to do a submodule update with various failure modes or else they end up with empty/out-of-date content.
Subtrees are a bit similar to subrepos, but in practice you still need to be aware of their boundaries since you can't mix subtree/superproject modifications in the same commit. Moreover, you need to use a special merge strategy rather than git default merge/rebase which subrepo uses.
I can't answer in general, but I do see one possible misunderstanding: "vendored" means the code is copied into the parent repo.
If a submodule's original source goes offline, a new clone of the parent repo won't be able to retrieve it, since it just stores a reference to where to clone from. If subrepo really vendors the code, that'll never be a problem since a full copy is committed directly to the parent repo.
There's no need to store all previous states to detect a cycle: it can be done using just twice the original memory.
You can model the state transitions as a linked list and use Floyd's classical tortoise and hare algorithm to (eventually) determine a cycle exists. Initialize "tortoise" and "hare" as two copies of initial state and advance hare by two instructions and tortoise by one each iteration. A cycle is reported if the two states become equal again.
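A sketch of that scheme, assuming a hypothetical `step` function that advances the machine by one instruction:

```python
def floyd_meets(step, start):
    """Tortoise-and-hare cycle detection: the hare advances two steps per
    tortoise step; for any deterministic finite-state system the two must
    eventually coincide, proving a cycle exists. Only two states are kept
    in memory (twice the original footprint)."""
    tortoise = step(start)
    hare = step(step(start))
    while tortoise != hare:
        tortoise = step(tortoise)
        hare = step(step(hare))
    return tortoise  # a state lying on the cycle

# Toy "machine": a small deterministic state-update rule.
f = lambda x: (x * x + 1) % 251
on_cycle = floyd_meets(f, 3)
```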
It's true your solution will work, but you're essentially trading space complexity for time complexity. With 2^524288 states, instead of requiring more RAM than there are atoms in the universe, you'll now need more seconds of compute time than remain before the heat death of the universe.
Note that the time taken is linear with respect to the original execution.
A cycle of n instructions starting at instruction n0 will be detected in between n0 and n0+n iterations (i.e. <= 3*(n0+n) underlying instructions) since n0 iterations gets both into the cycle and the offset between them will become 0 some time in the next n iterations.
n could be very large, of course (e.g, using all of memory as a giant counter so the cycle length is huge), but the cycle detection is not really making your problem worse.
Thank you. This is the sort of thing I was alluding to when I suggested it was harder than just doing a simple replacement on the standard halting argument.
In this case, I'll back it down to suggesting there's probably some way to prove it can't be done without some unreasonable amount of at least one of time and space.
I've had success using \import{subdir/}{BasenameInSubdir} from the import package.
The basic (and unfortunate in a xkcd/1479 way) issue appears to be that the arXiv compile server doesn't allow writing to subdirectories. A subdir \include implicitly requires write access for the aux file.
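For reference, a minimal sketch of that workaround (`subdir/` and `BasenameInSubdir` are placeholder names):

```latex
\usepackage{import}  % in the preamble

% Unlike a subdir \include, this needs no aux file written into subdir/,
% and relative paths inside the imported file resolve against subdir/:
\import{subdir/}{BasenameInSubdir}  % reads subdir/BasenameInSubdir.tex
```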
I've had arXiv automatically block bare uploads of TeX-derived PDFs (presumably identified through PDF metadata). For these, it required that the source be uploaded and compiled on their server.
On balance, it's probably best to require source as arXiv does, but this can create interesting issues from time to time. Since the source is downloadable, researchers can inadvertently end up sharing partial results or snark that was commented out in the TeX source.