Consider also that even reattaching nerves that are supposed to be there is not exactly a walk in the park. Look into finger reattachment surgery and post-operation care. Think pain, tingling, a year or more of physiotherapy... and that's the best case, where it actually works and you don't end up with a "dead" finger. Now imagine that for your whole body.
Is it a good thing to have merges that never fail? Often a merge failure indicates a semantic conflict, not just "two changes in the same place". You want to be aware of and forced to manually deal with such cases.
I assume the proposed system addresses it somehow but I don't see it in my quick read of this.
It says that merges that involve overlap get flagged to the user. I don't think that's much more than a defaults difference to git really. You could have a version of git that just warns on conflict and blindly concats the sides.
This is kind of how jj handles the situation. git won't let you move on from a rebase if there are conflicts. By comparison, jj will just put a marker in the log pointing out that there are conflicts in a branch. You resolve them whenever you feel like it, but all resolving them does is effectively remove the "conflict" marker and rebase all of the descendent commits (which may clean up merge conflicts, or make them worse).
I realized recently that I've subconsciously routed around merge conflicts as much as possible. My process has just subtly altered to make them less likely, to the point where seeing a 3-way merge feels jarring. It's really only taking on AI tools that brought this to my attention.
I'm surprised to see that some people sync their working tree and do not evaluate their patch again (testing and reviewing the assumptions they made for their changes).
My understanding of the way this is presented is that merges don't _block_ the workflow. In git, a merge conflict is a failure to merge, but in this idea a merge conflict is still present but the merge still succeeds. You can commit with conflicts unresolved. This allows you to defer conflict resolution to later. I believe jj does this as well?
Technically you could include conflict markers in your commits but I don't think people like that very much
If other systems are doing it too then I guess it must be useful
But why is it useful to be able to defer conflict resolution?
I saw people in a parallel comment thread discussing the merge-commit vs rebase workflow: rebase gives cleaner git history, but it's a massive pain to resolve conflicts on every commit since the current branch diverged, instead of just once on the final result with a merge commit.
Is it that? Deferred conflict resolution allows you to rebase but only resolve conflicts at the end?
Delayed conflict resolution in jj is valuable when you're rebasing a long chain of commits. If I rebase a chain of 10 commits and each of the commits has a conflict, I'm stuck in conflict resolution mode until I fix all 10 conflicts. Maybe something else came up, or maybe I got tired of doing conflict resolution and want to do something else. Git's answer is to finish or abandon.
Also, in jj it's pretty easy to rebase a lot of stuff all at once, giving you even more opportunities to create conflicts. Being able to delay resolution can be easier.
Deferred conflict resolution is amazing in jj because I may never return to some of the branches that are in conflict and therefore might never bother resolving them.
I rebase entire trees of commits onto main daily. I work on top of a dev-base commit and it has all kinds of anonymous branches off it. I rebase it and all its sub-branches in one command, and some of those sub-branches might then be in a conflicted state. I don't have to resolve them until I need to actually use those commits.
dev-base is an octopus merge of in-flight PRs of mine. When work is ready to be submitted it moves from being a descendant of dev-base to a parent of dev-base.
Rebasing all my PRs, dev-base, and all its descendants is one command. Just make sure my @ is a descendant of dev-base and then run: jj rebase -d main
The conflict lines shown in the article are not present in the file, they are a display of what has already been merged. The merge had changes that were too near each other and so the algorithm determined that someone needs to review it, and the conflict lines are the result of displaying the relevant history due to that determination.
In the example in the article, the line inserted by the right-hand change is left floating because the function it belonged to was deleted by the left-hand change. That's the state of the file: it has the line that was inserted and it does not have the lines that were deleted; it contains both conflicting changes.
So in that example you indeed must resolve it if you want your program to compile, because the changes together produce something that does not function. But there is no state about the conflict being stored in the file.
Isn't that a bit dangerous in its own right? If the merge process can complete without conflicts being resolved, doesn't it just push the problem down the road? All of a sudden you have to deal with failing CI, or ghost features that involve multiple people, when you really should have solved your conflict locally at merge time.
The conflict is no longer an ephemeral part of the merge that only ever lives as markup in the source files and is stomped by the resolution that's picked, but instead a part of history.
I think it is also not true that there's only one correct answer, although I don't know how valuable this is.
For committing let's say yes, only one correct answer. Say the tool doesn't let you commit after you've merged without resolving conflicts.
But continuing to work locally, you may want to put off resolving the conflict temporarily. Like, person A changed the support email to help@example.com and person B changed it to support@example.com: obviously some wires got crossed and I will have to talk to A or B before committing the merge and pushing, but I can also go ahead and test the rest of the merge just fine.
And heck, maybe even committing after merging is fine but pushing requires resolving. Then I can continue working and committing locally on whatever else I was working on, and I'll only resolve it if I need to push. Which may mean I never need to resolve it, because A or B resolve it and push first.
> The conflict is no longer an ephemeral part of the merge that only ever lives as markup in the source files and is stomped by the resolution that's picked, but instead a part of history.
It allows review of the way the merge conflict has been resolved (assuming those changes are tracked and presented in a useful way). This can be quite helpful when backporting select fixes to older branches.
In this model, conflicts do not exist, so there are no conflict markers (the UI may show markers, but they get generated from what they call “the weave”)
Because of that, I think it is worse than “but it is not valid syntax”; it’s “but it may not be valid syntax”. A merge may create a result that compiles but that neither of the parties involved intended to write.
They address this; it's not that they don't fail, in practice...
the key insight is that changes should be flagged as conflicting when they touch each other, giving you informative conflict presentation on top of a system which never actually fails.
With git, conflicts interrupt the merge/rebase. And if you end up in a situation with multiple rebases/merges/both, it's easy to get a "bad" state, or be forced to resolve redundant conflict(s) over and over.
In Jujutsu and Pijul, for example, conflicts are recorded by default but marked as conflict commits/changes. You can continue to make commits/changes on top. Once you resolve the conflict of A+B, no future merges or rebases would cause the same conflict again.
Yes and no. Most often conflicts could have been handled automatically with better tools. For example, I have a script that makes a copy of the whole folder and tries to merge each commit using all of git's different merge strategies, and all sub-strategies, and presents which ones can merge without any conflicts. It has been mind-opening. Why git doesn't have this built in, I don't understand.
Git also writes (non-logs) to the .git folder for operations that you would assume should have been r/o, but that’s another problem (that affects things later on).
Should you be counting on confusion of an underpowered text-merge to catch such problems?
It'll fire on merge issues that aren't code problems under a smarter merge, while also missing all the things that merge OK but introduce deeper issues.
Post-merge syntax checks are better for that purpose.
And imminently: agent-based sanity-checks of preserved intent – operating on a logically-whole result file, without merge-tool cruft. Perhaps at higher intensity when line-overlaps – or even more-meaningful hints of cross-purposes – are present.
> It'll fire on merge issues that aren't code problems under a smarter merge, while also missing all the things that merge OK but introduce deeper issues.
That has not been my experience at all. The changes you introduce are your responsibility. If you synchronize your working tree to the source of truth, you need to evaluate your patch again, whether it introduces a conflict or not. In this case a conflict is a nice signal showing where someone has interacted with files you've touched and possibly changed their semantics. The pros are substantial, and it's quite easy to resolve conflicts that are only due to syntactic changes (whitespace, formatting, equivalent statements, ...)
If you're relying on a serialized 'source of truth', against which everyone must independently ensure their changes sanely apply in isolation, then you've already resigned yourself to a single-threaded process that's slower than what improved merges aim to enable.
Sure, that works – like having one (rare, expensive) savant engineer apply & review everything in a linear canonical order. But that's not as competitive & scalable as flows more tolerant of many independent coders/agents.
Decentralization in this case means one can secede easily from the central authority. So anyone working on a project can easily split away from the main group at any time. But every project has a clear governance where the main direction is set and where the canonical version of the thing under version control is stored.
That canonical version is altered following a process, and almost every project agrees that changes should be proposed against it. Even with independent agents, there should be a way to ensure consensus and decide the final version. And that problem is a very hard one.
And yet after all these years of git supporting no single source of truth, we still fall back on one. As long as you have an authoritative version and an authoritative release, then you have one source of truth. Linus imagined everyone contributing with no central authority, and yet we look to GitHub and GitLab to centralize our code. Git is already decentralized, and generally we find it impractical.
He's not saying you shouldn't have conflicts; just that it's better to have syntax-aware conflict detection. For example if two people add a new function to the end of the same file, Git will always say that's a conflict. A syntax-aware system could say that they don't conflict.
> Should you be counting on confusion of an underpowered text-merge to catch such problems?
This does not really follow from my statement.
I said that underpowered text merge should not silently accept such situations, not that it is the only way to catch them. It doesn't replace knowing something about what you are merging, but it is certainly a good hint that something may be wrong or unexpected.
> Post-merge syntax checks are better for that purpose.
Better, yes, but I was addressing semantic issues, not syntactical. I have seen syntactically valid merges result in semantic inconsistency, it does happen.
I do agree with your last statement.. unit & integration tests, agent checks or whathaveyou, these all contribute to semantic checking, which is a good thing.
Can they be relied on here? Maybe? I guess the jury is still out. My testing philosophy is "you can only test for what you think of testing". And tests and agent checks have a signal to noise ratio, and are only as useful as their SNR allows.
There is no guaranteed way to stop bugs from happening, if there were it likely would have been discovered by now. All we can do is take a layered approach to provide opportunities for them to get caught early. Removing one of those layers (merge conflicts) is not clearly a good thing, imho, but who knows.. if agent checks can replace it, then sure, I'm all for it.
Probably depends on what is in the merge. Lately I've been collaborating a ton on PRDs and software specs in markdown (now that agents have gotten pretty good at turning them into usable code), and using git has been pretty painful. Especially when working with a domain expert who's not as technical, git is proving to be almost more of a barrier than an aid.
For this kind of work (which I suspect will only get more common), a CRDT-based VCS makes a lot of sense.
I agree. Nevertheless I wonder if this approach can help with certain other places where Git sometimes struggles, such as whether or not two commits which have identical diffs but different parents should be considered equivalent.
In the general case, such commits cannot be considered the same — consider a commit which flips a boolean that one branch had flipped in another file. But there are common cases where the commits should be considered equivalent, such as many rebased branches. Can the CRDT approach help with e.g. deciding that `git branch -d BRANCH` should succeed when a rebased version of BRANCH has been merged?
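Git actually exposes a primitive for exactly this judgment: patch-ids, which are stable across rebases (this is what `git cherry` uses to detect already-applied commits). A small demonstration, with an illustrative repo layout:

```shell
# Show that a rebased commit keeps its patch-id even though its sha changes.
work=$(mktemp -d)
git init -q -b main "$work" && cd "$work"
git config user.email demo@example.com
git config user.name demo
echo one > a.txt && git add a.txt && git commit -qm A

git checkout -qb feature
echo two > b.txt && git add b.txt && git commit -qm B
orig=$(git rev-parse HEAD)

git checkout -q main
echo three > c.txt && git add c.txt && git commit -qm C

git checkout -q feature
git rebase -q main                      # B gets a new sha, same diff
new=$(git rev-parse HEAD)

# patch-id hashes the diff text, ignoring shas and line offsets
pid() { git show "$1" | git patch-id --stable | cut -d' ' -f1; }
[ "$orig" != "$new" ] && [ "$(pid "$orig")" = "$(pid "$new")" ] \
  && echo "equivalent commits"
```

So a hypothetical smarter `git branch -d` could compare patch-ids rather than ancestry; whether a CRDT history makes that cheaper or more robust is the open question.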
> Conflicts are informative, not blocking. ... Conflicts are surfaced for review when concurrent edits happen “too near” each other, but they never block the merge itself.
Something I want to add to the discussion is that the only time I've encountered this was not with a specific company but with an "AI recruitment agency", which I'm seeing getting more and more popular.
And while I get the idea of an agency handling hiring, what bothered me is that the terms of the AI interview were that it was relatively standardized for a given role, and that they would record it and put it on file to show to other companies, with the selling point being: do well in one interview and we'll shop your profile around for you!
Which is.. great if you do well I guess, but.. really unsettling if you don't. I mean, there was zero information that you'd be able to do it over, no advance details of the format, no practice session. So if you fail, or stammer, or get surprised by some detail of some question.. what, you're just "on file" now, out of reach for their entire client portfolio?
At least if you're doing it one company at a time, you mess up, then ok you move on and try again somewhere else. But the idea of making some random mistake (which happens all the time!) just blacklists you for some unknown number of companies, forever..
> Is it that you can backprop through this computation? Do you do so?
With respect, I feel that you may not have read the article.
> Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.
and,
> By storing points across nested convex hulls, this yields a decoding cost of O(k+log n).
and,
> Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.
So yes, and yes.
> Where are the benchmarks?
Not clear what they should benchmark it against. They do compare speed to a normal KV Cache. As for performance.. if it's actually executing a Sudoku solver with a 100% success rate, it seems pretty trivial to find any model doing < 100% success rate. Sure, it would be nice to see the data here, agree with you there.
Personally I think it would be really interesting to see if this method can be combined with a normal model MoE-style. It is likely possible; the router module should pick up quite quickly that it predicts the right tokens for some subset of problems deterministically. I like the idea of embedding all sorts of general solvers directly into the model, like a Prolog solver, for example. In fact, it never would have occurred to me to just go straight for WASM; it's a pretty interesting choice to directly embed a VM. But it makes me wonder what "smaller" interpreters could be useful in this context.
I read the article and had the same question. It's written in such a way that it feels like it's answering these questions without actually doing so.
The right thing to benchmark against isn't a regular transformer, it's a transformer that writes programs that are then interpreted. They have a little visual demo where it looks faster but only because they make Python absurdly slow, and it's clearly not meant to be a real benchmark.
I spent the whole article thinking, wow, cool, but also ... how is this better than an LLM steering a regular computer? The closest we get is a statement about the need to "internalize what computation is" which doesn't say anything to me.
Fundamentally, running actual instructions on a real CPU is always going to be faster than running them via a neural network. So the interesting part is where they say you can backprop through it. But, ok, backprop is for cases where we don't know how to encode a function using strict logic. Why would you try to backprop through a Sudoku solver? Probably my imagination is just limited, but I could have used more on that.
Did you read the post you are responding to? It says:
> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?
The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?"
There are no details about training nor the (almost-certainly necessarily novel) loss function that would be needed to handle partial / imperfect outputs here, so it is extremely hard to believe any kind of gradient-based training procedure was used to determine / set weight values here.
my understanding was that they are not training at all, which would explain that. they are compiling an interpreter down to a VM that has the shape of a transformer.
ie they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.
EDIT: Actually, they do make this clear(ish) at the very end of the article, technically. But there is a huge amount of vagueness and IMO outright misleading / deliberately deceptive stuff early on (e.g. about potential differentiability of their approach, even though they admit later they aren't sure if the differentiable approach can actually work for what they are doing). It is hard to tell what they are actually claiming unless you read this autistically / like a lawyer, but that's likely due to a lack of human editing and too much AI assistance.
I'm curious if 1-bit params can be compared to 4- or 8-bit params. I imagine that 100B is equivalent to something like a 30B model? I guess only evals can say. Still, being able to run a 30B model at good speed on a CPU would be amazing.
At some point you hit information limits. With conventional quantisation you see marked capability fall-off below q5. All else being equal you'd expect an N-parameter 5-bit quant to be roughly comparable to a 3N-parameter ternary, if they are trained to the same level, just in terms of the amount of information they can possibly hold. So yes, 100B ternary would be within the ballpark of a 30B q5 conventional model, with a lot of hand-waving and sufficiently-smart-training
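The arithmetic behind that ballpark (raw information capacity only; it says nothing about trainability):

```shell
# Back-of-the-envelope: total bits a parameter set can possibly hold.
# A ternary parameter carries log2(3) ~ 1.585 bits; a 5-bit quant carries 5.
awk 'BEGIN {
  ternary_bits = 100e9 * log(3) / log(2)   # 100B ternary params
  q5_bits      = 30e9  * 5                 # 30B params at 5 bits each
  printf "100B ternary: %.1f Gbit\n", ternary_bits / 1e9
  printf "30B @ 5-bit:  %.1f Gbit\n", q5_bits / 1e9
}'
```

The two capacities come out within a few percent of each other (about 158.5 vs 150 Gbit), which is where the "100B ternary is roughly a 30B q5" hand-wave comes from.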
I assume that, theoretically, 1-bit models could be the most efficient, because modern models have already moved from 32-bit to 16-bit to 8-bit parameters (without quantization).
It's not clear where the efficiency frontier actually is. We're good at measuring size, we're good at measuring FLOPS, we're really not very good at measuring capability. Because of that, we don't really know yet whether we can do meaningfully better at 1 bit per parameter than we currently get out of quantising down to that size. Probably, is the answer, but it's going to be a while before anyone working at 1 bit per param has sunk as many FLOPS into it as the frontier labs have at higher bit counts.
The thing with efficiency is that it is relative to both inference and training compute. If you do quantization, you need a more powerful higher precision model to quantize from, which doesn't exist if you want to create a frontier model. In this case the question is only whether you get better inference and/or training performance from training e.g. a native 1 bit model.
Currently the optimal training precision seems to be 8 bit (at least used by DeepSeek and some other open weight companies). But this might change with different training methods optimized for 1-bit training, like from this paper I linked before: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...
it also reminds me a bit of this diffusion paper [1] which proposes having an encoding layer and a decoding layer but repeats the middle layers until a fixed point is reached. but really there is a whole field of "deep equilibrium models" that is similar. it wouldn't be surprising if large models develop similar circuits naturally when faced with enough data.
finding them on the other hand is not easy! as you've shown, i guess brute force is one way.. it would be nice to find a short cut but unfortunately as your diagrams show, the landscape isn't exactly smooth.
I would also hypothesize that different circuits likely exist for different "problems" and that these are messy and overlapping so the repeated layers that improve math for example may not line up with the repeated layers that improve poetry or whatever, meaning the basic layer repetition is too "simple" to be very general. that said you've obviously shown that there is some amount of generalizing at work, which is definitely interesting.
This is interesting because I've been considering a similar project. I maintain a package for a scientific simulation codebase; it's all in Fortran and C++ with too much template code, which takes ages to build, is very error prone, and is frankly a pain to maintain with its monstrous CMake spaghetti build system. Furthermore, the whole thing would benefit from a rewrite around GPU-based execution, and generally a better separation between the API for specifying the simulation and the execution engine.

So I've been thinking of rewriting it in Jax, and did an initial experiment to port a few of the main classes to Python using Gemini. It did a fairly good job. I want to continue with it, but I'm also a bit hesitant, because this is software that the upstream developers have been working on for 20+ years. The idea of just saying to them "hey look I rewrote this with AI and it's way better now" is not something I would do without giving myself pause for thought. In this case it's not about the license, they already use a permissive one, but just the general principle of suggesting a "replacement" for their work.. if I was doing it by hand it might be different, I don't know, they might appreciate that more, but I have no interest in spending that much time on it.

Probably what I will do is just present the PoC and ask if they think it's worth attempting to auto-convert everything; they might be open to it. But yeah, the possibility of auto-transpiling huge amounts of software for modernization purposes is a really interesting application of AI, amazing to think of all the possibilities. And I'm happy to have read the article, because I certainly didn't think about the copyright implications.
If you really want to do that, the sensible thing is to keep it separate from the original and respect the original license. There would have been no outcry if that happened with chardet. If the different package is genuinely better, it will be used.
I think your last point raises the following question: how would you change your answer if you knew they had read all about guns and death and how one causes the other? What if they'd seen pictures of guns? And pictures of victims of guns, annotated as such? What if they'd seen videos of people being shot by guns?
I mean I sort of understand what you're trying to say but in fact a great deal of knowledge we get about the world we live in, we get second hand.
There are plenty of people who've never held a gun, or had a gun aimed at them, and.. granted, you could argue they probably wouldn't read that line the same way as people who have, but that doesn't mean that the average Joe who's never been around a gun can't enjoy media that features guns.
Same thing about lots of things. For instance it's not hard for me to think of animals I've never seen with my own eyes. A koala for instance. But I've seen pictures. I assume they exist. I can tell you something about their diet. Does that mean I'm no better than an LLM when it comes to koala knowledge? Probably!
It’s more complicated to think about, but it’s still the same result. Think about the structure of a dictionary: all of the words are defined in terms of other words in the dictionary, but if you’ve never experienced reality as an embodied person then none of those words mean anything to you. They’re as meaningless as some randomly generated graph with a million vertices and a randomly chosen set of edges according to some edge distribution that matches what we might see in an English dictionary.
Bringing pictures into the mix still doesn’t add anything, because the pictures aren’t any more connected to real world experiences. Flooding a bunch of images into the mind of someone who was blind from birth (even if you connect the images to words) isn’t going to make any sense to them, so we shouldn’t expect the LLM to do any better.
Think about the experience of a growing baby, toddler, and child. This person is not having a bunch of training data blasted at them. They’re gradually learning about the world in an interactive, multi-sensory and multi-manipulative manner. The true understanding of words and concepts comes from integrating all of their senses with their own manipulations as well as feedback from their parents.
Children also are not blank slates, as is popularly claimed, but come equipped with built-in brain structures for vision, including facial recognition, voice recognition (the ability to recognize mom’s voice within a day or two of birth), universal grammar, and a program for learning motor coordination through sensory feedback.
> The hardest part was figuring out OpenLDAPs configuration syntax, especially the correct ldif incantations ..
As a long-time Linux user on personal machines, I found myself a couple of years ago needing, for the first time, to support a small team and give them all login access to our small cluster. I figured, hey, it's annoying to coordinate user ids over these machines, I should just set up OpenLDAP.. little did I know. Honestly, I'm pretty handy at dealing with Linux, but I was shocked to discover how complicated and annoying it was to set up and use OpenLDAP with NFS-automounted home directories.
For the first time in my life I was like, "oh this is why people spend years studying system administration.."
I did get it working eventually but it was hard to trust it and the configuration GUI was not very good and I never fully got passwd working properly so I had to intervene to help people change their passwords.. in the end we ended up just using manually coordinated local accounts.
The whole time I'm just thinking, I must be missing something, it can't be this bad.. I'm still a bit flabbergasted by the experience.
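For anyone hitting the same wall, the ldif side for a single NSS-visible user looks roughly like this. The suffix `dc=example,dc=org`, the ids, and the paths are placeholders, and this assumes the stock `nis` and `inetorgperson` schemas are loaded; it would be fed to the server with something like `ldapadd -x -D cn=admin,dc=example,dc=org -W -f user.ldif`:

```ldif
# Minimal POSIX user entry (illustrative values throughout).
dn: uid=jdoe,ou=People,dc=example,dc=org
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
cn: Jane Doe
sn: Doe
uid: jdoe
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/jdoe
loginShell: /bin/bash
userPassword: {SSHA}change-me
```

The dynamic `cn=config` (olc*) side is a separate layer of incantations on top of this, which is where much of the pain lives.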