Unusual concept. I'll split my thoughts into implementation and adoption (with regard to design).
Implementation-wise: I tried something similar myself. One language (same core, lib, and built-ins) with two front-ends that had different syntax but similar semantics. (It didn't go far, though.) One issue is not favoring one form over the other. Inevitably you will, in a meta way, if you decide to self-host, since you'll have to pick one form to do it in. Also, having multiple co-existing forms in the same file may complicate tooling.
On that last part, the most similar things I've seen are: (i) Perl Mojo's embedded templates, which can live in the same file as the source code, and (ii) Racket's #lang multi-file, which allows combining, well, multiple files (and thus different #lang directives) in a single file.
Adoption-wise: it's in a weird position for widespread adoption. There's a strong preference for using a single language, which splits into two camps: (1) using a single language across every layer (basically Rust/Zig), (2) using a high-level language with native-competitive performance (Python + numpy, jax, etc. / JS + ultra-fast JIT / Julia).
Currently you target both (one base language) and neither (different syntax/semantics). You could move towards a hybrid approach instead, with one syntax and high-level/low-level forms (I'm not sure what distinguishes kernel/baremetal currently). Some functionality might then surface differently in the two cases, making it more acceptable to both camps. This would probably also simplify tooling creation and maintenance.
Of course, since the project is quite experimental in nature, keeping it the current way is interesting and perfectly acceptable.

TL;DR: Yes (~templating) - Yes (complexity, lower potential adoption) - No (unusual, experimental)
This is fair feedback, and you're pointing at the main tradeoff; it's an intentional one.
One clarification that might help: Falcon isn’t multiple frontends or multiple grammars in the usual sense. The parser accepts all code into a single AST, but during IR lowering the compiler is invoked with exactly one active profile. Nodes whose profile doesn’t match are not lowered to IR at all — they’re rejected before borrow checking, optimization, or codegen.
From the compiler’s point of view, the other profiles never existed. There’s no runtime guard, no macro-style inclusion, and no shared assumptions leaking across profiles.
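Roughly, in pseudo-Python (the names Node, lower_node, and the profile strings are purely illustrative, not the actual implementation):

    # Illustrative sketch only; not Falcon's real data structures or API.
    from dataclasses import dataclass

    @dataclass
    class Node:
        profile: str   # e.g. "app", "kernel", "baremetal"
        expr: str

    def lower_node(node: Node) -> str:
        # Stand-in for real IR lowering.
        return f"ir({node.expr})"

    def lower(ast: list[Node], active_profile: str) -> list[str]:
        # One AST for the whole file, but lowering only ever sees nodes
        # tagged with the single active profile; the rest are dropped
        # before borrow checking, optimization, or codegen.
        return [lower_node(n) for n in ast if n.profile == active_profile]

    # Compiling with the "kernel" profile: the "app" node never existed
    # as far as the rest of the compiler is concerned.
    nodes = [Node("app", "heap_alloc()"), Node("kernel", "mmio_write()")]
    print(lower(nodes, "kernel"))   # -> ['ir(mmio_write())']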
The goal isn’t to let people freely mix levels like unsafe {} in Rust, but to make domain boundaries explicit and enforceable. Kernel/baremetal code has fundamentally different invariants (no heap, no panic, different aliasing rules), and soft escape hatches tend to blur those over time.
That said, I agree this does increase tooling complexity and may reduce adoption. This is very much an experiment to see whether hard separation + single IR is a better tradeoff for certain projects than one-size-fits-all semantics.
Appreciate the comparison examples — they’re useful references.
Crazy. But then again, Unix-like kernels/systems have been implemented so many times as hobby projects and/or parts of university courses that models are probably overfit on them. Seeing Nano, an interesting twist (besides being human-free) would've been to target Nano rather than C.
This is a nice writeup. Thanks. Another commenter said it would've taken them 2h just to sketch out ideas; sans LLMs, it would've taken me more than 2h just to collect all this info, let alone start optimizing it.
It took me about 10 minutes to generate that writeup the old-fashioned, 100% organic way, because one of the things that's unspecified is whether you're allowed to use AI to help solve it! So I assumed that, as it's a job interview question, you're not allowed to, but now I see other comments saying it was allowed. That would let you get much further.
I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.
I've not written a VM before, but the comments in perf_takehome.py and problem.py explain the basics of this.
I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).
Same thought. I doubt they provided additional explanation to candidates - it seems that basic code literacy within the relevant domain is one of the first things being tested.
On the whole I don't think I'd perform all that well on this task given a short time limit but it seems to me to be an extremely well designed task given the stated context. The reference kernel easily fits on a single screen and even the intrinsic version almost does. I think this task would do a good job filtering the people they don't want working for them (and it seems quite likely that I'm borderline or maybe worse by their metric).
This tool seems agent-oriented, meant for them to merge rather than merely check readiness. In that regard, the page doesn't mention anything about human reviewers, only AI reviewers. Honestly, I wouldn't be surprised if the author, someone seemingly running fully agentic workflows, didn't even consider human reviewers. If it's AI start-to-end*, then yes, it could quite possibly push directly to master without much difference.
Call me pessimistic, but considering [1][2][3] (and other similar articles/discussions), I believe this tool will be most useful to AI PR spammers the moment it's modified to also parse non-AI PR comments.
*Random question: is it start-to-end or end-to-end?
edit: P.S. I agree that it's a useful tool, given its design goals, though.
That repo is quintessentially surreal. AI-written code, published in AI-made PRs, reviewed by multiple AI bots (one of which is the same model that wrote the code and made the PR, maybe others too, just accessed via a third-party vendor), merged by AI (assuming dogfooding).
Unless I misread, 2 hours isn't the time limit for the candidate but the time Claude eventually needed to outperform the best returned solution. The best candidate could've taken anywhere from 6 hours to 2 days to achieve this result.
Their Readme.md is weirdly obsessed with "2 hours":
"before Claude Opus 4.5 started doing better than humans given only 2 hours"
"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
"Claude Opus 4.5 after 2 hours in our test-time compute harness"
"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
So that does make one wonder where this comes from. It could just be LLM-generated with a talking point of "2 hours"; models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.
I'd be quite curious to know, though. How I usually design take-home assignments is:
1. The candidate has several _days_ to complete it (usually around a week).
2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
No, the 2 hours is their time limit for candidates. The thing is that you're allowed to use any non-human help for their take-homes (open book), so if AI can solve it in under 2 hours, it's not very good at assessing the human.
Fair enough. I feel like designing AI-proof take-homes is getting ever more futile. Given that the questions need to be sufficiently low-context to be human-doable in a short time, and that timespans for AI tasks keep increasing, I'm not sure take-homes can serve any filtering function whatsoever, besides checking whether applicants are willing to put in a minimal amount of effort.
Cool concept. (Legit project overall too.) Any chance it expands beyond Claude, e.g. Codex, OpenCode? Also, unless I've misunderstood, those commits happen in the currently working branch? If so, an option to have the code/session mix live in an alternative (non-working) branch would be nice too, as not every project would want to fill its history with sessions.
On expanding beyond Claude: there’s no concrete plan right now since we built this around Claude, but we’re very open to it. If you have a preferred CLI (e.g., Codex, OpenCode, or something else), feel free to open an issue in the repo. Or just describe your use case here and I will do it :)
Regarding branches: the tool does not pollute your working branch. Each session lives on a separate “session” branch that contains all prompts and operations. Your normal working branch stays clean.
When you end a session, you're prompted to either:
- merge the code changes into your working branch, or
- discard them.
If the video or README didn't make this clear enough, I'd appreciate the feedback; I'll update the docs accordingly.
>Managing agents, crafting skills, building docs, designing workflows
You're describing the modern edition of people obsessed with their "development" environments. The ones who treated their system (usually Linux) and text editor (usually Vim or Emacs) like a canvas, perfecting their configuration the way an artist refines a masterwork. Choosing packages and themes like a painter choosing brushes. Younger people of this mindset are now obsessed with multiple LLMs, multi-agent workflows, MCPs, and similar.
In contrast, there's the modern version of the people who used to just open an IDE and copy-paste snippets until they got the result they wanted. Now, those same people simply open Claude Code and prompt: "make me this app", "modify this", "do this more like that", and so on. Those are vibe coders. The only thing that's changed is a lower barrier, less effort, and faster development; yet somehow higher quality since SOTA LLMs output better code than most juniors used to.
And last, there's the middle ground: people who set up their environment without it becoming the main focus.
That's an interesting point. One I wish I'd made, haha. This article is for people who are "into" this stuff (tech). Who live and breathe it. Who've been doing it since they were kids and are just getting into agentic coding.