bd01's comments

bd01 · on Jan 2, 2025

This is pretty bad. Let's start with the very first instruction:

  mov rax, 1

An actual "mov rax, 1" would assemble to 48 B8 01 00 00 00 00 00 00 00, a whopping TEN bytes.

nasm will optimize this to the equivalent "mov eax, 1", that's 5 bytes (B8 01 00 00 00), but still:

  xor eax, eax ; 2 bytes
  inc eax      ; 2 bytes

would be much smaller. Second line:

  mov rdi, 1

You already have the value 1 in eax, so a "mov edi, eax" (two bytes) would suffice. Etc. etc.

xpasky · on Jan 2, 2025

  push 1
  pop rax

is even shorter (credit: https://old.reddit.com/r/programming/comments/q6mnz1/what_is...)
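
To put the sizes mentioned so far side by side, here is a small tally in Python. It is only a sketch: the hex strings are the standard 64-bit-mode encodings as I understand them, and the `ENCODINGS` table and `size` helper are names made up for this illustration.

```python
# Ways to get the value 1 into rax, with their 64-bit-mode encodings.
ENCODINGS = {
    "mov rax, 1 (imm64 form)": "48 B8 01 00 00 00 00 00 00 00",
    "mov eax, 1":              "B8 01 00 00 00",
    "xor eax, eax / inc eax":  "31 C0 FF C0",
    "push 1 / pop rax":        "6A 01 58",
}

def size(hex_bytes):
    """Number of bytes in a space-separated hex string."""
    return len(bytes.fromhex(hex_bytes))

# Print the variants from largest to smallest.
for name, enc in sorted(ENCODINGS.items(), key=lambda kv: -size(kv[1])):
    print(f"{size(enc):2d} bytes  {name}")
```

Writes to a 32-bit register zero the upper half of the 64-bit register, which is why the eax forms are equivalent here.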

musicale · on Jan 3, 2025

I feel like I shouldn't love x86 encoding, but there is something charming about this. Probably echoing its 8-bit predecessors. It seems like it's designed for tiny memory environments (embedded, bootstrapping, etc.) where you don't mind taking a hit for memory access.

rep_lodsb · on Jan 2, 2025

Linux initializes all general purpose registers to zero. It's not documented AFAIK, but should be reliable - it has to init them to some value anyway to avoid leaking kernel state. So you can get away with:

    mov     al,1       ;write
    mov     edi,eax    ;handle=stdout
    mov     esi,msg    ;assumes load address below 4G
    mov     dl,msg.len
    syscall
    mov     al,60      ;assuming syscall succeeded, EAX was bytes written
    xor     edi,edi
    syscall

The load address stays constant unless there's some magic GNU extension header to enable ASLR. If we could get the code loaded below 64K, we could save another byte by using SI instead of ESI; however this doesn't work by default, you'd have to run 'echo 0 > /proc/sys/vm/mmap_min_addr' as root first.
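
For a rough byte count of the snippet above, the sketch below adds up the encodings. The address of msg is a zero placeholder, and the 14-byte message length is an assumption taken from the "mov rdx, 14" used elsewhere in this thread; `SNIPPET` and `total` are illustrative names.

```python
# (mnemonic, 64-bit-mode encoding) for each instruction in the snippet.
SNIPPET = [
    ("mov al, 1",       "B0 01"),
    ("mov edi, eax",    "89 C7"),
    ("mov esi, msg",    "BE 00 00 00 00"),  # imm32 placeholder for msg's address
    ("mov dl, msg.len", "B2 0E"),           # assumes a 14-byte message
    ("syscall",         "0F 05"),
    ("mov al, 60",      "B0 3C"),
    ("xor edi, edi",    "31 FF"),
    ("syscall",         "0F 05"),
]

total = sum(len(bytes.fromhex(enc)) for _, enc in SNIPPET)
print(total)  # 19
```

Only "mov esi, msg" is longer than two bytes, which is why shrinking the address to 16 bits is the next thing to look at.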

bd01 · on Jan 2, 2025

Initial register state is documented to be undefined except for rbp, rsp and rdx [1].

Can you say for certain that no other Linux version ever used GPRs to pass something else?

[1] System V ABI, page 29 (last line) and 30, https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf

rep_lodsb · on Jan 2, 2025

For certain? No, but I wouldn't expect it. Not sure what that function pointer in rdx is intended for, but Linux doesn't use it.

(Note for pedants: rsp is technically a "general purpose register", but of course it is initialized to point to the userspace stack instead of zero.)

retrac · on Jan 2, 2025

Assuming it is initially zero,

   inc eax

is a byte shorter than mov al, 1

rep_lodsb · on Jan 2, 2025

Yes, but only in 32 bit mode. Not that it matters, except for the hypothetical future processor or Linux kernel that is no longer compatible with that :)

michidk · on Jan 3, 2025

I was able to shave off one additional byte with this:

  ...
  xor rax, rax       ; = 0
  inc rax            ; = 1 - syscall: sys_write
  mov rdi, rax       ; copy 1 - file descriptor: stdout
  lea rsi, [rel msg] ; pointer to message
  mov rdx, 14        ; message length
  syscall
  ...

  $ nasm -f bin -o elf elf.asm; wc -c elf; ./elf
  166 elf
  Hello, World!

So I guess NASM already optimizes this quite well

However, using the stack-based instructions as xpasky hinted at:

  ...
  push 1             ; syscall: sys_write
  pop rax
  pop rdi       ; copy 1 - file descriptor: stdout
  lea rsi, [rel msg] ; pointer to message
  push 14            ; message length
  pop rdx
  syscall
  ...

I get down to 159 bytes! I updated the article to reflect that

bd01 · on Jan 3, 2025

That second snippet is pretty funny:

  push 1
  pop rax
  pop rdi

You can't push a value once and pop it twice, that's not how a stack works! You're popping something else off the stack. So why does this even work?

Linux passes your program arguments on the stack, with argc on top. So when you don't pass any arguments, argc just HAPPENS to be 1. Which you then pop into rdi. Gross!
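
A toy model of this in Python, treating the stack as a list whose last element is the top. It is a sketch only: `run_snippet` is a hypothetical helper, the argv entries stand in for pointers, and the NULL terminator and environment below them are left out.

```python
def run_snippet(args):
    """Model `push 1 / pop rax / pop rdi` on the stack a new process starts with."""
    argv = ["./elf"] + list(args)
    # At entry, argc is on top, with argv[0], argv[1], ... below it.
    stack = list(reversed(argv)) + [len(argv)]
    stack.append(1)    # push 1
    rax = stack.pop()  # pop rax -> the 1 we just pushed
    rdi = stack.pop()  # pop rdi -> argc, not 1
    return rax, rdi

print(run_snippet([]))          # (1, 1): works by accident
print(run_snippet(["a", "b"]))  # (1, 3): write() now targets fd 3
```

So the program would only print to stdout when it is run without arguments.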

michidk · on Jan 4, 2025

Of course - you are completely right, an oversight in wanting to correct my mistake as quickly as possible.

With that fixed, is there any reason not to use push here?

bd01 · on Jan 5, 2025

Yes, because:

  push 1       ; 6A 01 (2 bytes)
  pop rdi      ; 5F    (1 byte)

is longer than a simple:

  mov edi, eax ; 89 C7 (2 bytes)

michidk · on Jan 5, 2025

I think your statement might only apply to 32 bit (one of the constraints mentioned early in the blog post was 64 bit).

But even if it was 32 bit, then we wouldn't have to copy a 1, since the syscall number for sys_write would be 4 instead of 1.

I get the same total size with both variants in 64 bit mode.

  push 1
  pop rax
  mov rdi, rax

Assembling to 48 89 C7 (3 bytes)

seems to be same in size as

  push 1
  pop rax
  push 1
  pop rdi

Assembling to 6A 01 5F (3 bytes)

bd01 · on Jan 5, 2025

That's because you're using `mov rdi, rax` again. You keep changing `edi, eax` to `rdi, rax`. Why?

The default operand size in 64-bit mode is, for most instructions, still 32 bits. So `mov edi, eax` encodes the same in 32- and 64-bit mode.

For `mov rdi, rax` you need an extra REX prefix byte [1], that's the 48 you're seeing above, but you don't need it here.

[1] https://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefi...
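
To see what that 48 byte says, here is a small decoder for the REX prefix layout (0100WRXB) in Python. It is a sketch for illustration: `decode_rex` is a made-up helper, and it only inspects the first byte of each encoding.

```python
def decode_rex(byte):
    """Split a REX prefix (binary 0100WRXB) into its flag bits; None if not REX."""
    if (byte & 0xF0) != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the ModRM rm / SIB base field
    }

for enc in ("48 89 C7", "89 C7"):  # mov rdi, rax  vs  mov edi, eax
    first = bytes.fromhex(enc)[0]
    print(enc, "->", decode_rex(first))
```

The first encoding starts with a REX prefix with only W set; the second starts directly with the opcode 89. In 32-bit mode the bytes 40 to 4F are the one-byte inc/dec instructions instead, which is why "inc eax" costs two bytes (FF C0) in 64-bit mode.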

michidk · on Jan 5, 2025

okay, I didn't know that, thanks for the background. I wonder why the assembler would not optimize this though.

I noticed that I could then also shave off one more byte by using lea esi, [rel msg] instead of lea rsi, [rel msg].

michidk · on Jan 4, 2025

should be

  ...
  push 1             ; syscall: sys_write
  pop rax
  push 1
  pop rdi

of course

michidk · on Jan 3, 2025

Thanks, that makes total sense. I was so focused on the ELF part that I didn't even consider optimizing the initial assembly further. Will fix it and edit the article.