> Using `lea` […] is useful if both of the operands are still needed later on in other calculations (as it leaves them unchanged)
As well as making it possible to preserve the values of both operands, it’s also occasionally useful to use `lea` instead of `add` because it preserves the CPU flags.
Funny to see a comment on HN raising this exact point, when just ~2 hours ago I was writing inline asm that used `lea` precisely to preserve the carry flag before a jump table! :)
I'm not them but whenever I've used it it's been for arch specific features like adding a debug breakpoint, synchronization, using system registers, etc.
Never for performance. If I wanted to hand optimise code I'd be more likely to use SIMD intrinsics, play with C until the compiler does the right thing, or write the entire function in a separate asm file for better highlighting and easier handling of state at the ABI boundary, rather than mid-function like the carry flags mentioned above.
Generally inline assembly is much easier these days as a) the compiler can see into it and make optimizations b) you don’t have to worry about calling conventions
> the compiler can see into it and make optimizations
Those writing assembler typically think (or know) they can do better than the compiler, so that isn't necessarily a good thing.
(Similarly, veltas' comment above about "play with C until the compiler does the right thing" is brittle. You don't even need to change compiler flags to make it suddenly stop doing the right thing. On the other hand, when compiling for a different version of the CPU architecture, the compiler can fix things, too.)
It's rare that I see compiler-generated assembly without obvious drawbacks in it. You don't have to be an expert to spot them. But frequently the compiler also finds improvements I wouldn't have thought of. We're in the centaur-chess moment of compilers.
Generally playing with the C until the compiler does the right thing is slightly brittle in terms of performance but not in terms of functionality. Different compiler flags or a different architecture may give you worse performance, but the code will still work.
Centaur-chess?
https://en.wikipedia.org/wiki/Advanced_chess:
“Advanced chess is a form of chess in which each human player uses a computer chess engine to explore the possible results of candidate moves. With this computer assistance, the human player controls and decides the game.
Also called cyborg chess or centaur chess, advanced chess was introduced for the first time by grandmaster Garry Kasparov, with the aim of bringing together human and computer skills to achieve the following results:
- increasing the level of play to heights never before seen in chess;
- producing blunder-free games with the qualities and the beauty of both perfect tactical play and highly meaningful strategic plans;
- offering the public an overview of the mental processes of strong human chess players and powerful chess computers, and the combination of their forces.”
> “play with C until the compiler does the right thing” is brittle
It's brittle depending on your methods. If you understand a little about optimizers and give the compiler the hints it needs to do the right things, then that should work with any modern compiler, and is more portable (and easier) than hand-optimizing in assembly straight away.
Of course you can often beat the compiler, humans still vectorize code better. And that interpreter/emulator switch-statement issue I mentioned in the other comment. There are probably a lot of other small niches.
In the general case you're right. Modern compilers are beasts.
Might be an interpreter or an emulator. That’s where you often want to preserve registers or flags and have jump tables.
This is one of the remaining cases where the current compilers optimize rather poorly: when you have a tight loop around a huge switch-statement, with each case-statement performing a very small operation on common data.
In that case, a human writing assembler can often beat a compiler with a huge margin.
I'm curious if that's still the case generally after things like musttail attributes to help the compiler emit good assembly for well structured interpreter loops:
https://blog.reverberate.org/2025/02/10/tail-call-updates.ht...
> x86 is unusual in mostly having a maximum of two operands per instruction[2]
Perhaps interesting for those who aren't up to date, the recent APX extension allows 3-operand versions of most of the ALU instructions with a new data destination, so we don't need to use temporary registers - making them more RISC-like.
The downside is they're EVEX encoded, which adds a 4-byte prefix to the instruction. It's still cheaper to use `lea` for an addition, but now we will be able to do things like
https://www.intel.com/content/www/us/en/developer/articles/t...
Loving this series! I'm currently implementing a z80 emulator (gameboy) and it's my first real introduction to CISC, and it's really pushing my assembly / machine code skills - so having these blog posts coming from the "other direction" is really interesting and gives me some good context.
I've implemented toy languages and bytecode compilers/vms before but seeing it from a professional perspective is just fascinating.
That being said it was totally unexpected to find out we can use "addresses" for addition on x86.
A seasoned C programmer knows that "&arr[index]" is really just "arr + index" :) So in a sense, the optimizer rewrote "x + y" into "(int)&(((char*)x)[y])", which looks scarier in C, I admit.
> so "5[arr]" is just as valid as "arr[5]"
This is, I am sure, one of the stupid legacy reasons we still write "lr a0, 4(a1)" instead of the more sensible "lr a0, a1[4]". The other one is that FORTRAN used round parentheses for both array access and function calls, so it stuck somehow.
Generally such constant offsets are record fields in intent, not array indices. (If they were array indices, they'd need to be variable offsets obtained from a register, not immediate constants.) It's reasonable to think of record fields as functions:
.equ car, 0
.equ cdr, 8

        .globl length
length: test %rdi, %rdi         # nil?
        jz 1f                   # return 0
        mov cdr(%rdi), %rdi     # recurse on tail of list
        call length
        inc %rax
        ret
1:      xor %eax, %eax
        ret
To avoid writing out all the field offsets by hand, ARM's old assembler and I think MASM come with a record-layout-definition thing built in, but gas's macro system is powerful enough to implement it without having it built into the assembler itself. It takes about 13 lines of code: http://canonical.org/~kragen/sw/dev3/mapfield.S
Alternatively, on non-RISC architectures, where the immediate constant isn't constrained to a few bits, it can be the address of an array, and the (possibly scaled) register is an index into it. So you might have startindex(,%rdi,4) for the %rdi'th start index:
        .data
startindex:
        .long 1024

        .text
        .globl length
length: mov (startindex+4)(,%rdi,4), %eax
        sub startindex(,%rdi,4), %eax
        ret
If the PDP-11 assembler syntax had been defined to be similar to C or Pascal rather than Fortran or BASIC we would, as you say, have used startindex[%rdi,4].
This is not very popular nowadays both because it isn't RISC-compatible and because it isn't reentrant. AMD64 in particular is a kind of peculiar compromise—the immediate "offset" for startindex and endindex is 32 bits, even though the address space is 64 bits, so you could conceivably make this code fail to link by placing your data segment in the wrong place.
(Despite stupid factionalist stuff, I think I come down on the side of preferring the Intel syntax over the AT&T syntax.)
Yes, I find this one of the weird things about assembly - appending (or prepending?) a number means addition?! - even after many many years of occasionally reading/writing assembly, I'm never completely sure what these instructions do, so I infer from context.
As a side note of appreciation, I think we can't do better than what he did here: being transparent that an LLM was used, but only for proof-reading.
Agreed that it's nice he acknowledged it, but proof-reading is about as innocuous a task for LLMs as they come. Because you actually wrote the content and know its meaning (or at least its intended meaning), you can instantly tell when to discard anything irrelevant from the LLM. At worst, it's no better than just skipping that review step.
It does make me curious about what the super anti-AI people will do.
Matt Godbolt is, obviously, extremely smart and has a lot of interesting insight as a domain expert. But... this was LLM-assisted.
So, anyone who has previously said they'll never (knowingly) read anything that an AI has touched (or expressed a similar sentiment): are you going to skip this series? Make an exception?
I think most people wouldn't call proof-reading 'assistance'. As in, if I ask a colleague to review my PR, I wouldn't say he assisted me.
I've been throwing my PR diffs at Claude over the last few weeks. It spits out a lot of useless or straight-up wrong stuff, but sometimes, among the insanity, it manages to catch one typo or another that a human missed, and between letting a bug pass and spending an extra 10 minutes per PR going through the nothingburgers Claude throws at me, I'd rather lose the 10 minutes.
Just not use it. I couldn't care less if other people spend hours prompt-engineering to get something that approaches useful output. If they want their reputation staked on its output, that's on them. The results are already in, and they're not pretty.
I just personally think it's absurd to spend trillions of dollars and watts to create an advanced spell checker. Even more so to see this as a "revolution" of any sort or to not expect a new AI-winter once this bubble pops.
This triggers a vague memory of trying to figure out why my assembler (masm?) was outputting a LEA instead of a MOV. I can't remember why. Maybe LEA was more efficient, or MOV didn't really support the addressing mode and the assembler just quietly fixed it for you.
In any case, I felt slightly betrayed by the assembler for silently outputting something I didn't tell it to.
LEA and MOV are doing different things. LEA is just calculating the effective address, but MOV calculates the address then retrieves the value stored at that address.
e.g. If base + (index * scale) + offset = 42, and the value at address 42 is 3, then:
LEA rax, [base + index * scale + offset] will set rax = 42
MOV rax, [base + index * scale + offset] will set rax = 3
Hey now; let's not get ahead too far :) I'm trying to keep each one bite-sized...I don't think you'll be (too) disappointed at the next few episodes :)
>However, in this case it doesn’t matter; those top bits are discarded when the result is written to the 32-bit eax.
>Those top bits should be zero, as the ABI requires it: the compiler relies on this here. Try editing the example above to pass and return longs to compare.
Sorry, I don't understand. How could the compiler both discard the top bits, and also rely on the top bits being zero? If it's discarding the top bits, it won't matter whether the top bits are zero or not, so it's not relying on that.
He's actually wrong on the ABI requiring the top bits to be 0. It only requires that the bottom 32 bits match the parameter, but the top bits of a 32-bit parameter passed in a 64-bit register can be anything (at least on Linux).
The reason the code in his post works is because the upper 32 bits of the parameters going into an addition can't affect the low 32 bits of the result, and he's only storing the low 32 bits.
The LLVM x86-64 ABI requires the top bits to be zero. GCC treats them as undefined. Until a recent clarification, the x86-64 psABI made the upper bits undefined by omission only, which is why I think most people followed the GCC interpretation.
In theory. In practice the vast majority of Linux userland programs are compiled with GCC so unless GCC did something particularly braindead they are unlikely to break compatibility with that and so it's the ABI everyone needs to target. Which is also what happened in this case: The standard was updated to mandate the GCC behavior.
(Almost) any instruction on x64 that writes to a 32-bit register as its destination writes the lower 32 bits of the value into the lower 32 bits of the full 64-bit register and zeroes out the upper 32 bits of the full register. He touched on it in his previous note "why xor eax, eax".
But the funny thing is, the x64-specific supplement to the SysV ABI doesn't actually specify whether the top bits should be zeroed or not (and so whether the compiler can rely on, e.g., a function returning int to have its upper 32 bits zeroed, or whether those could be garbage), and historically GCC and Clang diverged in their behaviour.
> However, in this case it doesn’t matter; those top bits are discarded when the result is written to the 32-bit eax.
Fun (but useless) fact: This being x86, of course there are at least three different ways [1] to encode this instruction: the way it was shown, with an address size override prefix (giving `lea eax, [edi+esi]`), or with both a REX prefix and an address size override prefix (giving `lea rax, [edi+esi]`).
And if you have a segment with base=0 around you can also add in a segment for fun: `lea rax, cs:[edi+esi]`
[1]: not counting redundant prefixes and different ModRMs
LEA is a beautiful example of instruction reuse. Designed for pointer arithmetic, repurposed for efficient addition. It's a reminder that good ISA design leaves room for creative optimization - and that compilers can find patterns human assembly programmers might miss.
Human assembly programmers on the 8086 used LEA all the fucking time. And I'm not sure good ISA design is characterized by the need for ingenious hacks to get the best mileage out of the hardware; rather the opposite, in my view. The ARM2's ISA design is head and shoulders better than the 8086's.
This trick is something we teach our students when we do 6809 assembly (mainly as a trick to do addition on the index registers). I had no idea it was used as an optimisation in x86.
What are the current best resources to learn assembly? So that I can understand the output for simple functions. I don't want to learn to write it properly, I just want to be able to understand what's happening.
You can select the assembly output (I like RISCV but you can pick ARM, x86, mips, etc with your choice of compiler) and write your own simple functions. Then put the original function and the assembly output into an LLM prompt window and ask for a line-by-line explanation.
Also very useful to get a copy of Computer Organization and Design RISC-V Edition: The Hardware Software Interface, by Patterson and Hennessy.
Honestly, x86 is not nearly as CISC as those go. It just has somewhat developed addressing modes compared to the utterly anemic "register plus constant offset" one, and you are allowed to fold some load-arithmetic-store combinations into a single instruction. But that's it: no double- or triple-indexing or anything like what VAXen had.
One of my biggest bugbears in CS instruction is the undue emphasis on RISC vs. CISC, especially as there aren't any really good models to show you what the differences are, given the winnowing of ISAs. In John Mashey's infamous posts [1], sort of delineating an ordered list from most RISCy to most CISCy, the most successful architectures have been the ones that really crowded the RISC/CISC line - ARM and x86.
It also doesn't help that, since x86 is the main goto example for CISC, people end up not having a strong grasp on what features of x86 make it actually CISC. A lot of people go straight to its prefix encoding structure or its ModR/M encoding structure, but honestly, the latter is pretty much just a "compressed encoding" of RISC-like semantics, and the former is far less insane than most people give it credit for. But x86 does have a few weird, decidedly-CISC instruction semantics in it--these are the string instructions like REP MOVSB. Honestly, take out about a dozen instructions, and you could make a solid argument that modern x86 is a RISC architecture!
> these are the string instructions like REP MOVSB
AArch64 nowadays has somewhat similar CPY* and SET* instructions. Does that make AArch64 CISC? :-) (Maybe REP SCASB/CMPSB/LODSB (the latter being particularly useless) is a better example.)
The classic distinction is that a CISC has data processing instructions with memory operands, and in a RISC they only take register parameters. This gets fuzzy though when you look at AArch64 atomic instructions like ldadd which do read-modify-write all in a single instruction.
Eh, that's really just a side effect of almost 50 years of constant evolution from an 8-bit microprocessor. Take a look at the VAX [0], for instance: its instruction encoding is pretty clean, yet it's an actual example of a CISC ISA that was impossible to speed up. Literally: DEC engineers tried very hard and concluded that a truly pipelined and super-scalar implementation was basically impossible, so DEC had to move to Alpha. See [1] for more from John Mashey.
Edit: the very, very compressed TL;DR is that if you do only one memory load (or one memory load plus a store back to that exact location) per instruction, it scales fine. But the moment you start doing chained loads, with pre- and post-increments that are supposed to write the changed values back to memory and be visible, and you have several memory sources, and your memory model is actually "strong consistency", well, you're in a world of pain.
Would this matter for performance? You already have so many execution units that are actually difficult to keep fully fed even when decoding instructions and data at the speed of cache.
Yes. As Joker_vD hints on a sibling comment, this is what killed all the classic CISCs during the OoO transition except for x86 that lacks the more complex addressing modes (and the PPro was still considered a marvel of engineering that was assumed not to be possible).
Do we really know that LEA is using the hardware memory address computation units? What if the CPU frontend just redirects it to the standard integer add units/execution ports? What if the hardware memory address units use those too?
It would be weird to have 2 sets of different adders.
The modern Intel/AMD CPUs have distinct ALUs (arithmetic-logic units, where additions and other integer operations are done; usually between 4 ALUs and 8 ALUs in recent CPUs) and AGUs (address generation units, where the complex addressing modes used in load/store/LEA are computed; usually 3 to 5 AGUs in recent CPUs).
Modern CPUs can execute up to between 6 and 10 instructions within a clock cycle, and up to between 3 and 5 of those may be load and store instructions.
So they have a set of execution units that allow the concurrent execution of a typical mix of instructions. Because a large fraction of the instructions generate load or store micro-operations, there are dedicated units for address computation, to not interfere with other concurrent operations.
Not too versed here, but given that ADD seems to have more execution ports to pick from (e.g. on Skylake), I'm not sure that's an argument in favor of lea. I'd guess that LEA not touching flags and consuming fewer uops (comparing a single simple LEA to 2 ADDs) might be better for out of order execution though (no dependencies, friendlier to reorder buffer)
But can the frontend direct these computations based on what's available? If it sees 10 LEA instructions in a row, and it has 5 AGU units, can it dispatch 5 of those LEA instructions to other ALUs?
Or is it guaranteed that a LEA instruction will always execute on an AGU, and an ADD instruction always on an ALU?
No recent Intel/AMD CPU executes LEA or other instructions directly; they are decoded into 1 or more micro-operations.
The LEA instructions are typically decoded into either 1 or 2 micro-operations. The addressing modes that add 3 components are usually decoded into 2 micro-operations, like also the obsolete 16-bit addressing modes.
The AGUs probably have some special forwarding paths for the results towards the load/store units, which do not exist in ALUs. So it is likely that 1 of the up to 2 LEA micro-operations is executed only in AGUs. On the other hand, when there are 2 micro-operations, it is likely that 1 of them can be executed in any ALU. It is also possible for the micro-operations generated by a LEA to be different from those of actual load/store instructions, so that they may also be executed in ALUs. This is decided by the CPU designer, and it would not be surprising if LEAs were processed differently in various CPU models.
> It would be weird to have 2 sets of different adders.
Not really. CPUs often have limited address math available separately from the ALU. On simple cores, it looks like a separate incrementer for the Program Counter, on x86 you have a lot of addressing modes that need a little bit of math; having address units for these kinds of things allows more effective pipelining.
> Do we really know that LEA is using the hardware memory address computation units?
There are ways to confirm. You need an instruction stream that fully loads the ALUs, without fully loading dispatch/commit, so that ALU throughput is the limit on your loop; then if you add an LEA into that instruction stream, it shouldn't increase the cycle count because you're still bottlenecked on ALU throughput and the LEA does address math separately.
You might be able to determine if LEAs can be dispatched to the general purpose ALUs if your instruction stream is something like all LEAs... if the throughput is higher than what could be managed with only address units, it must also use ALUs. But you may end up bottlenecked on instruction commit rather than math.
It (LEA) does all the work of a memory access (the address computation part) without actually performing the memory access.
Instead of reading from memory at "computed address value" it returns "computed address value" to you to use elsewhere.
The intent was likely to compute the address values for MOVS/MOVSB/MOVSW/MOVSD/MOVSQ when setting up a REP MOVS (or other repeated string operation). But it turned out they were useful for doing three operand adds as well.
It's due to the way the instruction is encoded. `lea` would've needed special treatment in syntax to remove the brackets.
In `op reg1, reg2`, the two registers are encoded as 3 bits each in the ModRM byte, which follows the opcode. Obviously, we can't fit 3 registers in the ModRM byte because it's only 8 bits.
In `op reg1, [reg2 + reg3]`, reg1 is encoded in the ModRM byte. The 3 bits that were previously used for reg2 are instead `0b100`, which indicates a SIB byte follows the ModRM byte. The SIB (Scale-Index-Base) byte uses 3 bits each for reg2 and reg3 as the base and index registers.
In any other instruction, the SIB byte is used for addressing, so the syntax of `lea` is consistent with the way it is encoded.
When you encode an x86 instruction, your operands amount to either a register name, a memory operand, or an immediate (of several slightly different flavors). I'm no great connoisseur of ISAs, but I believe this basic trichotomy is fairly universal for ISAs. The operands of an LEA instruction are the destination register and a memory operand [1]. LEA happens to be the unique instruction where the memory operand is not dereferenced in some fashion in the course of execution; it doesn't make a lot of sense to create an entirely new syntax that works only for a single instruction.
[1] On a hardware level, the ModR/M encoding of most x86 instructions allows you to specify a register operand and either a memory or a register operand. The LEA instruction only allows a register and a memory operand to be specified; if you try to use a register and register operand, it is instead decoded as an illegal instruction.
> LEA happens to be the unique instruction where the memory operand is not dereferenced
Not quite unique: the now-deprecated Intel MPX instructions had similar semantics, e.g. BNDCU or BNDMK. BNDLDX/BNDSTX are even weirder as they don't compute the address as specified but treat the index part of the memory operand separately.
The way I rationalize it is that you're getting the address of something. A raw address isn't what you want the address of, so you're doing something like &(*(rdi+rsi)).
LEA stands for Load Effective Address, so the syntax is as-if you're doing a memory access, but you are just getting the calculated address, not reading or writing to that address.
LEA would normally be used for things like calculating address of an array element, or doing pointer math.
I'm curious, what are you working on that requires writing inline assembly?
I worked on a C codebase once, integrating an i2c sensor. The vendor only had example code in asm. I had to learn to inline asm.
It still happens in 2025
This guy is tricking us into learning assembly! Get 'em!!
My nefarious plan has been exposed!!
I hope you have a moustache you can twiddle while saying this. Possibly followed by a "Nyah!"
The horrifying side effect of this is that "arr[idx]" is equal to "idx[arr]", so "5[arr]" is just as valid as "arr[5]".
Your colleagues would probably prefer if you forget this.
Mom, please come pick me up. These kids are scaring me.
> so "5[arr]" is just as valid as "arr[5]"
This is, I am sure, one of the stupid legacy reasons we still write "lr a0, 4(a1)" instead of more sensible "lr a0, a1[4]". The other one is that FORTRAN used round parentheses for both array access and function calls, so it stuck somehow.
Generally such constant offsets are record fields in intent, not array indices. (If they were array indices, they'd need to be variable offsets obtained from a register, not immediate constants.) It's reasonable to think of record fields as functions:
To avoid writing out all the field offsets by hand, ARM's old assembler and I think MASM come with a record-layout-definition thing built in, but gas's macro system is powerful enough to implement it without having it built into the assembler itself. It takes about 13 lines of code: http://canonical.org/~kragen/sw/dev3/mapfield.SAlternatively, on non-RISC architectures, where the immediate constant isn't constrained to a few bits, it can be the address of an array, and the (possibly scaled) register is an index into it. So you might have startindex(,%rdi,4) for the %rdi'th start index:
If the PDP-11 assembler syntax had been defined to be similar to C or Pascal rather than Fortran or BASIC we would, as you say, have used startindex[%rdi,4].
This is not very popular nowadays both because it isn't RISC-compatible and because it isn't reentrant. AMD64 in particular is a kind of peculiar compromise—the immediate "offset" for startindex and endindex is 32 bits, even though the address space is 64 bits, so you could conceivably make this code fail to link by placing your data segment in the wrong place.
(Despite stupid factionalist stuff, I think I come down on the side of preferring the Intel syntax over the AT&T syntax.)
Yes, I find this one of the weird things about assembly - appending (or prepending?) a number means addition?! - even after many, many years of occasionally reading/writing assembly, I'm never completely sure what these instructions do, so I infer from context.
That depends on sizeof(*arr) no?
Not in C no, since arithmetic on a pointer is implicitly scaled by the size of the value being pointed at (this statement is kind of breaking the abstraction ... oh well).
Nope, a[b] is equivalent to *(a + b) regardless of the types of a and b.
Given that, why don't we use just `*(a + b)` everywhere?
Wouldn't that be more verbose and less confusing? (genuinely asking)
Do you really think that `*(a + i)` is clearer than `a[i]`?
Not necessarily. I think it's confusing when there are two fairly close ways to express the same thing.
As a side note of appreciation, I think that we can't do better than what he did for being transparent that LLM was used but still just for the proof-reading.
Agreed that it's nice he acknowledged it, but proof-reading is about as innocuous a task for LLMs as they come. Because you actually wrote the content and know its meaning (or at least intended meaning), you can instantly tell when to discard anything irrelevant from the LLM. At worst, it's no better than just skipping that review step.
It does make me curious about what the super anti-ai people will do.
Matt Godbolt is, obviously, extremely smart and has a lot of interesting insight as a domain expert. But... this was LLM-assisted.
So, anyone who has previously said they'll never (knowingly) read anything that an AI has touched (or similar sentiment): are you going to skip this series? Make an exception?
I think most people wouldn't call proof-reading 'assistance'. As in, if I ask a colleague to review my PR, I wouldn't say he assisted me.
I've been throwing my PR diffs at Claude over the last few weeks. It spits out a lot of useless or straight-up wrong stuff, but sometimes among the insanity it manages to catch one or another typo that a human missed, and between letting a bug pass or spending an extra 10m per PR going through the nothingburgers Claude throws at me, I'd rather lose the 10m.
> what the super anti-ai people will do.
Just not use it. I couldn't care less if other people spend hours prompt engineering to get something that approaches useful output. If they want their reputation staked on its output, that's on them. The results are already in and they're not pretty.
I just personally think it's absurd to spend trillions of dollars and watts to create an advanced spell checker. Even more so to see this as a "revolution" of any sort or to not expect a new AI-winter once this bubble pops.
The question was about reading stuff ai touched, not using ai stuff. But I'm glad you could get that off your chest!
This triggers a vague memory of trying to figure out why my assembler (masm?) was outputting a LEA instead of a MOV. I can't remember why. Maybe LEA was more efficient, or MOV didn't really support the addressing mode and the assembler just quietly fixed it for you.
In any case, I felt slightly betrayed by the assembler for silently outputting something I didn't tell it to.
LEA and MOV are doing different things. LEA is just calculating the effective address, but MOV calculates the address then retrieves the value stored at that address.
e.g. If base + (index * scale) + offset = 42, and the value at address 42 is 3, then:
LEA rax, [base + index * scale + offset] will set rax = 42
MOV rax, [base + index * scale + offset] will set rax = 3
I assumed they're referring to register-register moves?
OK, so:
LEA eax, [ebx]
instead of:
MOV eax, ebx
But of course:
MOV eax, [ebx]
is not the same.
The text mentions that it can also do multiplication but doesn't expand on that.
E.g. for x * 5 gcc issues lea eax, [rdi+rdi*4].
It also says the multiplier must be one of 2, 4 or 8.
So I guess this trick then only works for multiplication by 2, 3, 4, 5, 8 or 9?
The tricks to avoid multiplication (and division) are probably worth a whole post.
But with -Os you get imul eax, edi, 6. And on modern CPUs multiplication might not be actually all that slow (but there may be fewer multiply units).
Hey now; let's not get ahead too far :) I'm trying to keep each one bite-sized...I don't think you'll be (too) disappointed at the next few episodes :)
>However, in this case it doesn’t matter; those top bits are discarded when the result is written to the 32-bit eax.
>Those top bits should be zero, as the ABI requires it: the compiler relies on this here. Try editing the example above to pass and return longs to compare.
Sorry, I don't understand. How could the compiler both discard the top bits, and also rely on the top bits being zero? If it's discarding the top bits, it won't matter whether the top bits are zero or not, so it's not relying on that.
He's actually wrong on the ABI requiring the top bits to be 0. It only requires that the bottom 32 bits match the parameter, but the top bits of a 32-bit parameter passed in a 64-bit register can be anything (at least on Linux).
You can see that in this godbolt example: https://godbolt.org/z/M1ze74Gh6
The reason the code in his post works is because the upper 32 bits of the parameters going into an addition can't affect the low 32 bits of the result, and he's only storing the low 32 bits.
The LLVM x86-64 ABI requires the top bits to be zero. GCC treats them as undefined. Until a recent clarification, the x86-64 psABI made the upper bits undefined by omission only, which is why I think most people followed the GCC interpretation.
https://github.com/llvm/llvm-project/issues/12579 https://groups.google.com/g/x86-64-abi/c/h7FFh30oS3s/m/Gksan... https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/61
GCC is the one defining the effective ABI here so LLVM was always buggy no matter what the spec said / didn't say.
Actually not, the ABI is a cross vendor initiative.
In theory. In practice the vast majority of Linux userland programs are compiled with GCC so unless GCC did something particularly braindead they are unlikely to break compatibility with that and so it's the ABI everyone needs to target. Which is also what happened in this case: The standard was updated to mandate the GCC behavior.
Ahhh! Thanks: that helps me understand where I picked up my misinformation!
There is something fun about using godbolt.org to say that Matt Godbolt is wrong.
(Almost) any instruction on x64 that writes to a 32-bit register as destination writes the lower 32 bits of the value into the lower 32 bits of the full 64-bit register and zeroes out the upper 32 bits of the full register. He touched on it in his previous note "why xor eax, eax".
But the funny thing is, the x64-specific supplement for SysV ABI doesn't actually specify whether the top bits should be zeroes or not (and so, if the compiler could rely on e.g. function returning ints to have upper 32 bits zeroes, or those could be garbage), and historically GCC and Clang diverged in their behaviour.
> However, in this case it doesn’t matter; those top bits are discarded when the result is written to the 32-bit eax.
Fun (but useless) fact: This being x86, of course there are at least three different ways [1] to encode this instruction: the way it was shown, with an address size override prefix (giving `lea eax, [edi+esi]`), or with both a REX prefix and an address size override prefix (giving `lea rax, [edi+esi]`).
And if you have a segment with base=0 around you can also add in a segment for fun: `lea rax, cs:[edi+esi]`
[1]: not counting redundant prefixes and different ModRMs
It's still wild to me that "Godbolt" is an actual surname.
Someone had a very talented archer as an ancestor.
Part of the Advent of Compiler Optimisations https://xania.org/AoCO2025
Loving it so far!
LEA is a beautiful example of instruction reuse. Designed for pointer arithmetic, repurposed for efficient addition. It's a reminder that good ISA design leaves room for creative optimization - and that compilers can find patterns human assembly programmers might miss.
Human assembly programmers on the 8086 used LEA all the fucking time. And I'm not sure good ISA design is characterized by the need for ingenious hacks to get the best mileage out of the hardware; rather the opposite, in my view. The ARM2's ISA design is head and shoulders better than the 8086's.
This trick is something we teach our students when we do 6809 assembly (mainly as a trick to do addition on the index registers). I had no idea it was used as an optimisation in x86.
> Yesterday we saw how compilers zero registers efficiently.
It took me several tries to understand that "zero" is a verb here.
Verbing weirds language.
Tell me about it...someone turned my name into a verb...
Zero-suffixing does weird language.
It might have been a little bit clearer to say:
Or better still, "zeroize": often I use "zeroize" rather than "zero" to avoid such confusion.
What are the current best resources to learn assembly, so that I can understand the output of simple functions? I don't want to learn to write it properly, I just want to be able to understand what's happening.
https://godbolt.org/
You can select the assembly output (I like RISCV but you can pick ARM, x86, mips, etc with your choice of compiler) and write your own simple functions. Then put the original function and the assembly output into an LLM prompt window and ask for a line-by-line explanation.
Also very useful to get a copy of Computer Organization and Design RISC-V Edition: The Hardware Software Interface, by Patterson and Hennessy.
Honestly, x86 is not nearly as CISC as those go. It just has somewhat developed addressing modes compared to the utterly anemic "register plus constant offset" one, and you are allowed to fold some load-arithmetic-store combinations into a single instruction. But that's it: no double- or triple-indexing or anything like what VAXen had.
And all it really takes to support this is just adding a second (smaller) ALU on your chip to do addressing calculations.
One of my biggest bugbears in CS instruction is the undue emphasis on RISC v CISC, especially as there aren't any really good models to show you what the differences are, given the winnowing of ISAs. In John Mashey's infamous posts [1] sort of delineating an ordered list from most RISCy to most CISCy, the architectures that have been the most successful are the ones that really crowded the RISC/CISC line--ARM and x86.
It also doesn't help that, since x86 is the main go-to example for CISC, people end up not having a strong grasp on what features of x86 make it actually CISC. A lot of people go straight to its prefix encoding structure or its ModR/M encoding structure, but honestly, the latter is pretty much just a "compressed encoding" of RISC-like semantics, and the former is far less insane than most people give it credit for. But x86 does have a few weird, decidedly-CISC instruction semantics in it--these are the string instructions like REP MOVSB. Honestly, take out about a dozen instructions, and you could make a solid argument that modern x86 is a RISC architecture!
[1] https://yarchive.net/comp/risc_definition.html
You may enjoy the RISC deprogrammer: https://blog.erratasec.com/2022/10/the-risc-deprogrammer.htm...
I fully agree, but:
> these are the string instructions like REP MOVSB
AArch64 nowadays has somewhat similar CPY* and SET* instructions. Does that make AArch64 CISC? :-) (Maybe REP SCASB/CMPSB/LODSB (the latter being particularly useless) is a better example.)
There's also a lot of specialized instructions like AES ones.
But the main thing that makes x86 CISC to me is not the actual instruction set, but the byte encoding, and the complexity there.
The classic distinction is that a CISC has data processing instructions with memory operands, and in a RISC they only take register parameters. This gets fuzzy though when you look at AArch64 atomic instructions like ldadd which do read-modify-write all in a single instruction.
That's more "load store architecture" than RISC. And by that measure, S/360 could be considered a RISC.
Eh, that's really just a side effect of almost 50 years of constant evolution from an 8-bit microprocessor. Take a look at the VAX [0], for instance: its instruction encoding is pretty clean, yet it's an actual example of a CISC ISA that was impossible to speed up. Literally: DEC engineers tried very hard and concluded that making a truly pipelined & super-scalar implementation was basically impossible, so DEC had to move to Alpha. See [1] for more from John Mashey.
Edit: the very, very compressed TL;DR is that if you do only one memory load (or one memory load + store back into this exact location) per instruction, it scales fine. But the moment you start doing chained loads, with pre- and post-increments which are supposed to write back changed values into the memory and be visible, and you have several memory sources, and your memory model is actually "strong consistency", well, you're in a world of pain.
[0] https://minnie.tuhs.org/CompArch/Resources/webext3.pdf
[1] https://yarchive.net/comp/vax.html
Would this matter for performance? You already have so many execution units that are actually difficult to keep fully fed even when decoding instructions and data at the speed of cache.
Yes. As Joker_vD hints in a sibling comment, this is what killed all the classic CISCs during the OoO transition, except for x86, which lacks the more complex addressing modes (and the PPro was still considered a marvel of engineering that was assumed not to be possible).
Do we really know that LEA is using the hardware memory address computation units? What if the CPU frontend just redirects it to the standard integer add units/execution ports? What if the hardware memory address units use those too?
It would be weird to have 2 sets of different adders.
The modern Intel/AMD CPUs have distinct ALUs (arithmetic-logic units, where additions and other integer operations are done; usually between 4 ALUs and 8 ALUs in recent CPUs) and AGUs (address generation units, where the complex addressing modes used in load/store/LEA are computed; usually 3 to 5 AGUs in recent CPUs).
Modern CPUs can execute up to between 6 and 10 instructions within a clock cycle, and up to between 3 and 5 of those may be load and store instructions.
So they have a set of execution units that allow the concurrent execution of a typical mix of instructions. Because a large fraction of the instructions generate load or store micro-operations, there are dedicated units for address computation, to not interfere with other concurrent operations.
https://news.ycombinator.com/item?id=23514072 and https://news.ycombinator.com/item?id=12354494 seem to contradict this and claim that modern intel processors don't use separate AGU for LEA...
Not too versed here, but given that ADD seems to have more execution ports to pick from (e.g. on Skylake), I'm not sure that's an argument in favor of lea. I'd guess that LEA not touching flags and consuming fewer uops (comparing a single simple LEA to 2 ADDs) might be better for out of order execution though (no dependencies, friendlier to reorder buffer)
But can the frontend direct these computations based on what's available? If it sees 10 LEA instructions in a row, and it has 5 AGU units, can it dispatch 5 of those LEA instructions to other ALUs?
Or is it guaranteed that a LEA instruction will always execute on an AGU, and an ADD instruction always on an ALU?
This can vary from CPU model to CPU model.
No recent Intel/AMD CPU executes LEA or other instructions directly; they are decoded into 1 or more micro-operations.
The LEA instructions are typically decoded into either 1 or 2 micro-operations. The addressing modes that add 3 components are usually decoded into 2 micro-operations, as are the obsolete 16-bit addressing modes.
The AGUs probably have some special forwarding paths for the results towards the load/store units, which do not exist in ALUs. So it is likely that 1 of the up to 2 LEA micro-operations is executed only in AGUs. On the other hand, when there are 2 micro-operations, it is likely that 1 of them can be executed in any ALU. It is also possible for the micro-operations generated by a LEA to be different from those of actual load/store instructions, so that they may also be executed in ALUs. This is decided by the CPU designer and it would not be surprising if LEAs are processed differently in various CPU models.
> It would be weird to have 2 sets of different adders.
Not really. CPUs often have limited address math available separately from the ALU. On simple cores, it looks like a separate incrementer for the Program Counter, on x86 you have a lot of addressing modes that need a little bit of math; having address units for these kinds of things allows more effective pipelining.
> Do we really know that LEA is using the hardware memory address computation units?
There are ways to confirm. You need an instruction stream that fully loads the ALUs, without fully loading dispatch/commit, so that ALU throughput is the limit on your loop; then if you add an LEA into that instruction stream, it shouldn't increase the cycle count because you're still bottlenecked on ALU throughput and the LEA does address math separately.
You might be able to determine if LEAs can be dispatched to the general purpose ALUs if your instruction stream is something like all LEAs... if the throughput is higher than what could be managed with only address units, it must also use ALUs. But you may end up bottlenecked on instruction commit rather than math.
The confusing thing about LEA is that the source operands are within a '[]' block which makes it look like a memory access.
I'd love to know why that is.
I think the calculation is also done during instruction decode rather than on the ALU, but I could be wrong about that.
It (LEA) does all the work of a memory access (the address computation part) without actually performing the memory access.
Instead of reading from memory at "computed address value" it returns "computed address value" to you to use elsewhere.
The intent was likely to compute the address values for MOVS/MOVSB/MOVSW/MOVSD/MOVSQ when setting up a REP MOVS (or other repeated string operation). But it turned out they were useful for doing three operand adds as well.
LEA is the equivalent of & in C. It gives you the address of something.
Fun question: what does the last line of this do?
MOV BP,12
LEA AX,[BP]
MOV BX,34
LEA AX,BX
I think OP was just making a comment on the asymmetry of the syntax. Brackets [] are usually used to dereference.
Why is this written `lea eax, [rdi + rsi]` instead of just `lea eax, rdi + rsi`?
It's due to the way the instruction is encoded. `lea` would've needed special treatment in syntax to remove the brackets.
In `op reg1, reg2`, the two registers are encoded as 3 bits each in the ModRM byte which follows the opcode. Obviously, we can't fit 3 registers in the ModRM byte because it's only 8 bits.
In `op reg1, [reg2 + reg3]`, reg1 is encoded in the ModRM byte. The 3 bits that were previously used for reg2 are instead `0b100`, which indicates a SIB byte follows the ModRM byte. The SIB (Scale-Index-Base) byte uses 3 bits each for reg2 and reg3 as the base and index registers.
In any other instruction, the SIB byte is used for addressing, so syntax of `lea` is consistent with the way it is encoded.
Encoding details of ModRM/SIB are in Volume 2, Section 2.1.5 of the ISA manual: https://www.intel.com/content/www/us/en/developer/articles/t...
When you encode an x86 instruction, your operands amount to either a register name, a memory operand, or an immediate (of several slightly different flavors). I'm no great connoisseur of ISAs, but I believe this basic trichotomy is fairly universal for ISAs. The operands of an LEA instruction are the destination register and a memory operand [1]. LEA happens to be the unique instruction where the memory operand is not dereferenced in some fashion in the course of execution; it doesn't make a lot of sense to create an entirely new syntax that works only for a single instruction.
[1] On a hardware level, the ModR/M encoding of most x86 instructions allows you to specify a register operand and either a memory or a register operand. The LEA instruction only allows a register and a memory operand to be specified; if you try to use a register and register operand, it is instead decoded as an illegal instruction.
> LEA happens to be the unique instruction where the memory operand is not dereferenced
Not quite unique: the now-deprecated Intel MPX instructions had similar semantics, e.g. BNDCU or BNDMK. BNDLDX/BNDSTX are even weirder as they don't compute the address as specified but treat the index part of the memory operand separately.
The way I rationalize it is that you're getting the address of something. A raw address isn't what you want the address of, so you're doing something like &(*(rdi+rsi)).
Yes, that’s what I meant
LEA stands for Load Effective Address, so the syntax is as-if you're doing a memory access, but you are just getting the calculated address, not reading or writing to that address.
LEA would normally be used for things like calculating address of an array element, or doing pointer math.
[dead]