The AArch64 Instruction Format for Code Generation in Compiler Design
A post on encoding the AArch64 ISA for a small compiler!
What’s up chat! Recently I have been learning some ARM64 instruction encoding for code generation in my own compiler backend, so I thought I should share the knowledge here. I went down this road because I wanted a backend that doesn’t have to rely on LLVM: for such a small project as a Brainfuck compiler, which needs only a fragment of the whole AArch64 ISA, pulling in LLVM is not really appropriate (overkill might be the word I’m looking for).
AArch64 Instruction Format Overview
Every ARM64 instruction is a 32-bit (4-byte) word, which makes them relatively easy to encode by hand, as opposed to CISC architectures in general. Despite the fixed-width encoding format, the instruction set is still diverse and flexible.
Of course, the exact bit layout varies depending on the instruction type. ARM64 instructions are organized into several major groups, each with its own encoding pattern: data processing with immediates; branches and exception generation; loads and stores; data processing with registers; and SIMD and floating-point operations.
You should keep the ARM developer documentation (the Arm Architecture Reference Manual) open to follow what I am talking about.
From IR to Machine Code
Typically when building an optimizing compiler, you have an intermediate representation that needs to get translated and optionally optimized into machine code. For ARM64, this means mapping your IR operations to one of the chosen instruction encodings. A simple pipeline example would be as follows:
source: x = y + z
IR: add %x, %y, %z
ARM64: 0x8B020020 (add x0, x1, x2)
Now, let’s analyze some real encoding functions. The first encoder we are going to look at is the ADD instruction in register form.
Encoding Basic Arithmetic
The ADD instructions, both the register and the immediate form, are fairly simple, and the two differ only in a few opcode bits. Let’s look at the register version first.
uint32_t encode_add_reg(const int rd, const int rn, const int rm, const uint32_t shift_type,
const uint32_t shift_amount, const bool is64bit, const bool setflags)
{
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
constexpr uint32_t op = 0; /* ADD operation (0) */
const uint32_t S = setflags ? 1 : 0; /* set flags if true */
/* shift type: 00=LSL, 01=LSR, 10=ASR, 11=Reserved */
/* shift amount: 6 bits (0-63) */
if (is64bit == false && shift_amount >= 32)
/* for 32-bit registers, shift amount must be < 32 */
return 0; /* error condition */
return (sf << 31) | /* size flag */
(op << 30) | /* operation: 0 for ADD, 1 for SUB */
(S << 29) | /* set flags */
(0b01011 << 24) | /* opcode fixed pattern */
(shift_type << 22) | /* shift type */
(0 << 21) | /* reserved bit */
(rm << 16) | /* second source register */
((shift_amount & 0x3F) << 10) | /* shift amount (6 bits) */
(rn << 5) | /* first source register */
rd; /* destination register */
}
This function packs the instruction into a single 32-bit word. What I love about it is how each bit field has a specific meaning. The MSB (bit 31) is the size flag, which determines whether we’re operating on 64-bit or 32-bit registers. Bits 30 and 29 define the operation (ADD vs SUB) and whether to update the condition flags.
The middle bits specify the operation type and operands, while the final 5 bits identify the destination register. Notice how we check if the shift amount is valid for 32-bit registers; these details can be subtle and are easy to miss when you’re just using an assembler.
Next are the immediate operations, i.e. those taking a constant. We use slightly different encoding bits.
uint32_t encode_add_imm(const int rd, const int rn, const int imm12, const bool is64bit, const bool setflags)
{
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
const uint32_t op = 0; /* ADD operation (0) */
const uint32_t S = setflags ? 1 : 0; /* set flags (ADDS) if true */
return (sf << 31) | /* size flag */
(op << 30) | /* operation: 0 for ADD, 1 for SUB */
(S << 29) | /* set flags */
(0b10001 << 24) | /* opcode fixed pattern */
(0 << 22) | /* sh flag: 1 would shift imm12 left by 12 */
((imm12 & 0xFFF) << 10) | /* 12-bit immediate */
(rn << 5) | /* first source register */
rd; /* destination register */
}
As you may notice, the only difference is the opcode in bits 28-24: 0b01011 for register operations but 0b10001 for immediate ones. Also, instead of a second source register and shift, we have a 12-bit immediate value.
SUB operations are similar as well, with the delta being the op bit. You will keep noticing this kind of pattern throughout this post; the encodings are quite memorable.
Memory Operations
We all know memory access, whether to a stack variable or through a pointer, is fundamental to every processor known to humans. ARM64 has a rich set of addressing modes in this category; however, we will cover only two encodings: load and store with an immediate offset.
uint32_t encode_ldr_imm(const int rt, const int rn, const int imm12, const uint32_t size)
{
/* size: 0=byte, 1=halfword, 2=word, 3=doubleword */
uint32_t opc = 1; /* LDR=1 for the opc field */
/* imm12 is the byte offset already divided by the data size; recover
   the byte offset it represents (byte=1, halfword=2, etc.) */
uint32_t scaled_imm = imm12 << size;
/* check for valid immediate range (after scaling) */
uint32_t max_imm = 0xFFF << size;
if (scaled_imm > max_imm)
return 0; /* error: immediate too large after scaling */
return (size << 30) | /* size field */
(0b111001 << 24) | /* opcode fixed pattern */
(opc << 22) | /* operation: 1 for LDR */
(imm12 << 10) | /* 12-bit immediate */
(rn << 5) | /* base register */
rt; /* target register */
}
What’s interesting here is how the immediate offset gets scaled based on the data size. If you’re loading a byte (size=0), the offset isn’t scaled. For halfwords (size=1), it’s multiplied by 2, for words by 4, and for doublewords by 8. The compiler can encode larger offsets for larger data types without any extra bits needed!
Store operations are also similar:
uint32_t encode_str_imm(const int rt, const int rn, const int imm12, const uint32_t size)
{
/* size: 0=byte, 1=halfword, 2=word, 3=doubleword */
constexpr uint32_t opc = 0; /* STR=0 for the opc field */
/* imm12 is the byte offset already divided by the data size; recover
   the byte offset it represents (byte=1, halfword=2, etc.) */
const uint32_t scaled_imm = imm12 << size;
/* check for valid immediate range (after scaling) */
if (const uint32_t max_imm = 0xFFF << size;
scaled_imm > max_imm)
return 0; /* error: immediate too large after scaling */
return (size << 30) | /* size field */
(0b111001 << 24) | /* opcode fixed pattern */
(opc << 22) | /* operation: 0 for STR */
(imm12 << 10) | /* 12-bit immediate */
(rn << 5) | /* base register */
rt; /* source register */
}
Control Flow
ARM64 provides several branch instructions, but the most commonly seen and used is BL, which performs function calls.
uint32_t encode_bl(const int64_t offset)
{
/* BL has a 26-bit signed immediate offset field, scaled by 4 */
/* range is ±128MB (±2²⁷ bytes) */
if (offset < -0x8000000 || offset > 0x7FFFFFF || (offset & 3) != 0)
return 0; /* error: offset out of range or not aligned to 4 bytes */
constexpr uint32_t op = 1; /* BL=1 */
const uint32_t imm26 = (offset >> 2) & 0x3FFFFFF;
return (op << 31) | /* op bit: 1 for BL */
(0b00101 << 26) | /* opcode fixed pattern */
imm26; /* 26-bit immediate offset divided by 4 */
}
The BL instruction jumps to a target address and stores the return address in the link register X30. The 26-bit immediate gives a range of ±128MB, which is plenty for most programs. Notice how the offset is divided by 4 (the >> 2, for those who didn’t notice) before encoding, since all instructions are 4 bytes and branch targets must be word-aligned.
For conditional execution, ARM64 provides Conditional Select, CSEL. CSEL is basically a ternary operator: it chooses one of two registers based on a condition code. It is way more efficient than a branchy conditional assignment for sure; the only constraint is that the condition must be one of the 16 standard condition codes.
uint32_t encode_csel(const int rd, const int rn, const int rm, const uint32_t cond, const bool is64bit)
{
/* CSEL - conditional select */
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
constexpr uint32_t op = 0; /* op = 0 for CSEL */
constexpr uint32_t o2 = 0; /* o2 = 0 for CSEL */
if (cond > 15) /* condition code range check */
return 0;
return (sf << 31) | /* size flag */
(op << 30) | /* operation type */
(0b011010100 << 21) | /* fixed pattern */
(rm << 16) | /* second source register */
(cond << 12) | /* condition code */
(0 << 11) | /* fixed bit */
(o2 << 10) | /* operation variant */
(rn << 5) | /* first source register */
rd; /* destination register */
}
Moving Data Around
uint32_t encode_mov_reg(const int rd, const int rm, const bool is64bit)
{
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
constexpr uint32_t opc = 1; /* ORR=1 */
constexpr uint32_t N = 0; /* no inversion */
constexpr uint32_t shift_type = 0; /* LSL */
constexpr uint32_t shift_amount = 0; /* no shift */
constexpr uint32_t rn = 31; /* XZR/WZR (register 31) */
return (sf << 31) | /* size flag */
(opc << 29) | /* operation: 1 for ORR */
(0b01010 << 24) | /* opcode fixed pattern */
(shift_type << 22) | /* shift type (LSL) */
(N << 21) | /* N bit (0) */
(rm << 16) | /* source register */
(shift_amount << 10) | /* shift amount (0) */
(rn << 5) | /* XZR/WZR as first source */
rd; /* destination register */
}
This is self-explanatory (do email me if you think an elaboration is needed!). A fun fact about this encoding: ARM64 processors have no dedicated mov register instruction like x86; a register move is just a bitwise OR with the zero register, XZR/WZR, as the first source.
For loading immediate values, we have MOVZ which is just:
uint32_t encode_movz(const int rd, const uint16_t imm16, const uint32_t shift, const bool is64bit)
{
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
constexpr uint32_t opc = 2; /* MOVZ=2 */
/* conv shift (0, 16, 32, 48) to hw field (0, 1, 2, 3) */
const uint32_t hw = shift / 16;
/* 32-bit registers can only use shifts 0 and 16 (hw=0,1) */
if (!is64bit && hw > 1)
return 0; /* error */
return (sf << 31) | /* size flag */
(opc << 29) | /* operation: 2 for MOVZ */
(0b100101 << 23) | /* opcode fixed pattern */
(hw << 21) | /* hw field (shift/16) */
((imm16 & 0xFFFF) << 5) | /* 16-bit immediate */
rd; /* destination register */
}
MOVZ is part of a trio of instructions, along with MOVK and MOVN, that allow loading arbitrary 64-bit constants. MOVZ loads a 16-bit value and zeros the rest, while MOVK keeps the other bits unchanged, and MOVN loads the bitwise NOT of the immediate. By combining these with different shifts, you can construct any 64-bit value in at most four instructions.
Bitwise Operations
Just like the above, bitwise operations are essential to CPUs. Let’s look at the AND register form:
uint32_t encode_and_reg(const int rd, const int rn, int rm, const uint32_t shift_type, const uint32_t shift_amount, const bool is64bit, const bool setflags)
{
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
const uint32_t opc = setflags ? 3 : 0; /* AND=0, ANDS=3 */
constexpr uint32_t N = 0; /* invert rm (0 for no inversion) */
/* shift type: 00=LSL, 01=LSR, 10=ASR, 11=ROR */
if (is64bit == false && shift_amount >= 32)
/* for 32-bit registers, shift amount must be < 32 */
return 0; /* error condition */
return (sf << 31) | /* size flag */
(opc << 29) | /* operation: 0 for AND, 3 for ANDS */
(0b01010 << 24) | /* opcode fixed pattern */
(shift_type << 22) | /* shift type */
(N << 21) | /* N bit (0 for AND, 1 for BIC) */
(rm << 16) | /* second source register */
((shift_amount & 0x3F) << 10) | /* shift amount (6 bits) */
(rn << 5) | /* first source register */
rd; /* destination register */
}
Note how the N bit can be used to invert the second source operand, effectively turning AND into BIC (bit clear) or ORR into ORN (or not). Yet another elegant design choice that reduces the number of opcode patterns needed.
Multiplications
The MADD instruction performs a multiply-add operation, D = N * M + A. MUL is actually just MADD with the addend set to the zero register, just like our earlier MOV case.
uint32_t encode_madd(const int rd, const int rn, const int rm, const int ra, const bool is64bit)
{
/* MADD - Multiply-Add: Rd = Rn * Rm + Ra */
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
constexpr uint32_t o0 = 0; /* o0 = 0 for MADD operation */
return (sf << 31) | /* size flag */
(0 << 30) | /* fixed bit */
(0b011011 << 24) | /* fixed pattern */
(0 << 23) | /* fixed bit for non-extended ops */
(rm << 16) | /* second source register (multiplier) */
(o0 << 15) | /* operation type: 0 for MADD */
(ra << 10) | /* third source register (addend) */
(rn << 5) | /* first source register (multiplicand) */
rd; /* destination register */
}
Shifts
uint32_t encode_lsrv(const int rd, const int rn, const int rm, const bool is64bit)
{
const uint32_t sf = is64bit ? 1 : 0; /* 64-bit (1) or 32-bit (0) */
constexpr uint32_t op2 = 1; /* LSR=1 */
return (sf << 31) | /* size flag */
(0b0011010110 << 21) | /* opcode fixed pattern */
(rm << 16) | /* shift amount register */
(0b0010 << 12) | /* fixed pattern */
(op2 << 10) | /* operation: 1 for LSR */
(rn << 5) | /* source register */
rd; /* destination register */
}
This instruction shifts the value in register Rn right by the amount held in Rm, using only the low 5 bits of Rm for 32-bit registers or the low 6 bits for 64-bit registers.
SIMD and Floating Point Operations
It can be argued that this part is the final boss of instruction encoding, regardless of architecture, as these operations are extremely complex. Anyway, enough ranting; let’s look at how FCMP is encoded.
uint32_t encode_fcmp(const int rn, const int rm, const uint32_t size)
{
uint32_t ftype;
/* convert size (0=16-bit, 1=32-bit, 2=64-bit) to ftype field */
switch (size)
{
case FP16:
ftype = 3;
break;
case FP32:
ftype = 0;
break;
case FP64:
ftype = 1;
break;
default:
return 0;
}
uint32_t opc = 0; /* opcode for FCMP */
return (0b00011110 << 24) | /* opcode fixed pattern */
(ftype << 22) | /* floating-point type */
(1 << 21) | /* fixed bit */
(rm << 16) | /* second source register */
(0b001000 << 10) | /* fixed pattern for FCMP */
(rn << 5) | /* first source register */
(opc << 3) | /* operation variant */
(0b000); /* fixed bits */
}
The ftype field specifies whether we’re working with 16-bit (half precision), 32-bit (single precision), or 64-bit (double precision) floating-point values.
For SIMD operations, things get even more complex. I’ll drop just a single example because it’s unwieldy compared to the other functions demonstrated so far. This is MOVI, which loads an immediate value into a vector register.
uint32_t encode_movi(int rd, uint64_t imm, uint32_t arrangement)
{
uint32_t Q = 0; /* Q=0 for 8B/4H/2S, Q=1 for 16B/8H/4S/2D */
uint32_t op = 0; /* op=0 for MOVI, op=1 for MVNI */
uint32_t cmode, imm_bits;
/* determine Q bit based on arrangement */
switch (arrangement)
{
case SIMD_8B: /* 8B (8 bytes) */
Q = 0;
break;
case SIMD_16B: /* 16B (16 bytes) */
Q = 1;
break;
/* truncated */
default:
return 0; /* error: invalid arrangement */
}
/* determine cmode and encode the immediate */
if (arrangement <= SIMD_16B)
{
/* 8B/16B: 8-bit immediate */
cmode = 0xE;
imm_bits = imm & 0xFF;
}
else if (arrangement <= SIMD_8H)
{
/* truncated */
}
else { /* truncated */ }
/* encode the actual instruction */
return (Q << 30) | /* Q bit for arrangement */
(op << 29) | /* Operation: 0 for MOVI */
(0b0111100000 << 19) | /* opcode fixed pattern */
(((imm_bits >> 5) & 0x7) << 16) | /* bits 7:5 of immediate to immh */
(cmode << 12) | /* cmode field */
(0 << 11) | /* o2=0 for integer variant */
(1 << 10) | /* fixed bit */
((imm_bits & 0x1F) << 5) | /* bits 4:0 of immediate to imml */
rd; /* destination vector register */
}
Each vector arrangement can require an entirely different encoding of the immediate depending on the immediate’s properties, e.g. its alignment and size.
System Calls
Finally, a simple one: an SVC call. It transfers control from EL0 to EL1, or from user mode to kernel mode in simpler terms.
uint32_t encode_svc(uint32_t imm16)
{
/* SVC (supervisor call) - triggers exception to switch to EL1 */
if (imm16 > 0xFFFF)
return 0; /* error: immediate out of range */
return (0b11010100000 << 21) | /* opcode fixed pattern */
(imm16 << 5) | /* 16-bit immediate */
(0b00001); /* LL=01 for SVC */
}
Atomics
The last encoding function of today: CAS! CAS is used in lock-free programming and thread synchronization. As you can see, the acquire and release flags control the memory-ordering semantics. Also note that your compiler may have to emit these ordering variants explicitly, as the ARM memory model is weakly ordered, unlike AMD64.
uint32_t encode_cas(const int rs, const int rt, const int rn, const uint32_t size, const bool acquire, const bool release)
{
/* size: 2=word, 3=doubleword only - byte & halfword not supported */
/* see: https://developer.arm.com/documentation/ddi0602/2024-12/Base-Instructions/CAS--CASA--CASAL--CASL--Compare-and-swap-word-or-doubleword-in-memory-?lang=en */
if (size != 2 && size != 3)
return 0;
const uint32_t L = acquire ? 1 : 0; /* L=1 for acquire semantics */
const uint32_t o0 = release ? 1 : 0; /* o0=1 for release semantics */
return (size << 30) | /* size field */
(0b0010001 << 23) | /* opcode fixed pattern */
(L << 22) | /* L bit for acquire */
(1 << 21) | /* o1 bit, fixed to 1 */
(rs << 16) | /* Rs register (expected value) */
(o0 << 15) | /* o0 bit for release semantics */
(0b11111 << 10) | /* Rt2=11111 for CAS */
(rn << 5) | /* Rn register (memory address) */
rt; /* Rt register (new/result value) */
}
Wrap Up
Encoding ARM64 instructions is less daunting than we thought, right? The patterns are memorable enough that we can vividly recall how they are encoded: the MSB is the size flag, and so on. For a small compiler project where you only need a subset of the instruction set, writing your own encoder functions is much more straightforward than integrating a complex framework like LLVM. Plus, it’s a fantastic way to understand this one specific CPU architecture.
In future posts, I may cover another architecture instead, or yap about the differences between code generation algorithms. Please do give feedback by either email or issues. Until then, happy encoding!