Wasm binary encoding
Lately I have been working on my own compiler (for fun & glory) targeting WebAssembly. The inner workings of compilers are well documented and there are plenty of production-grade examples to look at (for instance this book and the rustc dev guide), but when it comes down to writing bytes one after the other, the only thing you can hold onto is the specification.
Don’t get me wrong, the specification is great, but it may be a little rough at first. I spent quite some time wrapping my head around branching, number encoding, types and sections, so I feel like I should write that down.
And here we are! In this post I will walk you through the bytes of a simple wasm module, one by one, and by the end you’ll be able to tell what any of those do.
The wasm stack machine
Let’s start with the basics: what is WebAssembly (wasm) exactly?
WebAssembly is a binary instruction format. Viewed through your favorite hex dumper, a module looks like this:
00000000: 0061 736d 0100 0000 0107 0160 027f 7f01 .asm.......`....
00000010: 7f03 0201 0007 0701 0361 6464 0000 0a09 .........add....
00000020: 0107 0020 0020 016a 0b ... . .j.
But you may be more familiar with its textual form (wat, standing for WebAssembly Text):
(module
(func $add (param $lhs i32) (param $rhs i32) (result i32)
get_local $lhs
get_local $rhs
i32.add)
(export "add" (func $add))
)
There are tools to go from one to the other. If you are planning to get your hands dirty with wasm I strongly recommend investing in such tooling: it will save you many hours of debugging.
Those instructions target a stack-based virtual machine following the WebAssembly specification. There are plenty of such VMs out there: the reference interpreter, Wasmtime or the one in your browser, to name a few.
The ‘stack based’ part means that the VM does its calculations on a stack. To compute 3 + 5 it first pushes 3 onto the top of the stack, then 5, and finally uses the add instruction, which consumes the topmost two values and pushes the result back on the stack.
i32.const 3
i32.const 5
i32.add
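To make the push/pop dance concrete, here is a toy interpreter sketched in Python. It is only an illustration: the `run` function and the tuple-based program format are made up for this example, and it knows nothing beyond `i32.const` and `i32.add`.

```python
# Toy stack machine: just enough to evaluate 3 + 5.
# `run` and the tuple format are hypothetical, not part of any wasm tooling.
def run(program):
    stack = []
    for op, *args in program:
        if op == "i32.const":
            stack.append(args[0])          # push the constant
        elif op == "i32.add":
            rhs = stack.pop()              # topmost value
            lhs = stack.pop()              # value just below it
            # i32 arithmetic wraps around modulo 2**32
            stack.append((lhs + rhs) & 0xFFFFFFFF)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack


result = run([("i32.const", 3), ("i32.const", 5), ("i32.add",)])
print(result)  # [8]
```

Note that `i32.add` carries no operands of its own: everything it needs is already on the stack.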
There is another common family of VMs called register-based which, as you might guess, operate on registers. With such an instruction set, adding 3 and 5 takes a single instruction, provided the values are already sitting in registers. It may look like this:
(; store the result of register_a + register_b in register_c ;)
i32.add register_c register_a register_b
As you can see, register-based machines tend to use fewer instructions overall. They also tend to be faster to interpret because they require fewer instruction dispatches, but the code size tends to be larger (here is a very interesting comparison if you are interested).
But back to WebAssembly: targeting a stack machine is a good thing for us compiler writers, because it’s generally easier. We don’t have to bother with registers at all.
The module structure
Wasm code is separated into modules. Each module contains a number of sections, which in turn have their own layout.
Before digging into the details I have to warn you that we are going to write quite a few bytes by hand, so from now on I may omit the 0x part before hexadecimal numbers: consider that every number that follows is written in hexadecimal.
A module always starts with the magic number 00 61 73 6d followed by the version number, 1 at the time of writing, stored as the four bytes 01 00 00 00 (a 32-bit integer in little-endian). You may have noticed that the magic number corresponds to the string \0asm: you can spot it in the hexdump I showed at the beginning.
So let’s write our very first wasm module (in bytes of course):
00 61 73 6d 01 00 00 00
It’s not very useful for now, we will need to add a few sections to actually do something with our module.
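If you would rather let a script do the typing, the eight header bytes can be produced with a few lines of Python. This is just a sketch; the names `MAGIC`, `VERSION` and `header` are mine, not the spec’s.

```python
import struct

MAGIC = b"\0asm"                # 00 61 73 6d
VERSION = struct.pack("<I", 1)  # 01 00 00 00: a fixed-width little-endian u32

header = MAGIC + VERSION
print(header.hex(" "))  # 00 61 73 6d 01 00 00 00
```

The version is a plain four-byte field, unlike most other integers in the format, which use a variable-length encoding.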
At the time of writing there are 11 predefined sections:
Section name code description
Type 0x01 # Function signature definitions
Import 0x02 # Import declarations
Function 0x03 # Function declarations
Table 0x04 # Tables used by call_indirect
Memory 0x05 # Memory attributes
Global 0x06 # Global declarations
Export 0x07 # Exports declaration
Start 0x08 # Start function declaration, if any
Element 0x09 # Elements declaration
Code 0x0a # Function code
Data 0x0b # Any type of data
That is a lot. I will just walk you through a few of them; you can learn more by reading the spec (or reading other posts ¯\_(ツ)_/¯).
Writing a wasm function
Let’s say we want to put an ‘add’ function in our module. We will need three sections:
- Type: we need to register the type of our function, in this case let’s say (i32, i32) -> i32.
- Function: once the type is declared, we need to declare a function with that type.
- Code: finally we put the body of the function in the Code section.
This may seem like a lot of work for defining a simple function, but it can be explained by the goals of WebAssembly:
- Define a portable, size- and load-time-efficient binary format to serve as a compilation target.
By defining all the types in one place we can avoid duplication: if you have a thousand functions with type i32 -> i32 you just need to declare it once and refer to it in multiple function declarations, saving precious bytes of bandwidth. Similarly, a note in the specification justifies the separation of Function and Code as a way to enable parallel and streaming compilation.
A section is encoded in three parts:
- One byte for its id, e.g. 0x01 for Type and so on.
- Its size (the number of bytes in the content) encoded as a u32, more on that later.
- The content, which depends on the section.
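The three parts above translate almost word for word into code. Here is a minimal Python sketch, assuming the content is shorter than 128 bytes so that its size fits in a single byte (the general integer encoding is discussed further down). `encode_section` is a made-up helper name, and the content bytes are placeholders rather than a meaningful section.

```python
def encode_section(section_id, content):
    # Assumption: sizes below 128 only, so the u32 size is a single byte.
    if len(content) >= 0x80:
        raise ValueError("this sketch only handles contents shorter than 128 bytes")
    # id (1 byte) + size of the content + the content itself
    return bytes([section_id, len(content)]) + content


# Placeholder content, just to show the layout.
example = encode_section(0x01, bytes([0xAA, 0xBB, 0xCC]))
print(example.hex(" "))  # 01 03 aa bb cc
```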
Declaring a type
The Type section is rather simple: it starts with the number of types (a u32) and then simply encodes each type one after the other. At this point we have no choice but to grab the specification (unfortunately I’m not yet sponsored to promote the spec…).
Have a look at the “Binary Format” chapter: there we learn that the value type i32 is encoded by 0x7f. The function type is a little more complicated: it starts with 0x60 followed by a vector of parameter types and another vector of return types. Vectors are simply encoded as their size followed by the actual elements.
A word on integers
Now it’s time to talk about integer encoding in wasm, because we need it to encode the sizes of vectors and of the sections themselves. WebAssembly is meant to be compact and thus uses a compressed representation for integers, more precisely the LEB128 encoding (standing for Little Endian Base 128).
The little endian part means that the least significant bits go first, while the base 128 part is because we actually only use 7 bits out of 8 in each byte: the eighth bit is 1 if there are still non-zero bits to come, 0 otherwise. Thanks to the eighth bit there is no need to add leading zeroes (well, actually trailing zeroes in little endian…). It saves precious bytes of bandwidth.
Actually we will never use integers larger than 127 in this post, which is why every integer takes a single byte in what follows.
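For the curious, here is what the unsigned flavour of LEB128 could look like in Python. It is a sketch rather than a reference implementation, and `encode_u32` is simply a name I picked.

```python
def encode_u32(value):
    """Unsigned LEB128 encoding of an integer that fits in 32 bits."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in a u32")
    out = bytearray()
    while True:
        low7 = value & 0x7F   # take the 7 least significant bits
        value >>= 7
        if value:             # non-zero bits remain: set the eighth bit
            out.append(low7 | 0x80)
        else:                 # last byte: the eighth bit stays 0
            out.append(low7)
            return bytes(out)


for n in (7, 127, 128, 624485):
    print(n, encode_u32(n).hex(" "))
# 7 07
# 127 7f
# 128 80 01
# 624485 e5 8e 26
```

Everything up to 127 fits in one byte, and 128 is the first value that spills into a second one.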
Now we know all we need to encode the type section: let’s put everything together.
To encode the type (i32, i32) -> i32 we need the 0x60 marker, a vector of size 2 containing i32 twice for the arguments, and a vector of size 1 for the return type:
60 02 7f 7f 01 7f
There is only one type in our Type section, so we prefix it with the count 01:
01 60 02 7f 7f 01 7f
Finally, we can compute its size, 7 bytes, and add the section ID 0x01:
01 07 01 60 02 7f 7f 01 7f
And we are done with our first section 🎉
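The same Type section can be rebuilt in Python as a cross-check. This is only a sketch: `vec` and `func_type` are hypothetical helpers, and every length is assumed to stay below 128 so that it fits in one byte.

```python
I32 = 0x7F        # value type i32
FUNC_TYPE = 0x60  # marker that starts a function type


def vec(items):
    # A vector is its length followed by its elements.
    # Assumption: fewer than 128 elements, so the length is a single byte.
    assert len(items) < 0x80
    return bytes([len(items)]) + b"".join(items)


def func_type(params, results):
    return (
        bytes([FUNC_TYPE])
        + vec([bytes([p]) for p in params])
        + vec([bytes([r]) for r in results])
    )


add_type = func_type([I32, I32], [I32])
content = vec([add_type])  # a single type in the section
type_section = bytes([0x01, len(content)]) + content

print(add_type.hex(" "))      # 60 02 7f 7f 01 7f
print(type_section.hex(" "))  # 01 07 01 60 02 7f 7f 01 7f
```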
Registering a function
Now we need to register a function, are you ready? This is going to be very fast:
We have a single function, and it has the type with index 0 of the Type section we just wrote:
01 00
The Function section ID is 0x03 and its size is 2:
03 02 01 00
And boom! The Function section is done!
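In Python the Function section is barely longer than its hex form. A sketch, with `type_indices` as a made-up name for the list of type indices (one per function) and counts assumed to be below 128:

```python
type_indices = [0]  # our only function uses type 0

# A vector of type indices: the count, then one index per function.
content = bytes([len(type_indices)]) + bytes(type_indices)
function_section = bytes([0x03, len(content)]) + content

print(function_section.hex(" "))  # 03 02 01 00
```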
Adding the function body
This is the part where we actually encode the body of our function, inside the Code section.
func $add (param $lhs i32) (param $rhs i32) (result i32)
get_local $lhs
get_local $rhs
i32.add
There are only three instructions to encode: two get_local and an i32.add. We need to check the spec here: it says that get_local is encoded by the byte 0x20 followed by its argument, that is the index of the function argument we want to get back. So get_local $first_argument is encoded 20 00 while get_local $second_argument is 20 01.
The instruction i32.add doesn’t take any argument (because WebAssembly is a stack machine): it just removes the top two values of the stack and puts back the sum. It is encoded by the byte 0x6a.
The spec also says that the function body must end with a special end instruction, encoded by 0x0b.
So that’s it, the instructions of our add function are encoded as:
20 00 20 01 6a 0b
We also need to specify a vector of locals, but we don’t use any here, so it’s an empty vector encoded by its size (0) with no elements, in other words the single byte 0x00:
00 20 00 20 01 6a 0b
The body of the first (and only) function is 7 bytes long:
07 00 20 00 20 01 6a 0b
The section content is a vector of function bodies, so we prefix it with the count 01. We are then done with the content of the section: it is 9 bytes long and the section has an ID of 0x0a, thus the complete section is:
0a 09 01 07 00 20 00 20 01 6a 0b
In case you are wondering, I’m not guessing what the section should look like, I just followed the spec once again: you can find the details in the subsection Module of the chapter Binary Format.
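Here is the same construction as a Python sketch, layer by layer. The constant names are mine (get_local is spelled local.get in current versions of the text format, with the same opcode), and all sizes are assumed to fit in one byte.

```python
GET_LOCAL = 0x20  # followed by the index of the local to read
I32_ADD = 0x6A
END = 0x0B

instructions = bytes([GET_LOCAL, 0, GET_LOCAL, 1, I32_ADD, END])
locals_vec = bytes([0])                       # empty vector of locals
body = locals_vec + instructions              # 7 bytes
entry = bytes([len(body)]) + body             # body prefixed by its size
content = bytes([1]) + entry                  # one function body, 9 bytes
code_section = bytes([0x0A, len(content)]) + content

print(len(body), len(content))  # 7 9
print(code_section.hex(" "))    # 0a 09 01 07 00 20 00 20 01 6a 0b
```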
Bonus: exporting a function
At this point we are done: you can put those three sections together and you have a working add function. The function is not exported, however, so you have no way to check if it works… Actually we just need one extra section to play with our handwritten wasm: the Export section, which as you may guess is responsible for telling which function is exported, and under which name.
To declare an export we need three things:
- A name.
- The kind of thing to be exported (it is not limited to function).
- Its index into the corresponding section.
In our case the name is add, which corresponds to the bytes 61 64 64, plus the size of the name, that gives us 03 61 64 64.
We export a function, so according to the spec we should prefix its index by the kind 00, and because the function index is also zero, that gives us:
03 61 64 64 00 00
We have a single export, thus we have to prefix the section body by 01, which gives us a total size of 7 bytes. Because the section ID is 0x07, the whole export section is encoded by:
07 07 01 03 61 64 64 00 00
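The Export section can be rebuilt the same way. Again a Python sketch with made-up names, assuming the name and the section both stay under 128 bytes:

```python
EXPORT_KIND_FUNCTION = 0x00

name = "add".encode("utf-8")                  # 61 64 64
export = (
    bytes([len(name)]) + name                 # the name, prefixed by its size
    + bytes([EXPORT_KIND_FUNCTION, 0])        # kind, then the function index
)
content = bytes([1]) + export                 # a single export, 7 bytes
export_section = bytes([0x07, len(content)]) + content

print(export_section.hex(" "))  # 07 07 01 03 61 64 64 00 00
```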
Putting everything together
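And that’s it! If we put the sections back to back after the wasm module header, we get a valid and (kind of) usable wasm module! One caveat: sections must appear in increasing ID order, so the Export section (07) goes before the Code section (0a):
00 61 73 6d 01 00 00 00
01 07 01 60 02 7f 7f 01 7f
03 02 01 00
07 07 01 03 61 64 64 00 00
0a 09 01 07 00 20 00 20 01 6a 0b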
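As a last sanity check, this Python sketch glues the pieces together and compares the result with the hexdump from the beginning of the post. The variable names are mine; the bytes are the ones derived above.

```python
header = bytes.fromhex("00 61 73 6d 01 00 00 00")
type_section = bytes.fromhex("01 07 01 60 02 7f 7f 01 7f")
function_section = bytes.fromhex("03 02 01 00")
export_section = bytes.fromhex("07 07 01 03 61 64 64 00 00")
code_section = bytes.fromhex("0a 09 01 07 00 20 00 20 01 6a 0b")

# Sections go in increasing ID order: Type, Function, Export, Code.
module = header + type_section + function_section + export_section + code_section

# The bytes shown by the hex dumper at the top of the post.
hexdump = bytes.fromhex(
    "0061 736d 0100 0000 0107 0160 027f 7f01"
    "7f03 0201 0007 0701 0361 6464 0000 0a09"
    "0107 0020 0020 016a 0b"
)

print(len(module))        # 41
print(module == hexdump)  # True
```

From there, writing `module` to a file (say `add.wasm`, a name picked for illustration) gives you something you can hand to the wasm runtime of your choice.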
I hope you enjoyed this post and learned a little more about wasm!