Jonathan Petitcolas

Full-Stack Web Developer, Open-Source Contributor, Seasoned Speaker

Parsing binary files in Go

Published on 25 September 2014

I am currently working on a Go library to parse Starcraft2 replay files. These files are stored in a binary format (called MoPaQ), in which each sequence of bytes carries a specific piece of information. For instance, the four bytes at offset 8 represent the header offset, which is 1024. It means that the real game data starts at byte 1024.

I won’t cover the whole MoPaQ parsing in this post. I will probably write a dedicated article about it once I get satisfactory results. Instead, here is an elegant way Go offers to parse binary files. To follow along, you can download one of my replay files.

Browsing the Internet gave me the following (truncated) data structure for the MoPaQ header:

Attribute			Location			Hexadecimal value	Meaning
-----------------------------------------------------------------------------------------------
Format				0x0000 -> 0x0003	4D 50 51 1B			MPQ\x1b (format name)
UserDataMaxSize		0x0004 -> 0x0007	00 02 00 00			512
HeaderOffset		0x0008 -> 0x000B	00 04 00 00			1024
UserDataSize		0x000C -> 0x000F	3C 00 00 00			60
(unknown)			0x0010 -> 0x0014	05 08 00 02 2C		?
Starcraft2			0x0015 -> 0x002A	53 74 61 [...]		"Starcraft II Replay 11" in ASCII

So, let’s start by loading our file:

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	path := "data/replay.SC2Replay"

	file, err := os.Open(path)
	if err != nil {
		log.Fatal("Error while opening file: ", err)
	}

	defer file.Close()

	fmt.Printf("%s opened\n", path)
}

This code is pretty straightforward: we use the os package to open the file, and make sure we did not encounter any error in the process. The main point here is the defer statement: it ensures the file.Close() function is called as soon as we return from the current function, whether it fails or succeeds.
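To make the behaviour of defer more concrete, here is a minimal sketch. The process function and its trace parameter are hypothetical names introduced for illustration only: they stand in for any function that acquires a resource and has to release it on every return path.

```go
package main

import (
	"errors"
	"fmt"
)

// process is a hypothetical function: it "opens" a resource, defers its
// release, then returns either normally or with an error. Each step is
// appended to trace so the order of events can be inspected afterwards.
func process(fail bool, trace *[]string) error {
	*trace = append(*trace, "open")
	// The deferred function runs when process returns, on every return path.
	defer func() { *trace = append(*trace, "close") }()

	if fail {
		*trace = append(*trace, "error")
		return errors.New("something went wrong")
	}
	*trace = append(*trace, "work")
	return nil
}

func main() {
	var trace []string
	_ = process(false, &trace)
	_ = process(true, &trace)
	fmt.Println(trace) // [open work close open error close]
}
```

One caveat: log.Fatal exits the program through os.Exit, which does not run deferred calls. In the snippets of this post the operating system reclaims the file handle anyway, but in a longer-lived program it is usually safer to return errors and let the deferred Close run.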

Then, let’s ensure it is a valid Starcraft2 replay file by checking the format name at the beginning:

func main() {
	// ...
	formatName := readNextBytes(file, 4)
	fmt.Printf("Parsed format: %s\n", formatName)

	if string(formatName) != "MPQ\x1b" {
		log.Fatal("Provided replay file is not in correct format.")
	}
}

func readNextBytes(file *os.File, number int) []byte {
	bytes := make([]byte, number)

	_, err := file.Read(bytes)
	if err != nil {
		log.Fatal(err)
	}

	return bytes
}

The most interesting part of this code is the readNextBytes function. It takes two parameters: the file pointer and the number of bytes to read. As a first step, we instantiate a slice of bytes to store our result. Here, the slice has a fixed length of number bytes. Then, we fill it with as many bytes as we can through the file.Read function. Note that Read reads up to len(bytes) bytes, so it may return fewer bytes than requested.

You may notice it returns several values (another great feature of Go). The first one is the number of bytes read. As we have no use for it, we simply ignore it with an underscore, the blank identifier (the equivalent of /dev/null for Go variables). The second returned value is an error. If it is not nil, we log the error message and exit.

Finally, we check the format by converting our byte slice into a string and ensuring it is equal to MPQ\x1b.

Now that we have a valid file, we could continue parsing manually, using the readNextBytes function to get each subsequent sequence of bytes. However, there is a far more elegant way, achieved in two simple steps.

First, define a structure to store all the parsed attributes:

type Header struct {
	UserDataMaxSize uint32
	HeaderOffset uint32
	UserDataSize uint32
	_ [5]byte
	Starcraft2 [22]byte
}

According to our investigations, the UserDataMaxSize comes right after the format, encoded on 4 bytes. As it is an integer, let’s store it in a uint32, which is also encoded on 4 bytes. Matching field lengths with the correct fixed-size types is the key point here, as the binary parsing fills the fields in declaration order, consuming exactly as many bytes as each field's type occupies.

Notice the field named _. As seen previously, it means we don’t care about its value. As there are 5 unknown bytes, we still need to account for them in the structure so that the following fields stay aligned with the data. In this case, the parser simply moves the cursor five bytes forward. We may use several _ fields if required.

Now, here is the magical glue between our structure and file:

import (
	// ...
	"bytes"
	"encoding/binary"
)

func main() {
	// ...

	header := Header{}
	data := readNextBytes(file, 39) // 3 * uint32 (4 bytes each) + 5 * byte + 22 * byte = 39

	buffer := bytes.NewBuffer(data)
	err = binary.Read(buffer, binary.LittleEndian, &header)
	if err != nil {
		log.Fatal("binary.Read failed: ", err)
	}

	fmt.Printf("Parsed data:\n%+v\n", header)
}

We read the next 39 bytes, corresponding to the size of our structure (the first 4 bytes, holding the format name, have already been consumed). Then, we instantiate a new bytes buffer, which we map to our structure through binary.Read. The second argument specifies the endianness of our data: in other words, the order in which bytes are stored. With binary.LittleEndian, the least significant byte is stored at the smallest address, at the beginning of a given byte sequence.

That’s all folks! In just three lines of code, we were able to parse our binary file. Go is magic, isn’t it?

Note: the whole code is available as a Gist.
