CS Lessons #001: Working with binary files

This series of articles is about core programming, algorithms and data structures, internet protocols, etc. See the introduction and motivation here.

What is a binary file?

Technically, every file is a binary file: it is just a stream of arbitrary bytes (values from 0 to 255) stored on disk. But if a file contains only ASCII characters (values from 0 to 127), you can call it a text file.

It can be XML, CSV or some kind of configuration file. If you open such a file in a text editor, you can understand what is inside.

On the other hand, a binary file can be a JPEG or a ZIP. If you open such a file in a text editor, you will see a bunch of gibberish. You may still spot some ASCII characters, but as a whole it won't make any sense.

To make sense of it, you need to know the specification of the particular format. This is a document that describes what each byte means. For example, in the PNG Specification you can read that the first 8 bytes always contain the values 137 80 78 71 13 10 26 10. Using this information you can write a program that checks whether a file is a PNG file. You just need to read the first 8 bytes and compare them with this signature.
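The signature check described above could look roughly like this. It is a minimal sketch: png_signature? and png_file? are helper names I made up for illustration, and the comparison is done on byte values so that string encodings do not get in the way.

```ruby
# The 8 signature bytes quoted from the PNG Specification.
PNG_SIGNATURE = [137, 80, 78, 71, 13, 10, 26, 10].freeze

# Hypothetical helper: true when the given string starts with the PNG signature.
def png_signature?(bytes)
  bytes.byteslice(0, 8).bytes == PNG_SIGNATURE
end

# Hypothetical helper: reads only the first 8 bytes of a file and checks them.
# to_s covers the case of an empty file, where nothing can be read.
def png_file?(path)
  png_signature?(File.binread(path, 8).to_s)
end
```

Reading just 8 bytes keeps the check cheap even for very large files.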

Throughout this article I will show you how to work with BMP files. I know that it is an ancient format, but it has sentimental value for me. About 16 years ago I wrote a program to display BMP files in Turbo Pascal.

Reading binary data

First, let's read some data from a file. You can download a sample BMP file from here. In Ruby the easiest way to read data is to use the File.read method:

data = File.read("lena512.bmp")

Or, if you need to do more operations on a file, you can first open it and then read data from the file object:

data = nil

File.open("lena512.bmp", "r") do |file|
  data = file.read
end

Notice the "r" as the second argument. It tells Ruby to open the file only for reading. The default mode for opening a file is text mode. To open a file in binary mode, you need to pass "rb":

data = nil

File.open("lena512.bmp", "rb") do |file|
  data = file.read
end

File.read uses text mode by default. You can pass mode: "rb" as an option, but it is simpler to use the File.binread method instead:

data = File.binread("lena512.bmp")

On UNIX-like systems the operating system makes no distinction between text and binary files, but there is a slight difference in how Ruby handles files in binary and text mode:

data = File.read("lena512.bmp")
data.encoding
=> #<Encoding:UTF-8>

data = File.binread("lena512.bmp")
data.encoding
=> #<Encoding:ASCII-8BIT>

The main difference is in the encoding of the data you read. In binary mode you get data encoded as ASCII-8BIT, which is the encoding Ruby uses for plain byte strings. You want this encoding when working with binary data, for example when you store it in a database. If the string is labeled as UTF-8, you are likely to get errors about invalid or incompatible characters.

You can also relabel data that was read in text mode. String#force_encoding does not change the bytes, only the encoding attached to the string:

data = File.read("lena512.bmp")
data.force_encoding("ASCII-8BIT")
data.encoding
=> #<Encoding:ASCII-8BIT>
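To see why the encoding label matters, here is a small sketch that does not need any file. It takes the first four bytes of the PNG signature and looks at them once as binary and once as UTF-8. The variable names are mine.

```ruby
# Array#pack with "C*" returns a binary (ASCII-8BIT) string.
raw = [137, 80, 78, 71].pack("C*")

raw.encoding                 # the binary encoding, also known as Encoding::BINARY
raw.valid_encoding?          # true, any byte sequence is valid binary data

# Relabel a copy as UTF-8: the bytes stay the same, only the interpretation changes.
as_utf8 = raw.dup.force_encoding("UTF-8")

as_utf8.bytes == raw.bytes   # true
as_utf8.valid_encoding?      # false, byte 137 cannot start a UTF-8 character
```

This is the kind of string that makes UTF-8 aware code complain, which is why binary data is better kept as ASCII-8BIT.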

Decoding binary data

In Ruby there is a method String#unpack which you can use to decode data from a binary string (or just a normal string). I must admit that I have done many Ruby projects but only recently read the documentation for this method, and it turns out to be pretty simple. It takes one argument, a template string, that describes how to decode the binary data. You can decode 1, 2, 4 or 8-byte integers, and you can choose whether they are signed or unsigned and whether they are in Little or Big-Endian format. I will explain those topics later on. Here are some examples:

# This binary string contains a date: day is encoded as 1 byte, month as 1 byte
# and year as 2 bytes
# This is very short, but still it is a format specification
data = "\x14\a\xB1\a"
data.unpack("CCS")
=> [20, 7, 1969]

As you can see, the date was decoded correctly. C passed as an argument to String#unpack means a 1-byte unsigned integer and S means a 2-byte unsigned integer. You can find the complete documentation for String#unpack here. Now let's try to decode the same data with a different format:

data = "\x14\a\xB1\a"
data.unpack("L")
=> [129042196]

It is just some number. L means a 4-byte unsigned integer. You can see that knowing the binary format specification is necessary to get correct values out of the data.
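To see where 1969 and that big number come from, here is the same decoding done by hand on the individual bytes. It is only a sketch and it assumes the low byte comes first (Little-Endian, explained later in this article), which is what S and L do on most machines. The "v" directive used at the end always means a 2-byte unsigned Little-Endian integer.

```ruby
data  = "\x14\a\xB1\a".b
bytes = data.bytes                 # => [20, 7, 177, 7]

day   = bytes[0]
month = bytes[1]
# 2-byte value: low byte first, then the high byte worth 256 each.
year  = bytes[2] + bytes[3] * 256  # => 1969

# The same 4 bytes read as one 4-byte value.
as_long = bytes[0] + bytes[1] * 256 + bytes[2] * 65_536 + bytes[3] * 16_777_216
# => 129042196

data.unpack("CCv")                 # => [20, 7, 1969]
```

The bytes never change, only the way they are grouped and weighted.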

Reading BMP file

Now it is time to finally parse data from a more complex binary file into something useful. Take a look at the BMP Specification.

It says that a BMP file contains four parts:

file header
image header
color table
pixel data.

The file header is always 14 bytes long and has these 5 fields:

bfType, 2 bytes, BMP file signature "BM"
bfSize, 4 bytes, total size of file
bfReserved1, 2 bytes, unused, must be 0
bfReserved2, 2 bytes, unused, must be 0
bfOffBits, 4 bytes, offset to pixel data

The image header is more complicated. There are actually 7 different versions of it, depending on the format version and operating system. They have different sizes, so you can usually tell which one a file uses by looking at the header size stored in its first field. In this article I will cover only BMP files with 256 colors and an image header of 40 bytes. It has the following fields:

biSize, 4 bytes, header size, must be 40
biWidth, 4 bytes, image width in pixels
biHeight, 4 bytes, image height in pixels
biPlanes, 2 bytes, must be 1
biBitCount, 2 bytes, bits per pixel
biCompression, 4 bytes, compression type
biSizeImage, 4 bytes, image size
biXPelsPerMeter, 4 bytes, preferred horizontal resolution in pixels per meter
biYPelsPerMeter, 4 bytes, preferred vertical resolution in pixels per meter
biClrUsed, 4 bytes, number of colors used
biClrImportant, 4 bytes, number of important colors
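Since the header versions differ in size, reading biSize first is a cheap way to find out what you are dealing with. The sketch below assumes the size field sits right after the 14-byte file header and is stored as a 4-byte Little-Endian integer ("V"). The HEADER_NAMES table lists only the sizes I am sure about, so treat it as incomplete, and image_header_size is a made-up helper name.

```ruby
# Known header sizes and their usual names (an incomplete, assumed table).
HEADER_NAMES = {
  12  => "BITMAPCOREHEADER",
  40  => "BITMAPINFOHEADER",
  108 => "BITMAPV4HEADER",
  124 => "BITMAPV5HEADER"
}.freeze

# Hypothetical helper: expects at least the first 18 bytes of a BMP file
# and returns the value of the biSize field.
def image_header_size(bytes)
  bytes.byteslice(14, 4).unpack1("V")
end

# Fake beginning of a file: 14 placeholder bytes followed by biSize = 40.
sample = ("\x00" * 14).b + [40].pack("V")
HEADER_NAMES.fetch(image_header_size(sample), "unknown")  # => "BITMAPINFOHEADER"
```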

The color table defines the colors used in the image. In 256-color files the color table is 1024 bytes long, because each color is described by 4 bytes. First comes the blue value, then green and then red. The fourth byte is unused and equals 0.

The pixel data follows the color table. Each pixel is just 1 byte, and it is an index into the color table.
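Decoding the color table is mostly a matter of cutting it into 4-byte groups. Here is a minimal sketch that turns the raw bytes into [red, green, blue] triples. parse_color_table is a name I picked for the example, and the sample data is made up.

```ruby
# Hypothetical helper: converts raw color table bytes into [red, green, blue]
# triples. Each entry is stored as blue, green, red, reserved.
def parse_color_table(binary)
  binary.unpack("C*").each_slice(4).map do |blue, green, red, _reserved|
    [red, green, blue]
  end
end

# Two made-up entries: pure blue, then pure red.
sample = [255, 0, 0, 0, 0, 0, 255, 0].pack("C*")
parse_color_table(sample)  # => [[0, 0, 255], [255, 0, 0]]
```

With the table decoded this way, a pixel byte with value n refers to the color at index n of the resulting array.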

require "pp"

# define file header structure
FileHeader = Struct.new(
  :bfType,
  :bfSize,
  :bfReserved1,
  :bfReserved2,
  :bfOffbits
)

# define image header structure
ImageHeader = Struct.new(
  :biSize,
  :biWidth,
  :biHeight,
  :biPlanes,
  :biBitCount,
  :biCompression,
  :biSizeImage,
  :biXPelsPerMeter,
  :biYPelsPerMeter,
  :biClrUsed,
  :biClrImportant
)

File.open("lena512.bmp", "rb") do |file|
  # read 14 bytes, this is the size of file header
  binary = file.read(14)

  # decode binary data
  # A2 - arbitrary string, 2 is there because there are 2 bytes, "BM"
  # L - this is bfSize, 4 bytes unsigned
  # S - bfReserved1, 2 bytes unsigned
  # S - bfReserved2, 2 bytes unsigned
  # L - bfOffBits, 4 bytes unsigned
  data = binary.unpack("A2 L S S L")
  file_header = FileHeader.new(*data)

  # read 40 bytes, this is the size of image header
  binary = file.read(40)

  # decode binary data
  # L - biSize, 4 bytes unsigned
  # L - biWidth, 4 bytes unsigned
  # L - biHeight, 4 bytes unsigned
  # S - biPlanes, 2 bytes unsigned
  # S - biBitCount, 2 bytes unsigned
  # L - biCompression, 4 bytes unsigned
  # L - biSizeImage, 4 bytes unsigned
  # L - biXPelsPerMeter, 4 bytes unsigned
  # L - biYPelsPerMeter, 4 bytes unsigned
  # L - biClrUsed, 4 bytes unsigned
  # L - biClrImportant, 4 bytes unsigned
  data = binary.unpack("L L L S S L L L L L L")
  image_header = ImageHeader.new(*data)

  pp file_header
  pp image_header
end

Output from this program should be something like this:

#<struct FileHeader
  bfType="BM",
  bfSize=263222,
  bfReserved1=0,
  bfReserved2=0,
  bfOffbits=1078>
#<struct ImageHeader
  biSize=40,
  biWidth=512,
  biHeight=512,
  biPlanes=1,
  biBitCount=8,
  biCompression=0,
  biSizeImage=262144,
  biXPelsPerMeter=0,
  biYPelsPerMeter=0,
  biClrUsed=256,
  biClrImportant=0>

From the file header you can see that the total file size is 263222 bytes and the pixel data offset is 1078 bytes. That makes sense because the file header is 14 bytes, the image header is 40 bytes and the color table is 1024 bytes: 14 + 40 + 1024 == 1078.

From the image header you know that the image is 512x512, there are 8 bits per pixel and the total image data size is 262144 bytes: 263222 - 1078 == 262144.
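The arithmetic above can be turned into a small sanity check. The numbers are copied from the program output. One thing is my own addition and not covered in this article: as far as I know, each row of pixels in a BMP file is padded to a multiple of 4 bytes, so the sketch computes the row size with that rule. For a 512-pixel-wide, 8-bit image the padding is zero anyway.

```ruby
# Values taken from the decoded headers.
bf_size     = 263_222
bf_off_bits = 1078
width       = 512
height      = 512
bit_count   = 8
colors      = 256

# Assumed rule: rows are padded to a multiple of 4 bytes.
row_size   = ((bit_count * width + 31) / 32) * 4  # => 512
image_size = row_size * height                    # => 262144

# File header + image header + color table (4 bytes per color).
headers_size = 14 + 40 + colors * 4               # => 1078

headers_size == bf_off_bits                       # => true
headers_size + image_size == bf_size              # => true
```

If any of these comparisons fails for a file, either the file is damaged or it uses a variant of the format that this article does not cover.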

Cursor position, seek and rewind

Sometimes you may want to read data from the middle of a file or from its end. You can use the File#seek method for that. Let's read only the color table from the BMP file:

File.open("lena512.bmp", "rb") do |file|
  # First 54 bytes are file header + image header, so we want to skip that
  file.seek(54)

  # Read color table, which is 1024 bytes
  color_table = file.read(1024)
end
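The cursor can also be inspected with File#pos and moved back to the beginning with File#rewind. Here is a small sketch that shows them together with the different seek modes. It uses a temporary file with 10 made-up bytes instead of the BMP file, so each byte value equals its own offset, and it collects what it sees in a results hash that exists only for this example.

```ruby
require "tempfile"

results = {}

Tempfile.create("seek-demo") do |file|
  file.binmode
  # Ten bytes with values 0..9, so every byte equals its own offset.
  file.write((0..9).to_a.pack("C*"))

  file.rewind                              # back to the beginning
  results[:start] = file.pos               # 0

  file.seek(4)                             # absolute offset from the start
  results[:middle] = file.read(2).bytes    # [4, 5]
  results[:after_read] = file.pos          # 6, reading moves the cursor

  file.seek(-3, IO::SEEK_END)              # three bytes before the end
  results[:tail] = file.read.bytes         # [7, 8, 9]

  file.seek(-2, IO::SEEK_CUR)              # relative to the current position
  results[:relative] = file.read(1).bytes  # [8]
end
```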

Encoding data as binary

To save something into a binary file, you first need to encode it as a binary string. To do this, use the Array#pack method, which is the exact opposite of String#unpack and takes the same template string as its argument.

Let's say that I want to encode an array of integers as 2-byte unsigned integers, and at the beginning I want to put the number of elements in this array. The number of elements will be encoded as a 1-byte unsigned integer:

# First element 8 is number of elements that I want to store
input = [8, 557, 912, 818, 376, 887, 148, 725, 366]

# encode input array as binary string, S* means to use the same encoding
# for all remaining elements
data = input.pack("CS*")
=> "\b-\x02\x90\x032\x03x\x01w\x03\x94\x00\xD5\x02n\x01"
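Decoding works the same way in reverse. The sketch below builds the count prefix from the array itself instead of typing it by hand, then unpacks everything with the same template and splits the count from the values. The variable names are mine.

```ruby
values = [557, 912, 818, 376, 887, 148, 725, 366]

# Prefix the values with their count (it has to fit in one unsigned byte).
encoded = [values.size, *values].pack("CS*")

# The first decoded number is the count, the rest are the stored elements.
count, *decoded = encoded.unpack("CS*")

count              # => 8
decoded == values  # => true
encoded.bytesize   # => 17, one count byte plus 8 * 2 bytes of values
```

Comparing count with decoded.size after decoding is a cheap way to detect truncated data.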

Writing data to file

Writing is similar to reading. You need to remember to set binary mode, or just use the File.binwrite method:

data = "\b-\x02\x90\x032\x03x\x01w\x03\x94\x00\xD5\x02n\x01"

File.binwrite("data.bin", data)

# or with opening file

File.open("data.bin", "wb") do |file|
  file.write(data)
end
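A quick way to convince yourself that nothing got lost is to write the data and read it back. This sketch does it in a temporary directory so it leaves no files behind. The short sample data is made up.

```ruby
require "tmpdir"

data = [2, 557, 912].pack("CS*")  # 1 count byte + 2 * 2 bytes of values

written   = nil
read_back = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "data.bin")

  written   = File.binwrite(path, data)  # returns the number of bytes written
  read_back = File.binread(path)
end

written            # => 5
read_back == data  # => true
```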

Little vs Big Endian

When you start working with binary data, you will quickly run into the concept of endianness. It describes how bytes are ordered in a stream of binary data.

For example, the value 1024 requires 2 bytes and can be written as 00000100 00000000 in binary or as 04 00 in hexadecimal. The byte on the left (00000100, or 04 in hexadecimal) is more significant than the one on the right because it represents a bigger value. In computer science it is called the most significant byte, and the byte on the right is called the least significant byte. Generally, bytes on the left are more significant than those on the right. It is the same with decimal notation: in 1024, 1 is the most significant digit because it stands for 1000, and 4 is the least significant digit because it stands for 4. The same idea applies to bits: within a byte, the leftmost bit is the most significant and the rightmost bit is the least significant.

Endianness says in what order the bytes of a value are stored. Big-Endian means that the most significant byte comes first, and Little-Endian means that the least significant byte comes first. The value 1024 will be stored as 04 00 in Big-Endian format and as 00 04 in Little-Endian format. Both formats are widely used. Big-Endian is more natural for people because it matches how we write numbers in decimal. It is also very common in data networking. Little-Endian, however, is the usual format for storing data in microprocessors. I guess it was easier to design a microprocessor for this format. You can read more about it on the linked Wikipedia page.
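The split into most and least significant byte can be done by hand with shifts and masks. This is just a sketch of the arithmetic, with variable names of my choosing.

```ruby
value = 1024

high = (value >> 8) & 0xFF   # most significant byte  => 4
low  = value & 0xFF          # least significant byte => 0

big_endian    = [high, low]  # => [4, 0]
little_endian = [low, high]  # => [0, 4]

# Putting the two bytes back together gives the original value.
(high << 8) | low            # => 1024
```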

Let's play a little bit with 1024 value:

# S means encode as a 2-byte unsigned integer
[1024].pack("S")
=> "\x00\x04"

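Looking at the output you can see that this is Little-Endian format: the least significant byte 00 comes first, followed by the most significant byte 04. That is because S uses the native byte order of the machine, which is Little-Endian on most of today's processors. You can of course encode the same value as Big-Endian by passing additional information to the pack method: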

# > means encode as Big-Endian
[1024].pack("S>")
=> "\x04\x00"

The important thing is that you need to know in which format the data is encoded and use the same format when decoding. If you mix them up, you will get incorrect results:

# Here you will get incorrect result
[1024].pack("S>").unpack("S")
=> [4]

# Correct result
[1024].pack("S>").unpack("S>")
=> [1024]
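If the code may run on machines with different byte orders, it is safer to always spell the byte order out. Besides the > modifier there is also <, and pack has fixed-order shorthands: "v" and "n" for 2-byte Little and Big-Endian, "V" and "N" for 4 bytes. The sketch below shows them side by side. Note that BMP stores its fields in Little-Endian order, so "v" and "V" would be the portable choice for the headers above.

```ruby
# "S" follows the machine, so this tells you the native byte order.
native_is_little = [1].pack("S") == [1].pack("S<")

# An explicit byte order gives the same bytes everywhere.
little = [1024].pack("S<")        # "\x00\x04"
big    = [1024].pack("S>")        # "\x04\x00"

# Fixed-order shorthands for 2-byte unsigned integers.
[1024].pack("v") == little        # => true
[1024].pack("n") == big           # => true

# 4-byte Little-Endian, as used for fields like bfSize.
[263_222].pack("V").unpack1("V")  # => 263222
```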

Signed vs Unsigned Integers

This is another aspect of encoding values as binary data. Unsigned integers can represent positive values or 0, so 1 byte can represent values from 0 to 255, 2 bytes can represent values from 0 to 65535, etc. Signed integers can also represent negative values. To do that, the most significant bit is used to store information about the sign. This leaves one bit less for the actual value, so 1 byte can now represent values from -128 to 127 and 2 bytes can represent values from -32,768 to 32,767. Again, the important thing is to know in what format the data is encoded and to use the same format for decoding. A file format specification always says in what format each value is encoded.
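The ranges follow directly from the number of bits. Here is a short sketch that computes them for any size. integer_ranges is a helper name I made up.

```ruby
# Hypothetical helper: value ranges for an integer stored in the given number of bytes.
def integer_ranges(bytes)
  half = 2**(bytes * 8 - 1)  # how many values fit once one bit is given to the sign

  { unsigned: 0..(2 * half - 1), signed: -half..(half - 1) }
end

integer_ranges(1)  # unsigned 0..255,   signed -128..127
integer_ranges(2)  # unsigned 0..65535, signed -32768..32767
```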

Let's look at some examples:

# Encode value as signed 2-byte integer, notice that we are using lowercase s
# for encoding
[-1024].pack("s").unpack("S")
=> [64512]

# Use correct encoding
[-1024].pack("s").unpack("s")
=> [-1024]
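The value 64512 is not random. Negative numbers are stored in two's complement form, so the same 16 bits mean -1024 when read as signed and 65536 - 1024 when read as unsigned. The conversion can be sketched by hand like this, with to_signed16 being a made-up helper:

```ruby
# Hypothetical helper: reinterprets an unsigned 16-bit value as a signed one.
def to_signed16(unsigned)
  unsigned >= 0x8000 ? unsigned - 0x10000 : unsigned
end

to_signed16(64_512)  # => -1024
to_signed16(1024)    # => 1024
to_signed16(0xFFFF)  # => -1
```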

The general rule with String#unpack and Array#pack is that you use uppercase characters (Q, L, S and C) for unsigned integers and lowercase characters (q, l, s and c) for signed integers.
