CS Lessons #001: Working with binary files
This series of articles is about core programming, algorithms and data structures, internet protocols etc. See the introduction and motivation here.
What is a binary file?
Technically, every file is a binary file: it is just a stream of arbitrary bytes (values from 0 to 255) stored on disk. But if a file contains only ASCII characters (values from 0 to 127), you can call it a text file.
It can be XML, CSV or some kind of configuration file. If you open such a file in a text editor, you can understand what is inside.
On the other hand, a binary file can be a JPEG or a ZIP. If you open such a file in a text editor, you will see a bunch of gibberish. You may still spot some ASCII characters, but as a whole it won't make any sense.
To make sense of it, you need the specification of the particular format. This is a document that describes what each byte means. For example, in the PNG Specification you can read that the first 8 bytes always contain the values
137 80 78 71 13 10 26 10. Using this information you can write a program that checks whether a file is a PNG file. You just need to read the first 8 bytes and compare them with this signature.
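As a minimal sketch of that check, you could compare byte values directly. The helper names png_signature? and png_file? are made up for this illustration and are not part of any library:

```ruby
# The 8-byte PNG signature, as listed in the PNG specification.
PNG_SIGNATURE = [137, 80, 78, 71, 13, 10, 26, 10].freeze

# Returns true if the given string starts with the PNG signature.
# Comparing byte values sidesteps any string-encoding issues.
def png_signature?(bytes)
  return false if bytes.nil?

  bytes.byteslice(0, 8).bytes == PNG_SIGNATURE
end

# Reads only the first 8 bytes of the file at +path+ and checks them.
# File.binread returns nil for an empty file, which the helper handles.
def png_file?(path)
  png_signature?(File.binread(path, 8))
end
```

Reading just 8 bytes means the check stays cheap even for very large files.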
Throughout this article I will show you how to work with BMP files. I know that it is an ancient format, but it has sentimental value for me. About 16 years ago I wrote a program to display BMP files in Turbo Pascal.
Reading binary data
First, let's read some data from a file. You can download a sample BMP file from here. In Ruby the easiest way to read data is to use the File.read method:
data = File.read("lena512.bmp")
Or if you need to do more operations on a file you can first open it and then read data from file object:
data = nil
File.open("lena512.bmp", "r") do |file|
  data = file.read
end
Notice the "r" passed as the second argument. It tells Ruby to open the file only for reading. The default mode for opening a file is text mode. To open a file in binary mode, you need to pass "rb":
data = nil
File.open("lena512.bmp", "rb") do |file|
  data = file.read
end
File.read does not take a mode string the way File.open does (you would have to pass a mode: "rb" option), so it is simpler to use the
File.binread method instead:
data = File.binread("lena512.bmp")
On UNIX like systems there is no distinction between text and binary files, but there is a slight difference in how Ruby handles files in binary and text mode:
data = File.read("lena512.bmp")
data.encoding
=> #<Encoding:UTF-8>
data = File.binread("lena512.bmp")
data.encoding
=> #<Encoding:ASCII-8BIT>
The main difference is in the encoding of the data you read. In binary mode you get a string encoded as ASCII-8BIT, which is Ruby's encoding for raw byte strings. This is the encoding you want when working with binary data; for example, you can safely store it in a database. With UTF-8 encoding you may get errors about invalid or incompatible characters.
You can also convert binary data to correct encoding:
data = File.read("lena512.bmp")
data.force_encoding("ASCII-8BIT")
data.encoding
=> #<Encoding:ASCII-8BIT>
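To see why the encoding matters, here is a small sketch that uses four arbitrary non-ASCII bytes in place of real file contents. The bytes and variable names are only for illustration:

```ruby
# Four arbitrary non-ASCII bytes standing in for file contents.
raw = "\xFF\xD8\xFF\xE0".b                 # String#b returns an ASCII-8BIT copy
as_text = raw.dup.force_encoding("UTF-8")  # same bytes, labeled as UTF-8

raw_valid  = raw.valid_encoding?      # true: any byte sequence is valid ASCII-8BIT
text_valid = as_text.valid_encoding?  # false: these bytes are not well-formed UTF-8
same_bytes = raw.bytes == as_text.bytes  # true: force_encoding never changes bytes
```

The bytes are identical in both strings; only the label Ruby attaches to them differs, and that label decides whether string operations treat the content as valid.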
Decoding binary data
In Ruby there is a method, String#unpack, which you can use to decode data from a binary string (or just a normal string). I must admit that I have done many Ruby projects, but only recently read the documentation for this method, and it turns out to be pretty simple. It takes one argument, a template string, that describes how to decode the binary data. You can choose to decode 1-, 2-, 4- or 8-byte integers, whether they are signed or unsigned, and whether they are in Little-Endian or Big-Endian format. I will explain those topics later on. Here are some examples:
# This binary string contains a date: day is encoded as 1 byte, month as 1 byte
# and year as 2 bytes.
# This is very short, but still it is a format specification.
data = "\x14\a\xB1\a"
data.unpack("CCS")
=> [20, 7, 1969]
As you can see, the date was decoded correctly. C passed as an argument to String#unpack means a 1-byte unsigned integer and S means a 2-byte unsigned integer. You can find complete documentation for String#unpack here. Now let's try to decode the same data with a different format:
data = "\x14\a\xB1\a"
data.unpack("L")
=> [129042196]
It is just some number. L means a 4-byte unsigned integer. You can see that knowing the binary format specification is necessary to get correct values out of the data.
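The results above assume a Little-Endian machine, because plain S and L use the native byte order. As a hedged sketch, the same decoding can be pinned down with the `<` modifier, and String#unpack1 (available in Ruby 2.4 and later) returns just the first decoded value:

```ruby
data = "\x14\a\xB1\a".b

# "<" forces Little-Endian regardless of the machine running the code.
# Whitespace inside a template is ignored, so it can be used for readability.
date = data.unpack("C C S<")    # => [20, 7, 1969]

# The same four bytes read as one Little-Endian 4-byte unsigned integer.
as_uint32 = data.unpack1("L<")  # => 129042196

# 129042196 is 1969 * 65536 + 7 * 256 + 20: the same bytes, grouped differently.
recomposed = 1969 * 65_536 + 7 * 256 + 20
```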
Reading BMP file
Now it is finally time to parse data from a more complex binary file into something useful. Take a look at the BMP Specification.
It says that BMP file contains four parts:
- file header
- image header
- color table
- pixel data.
The file header is always 14 bytes long and has these 5 fields:
- bfType, 2 bytes, BMP file signature "BM"
- bfSize, 4 bytes, total size of file
- bfReserved1, 2 bytes, unused, must be 0
- bfReserved2, 2 bytes, unused, must be 0
- bfOffBits, 4 bytes, offset to pixel data
The image header is more complicated. There are actually 7 different versions of it, depending on format version and operating system. Each version has a different size, so you can tell which one a file uses by reading the header size stored in its first field. In this article I will cover only BMP files with 256 colors and an image header of 40 bytes. It has the following fields:
- biSize, 4 bytes, header size, must be 40
- biWidth, 4 bytes, image width in pixels
- biHeight, 4 bytes, image height in pixels
- biPlanes, 2 bytes, must be 1
- biBitCount, 2 bytes, bits per pixel
- biCompression, 4 bytes, compression type
- biSizeImage, 4 bytes, image size
- biXPelsPerMeter, 4 bytes, preferred horizontal resolution in pixels per meter
- biYPelsPerMeter, 4 bytes, preferred vertical resolution in pixels per meter
- biClrUsed, 4 bytes, number of colors used
- biClrImportant, 4 bytes, number of important colors
The color table defines the colors used in the image. In 256-color files the color table is 1024 bytes long, because each color is described by 4 bytes. The first byte is the blue value, then green and then red. The fourth byte is unused and equals 0.
After the color table comes the pixel data. Each pixel is just 1 byte, and it is an index into the color table.
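To make the color table layout concrete, here is a small sketch that decodes a made-up two-entry table and looks up a few pixels. The helper name decode_color_table and the sample values are assumptions for illustration only:

```ruby
# Decodes a color table into an array of [red, green, blue] triples.
# Each entry is 4 bytes on disk: blue, green, red, unused.
def decode_color_table(binary)
  binary.unpack("C*").each_slice(4).map do |blue, green, red, _unused|
    [red, green, blue]
  end
end

# A made-up table with two entries: pure red, then a dark blue-gray.
sample_table = [0, 0, 255, 0, 48, 32, 16, 0].pack("C*")
palette = decode_color_table(sample_table)
# palette => [[255, 0, 0], [16, 32, 48]]

# Each pixel byte is an index into the palette.
pixels = [1, 0, 0].pack("C*")
colors = pixels.bytes.map { |index| palette[index] }
# colors => [[16, 32, 48], [255, 0, 0], [255, 0, 0]]
```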
require "pp"

# define file header structure
FileHeader = Struct.new(
  :bfType,
  :bfSize,
  :bfReserved1,
  :bfReserved2,
  :bfOffbits
)

# define image header structure
ImageHeader = Struct.new(
  :biSize,
  :biWidth,
  :biHeight,
  :biPlanes,
  :biBitCount,
  :biCompression,
  :biSizeImage,
  :biXPelsPerMeter,
  :biYPelsPerMeter,
  :biClrUsed,
  :biClrImportant
)

File.open("lena512.bmp", "rb") do |file|
  # read 14 bytes, this is the size of file header
  binary = file.read(14)

  # decode binary data
  # A2 - arbitrary string, 2 is there because there are 2 bytes, "BM"
  # L - this is bfSize, 4 bytes unsigned
  # S - bfReserved1, 2 bytes unsigned
  # S - bfReserved2, 2 bytes unsigned
  # L - bfOffBits, 4 bytes unsigned
  data = binary.unpack("A2 L S S L")
  file_header = FileHeader.new(*data)

  # read 40 bytes, this is the size of image header
  binary = file.read(40)

  # decode binary data
  # L - biSize, 4 bytes unsigned
  # L - biWidth, 4 bytes unsigned
  # L - biHeight, 4 bytes unsigned
  # S - biPlanes, 2 bytes unsigned
  # S - biBitCount, 2 bytes unsigned
  # L - biCompression, 4 bytes unsigned
  # L - biSizeImage, 4 bytes unsigned
  # L - biXPelsPerMeter, 4 bytes unsigned
  # L - biYPelsPerMeter, 4 bytes unsigned
  # L - biClrUsed, 4 bytes unsigned
  # L - biClrImportant, 4 bytes unsigned
  data = binary.unpack("L L L S S L L L L L L")
  image_header = ImageHeader.new(*data)

  pp file_header
  pp image_header
end
Output from this program should be something like this:
#<struct FileHeader bfType="BM", bfSize=263222, bfReserved1=0, bfReserved2=0, bfOffbits=1078>
#<struct ImageHeader biSize=40, biWidth=512, biHeight=512, biPlanes=1, biBitCount=8, biCompression=0, biSizeImage=262144, biXPelsPerMeter=0, biYPelsPerMeter=0, biClrUsed=256, biClrImportant=0>
From the file header you can see that total file size is 263222 bytes and pixel data offset is 1078 bytes. It makes sense because file header is 14 bytes, image header 40 bytes and color table is 1024 bytes,
14 + 40 + 1024 == 1078.
From image header you know that the image is 512x512, there are 8 bits per pixel and total image data size is 262144 bytes,
263222 - 1078 == 262144.
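BMP stores its multi-byte fields in Little-Endian order, and the width, height and resolution fields are signed in the 40-byte header, so a more portable variant of the templates spells that out. The sketch below is one possible way to write them. It builds both headers in memory from the values printed above so that it runs without the sample file, and the constant and variable names are made up for this example:

```ruby
# Explicit Little-Endian templates ("<"), with signed "l" where the
# 40-byte header uses signed fields.
FILE_HEADER_TEMPLATE  = "a2 L< S< S< L<"
IMAGE_HEADER_TEMPLATE = "L< l< l< S< S< L< L< l< l< L< L<"

# Build the headers in memory from the values in the output above.
file_header_bytes  = ["BM", 263222, 0, 0, 1078].pack(FILE_HEADER_TEMPLATE)
image_header_bytes = [40, 512, 512, 1, 8, 0, 262144, 0, 0, 256, 0]
                     .pack(IMAGE_HEADER_TEMPLATE)

# Decode them again, keeping only the fields used below.
signature, file_size, _res1, _res2, pixel_offset =
  file_header_bytes.unpack(FILE_HEADER_TEMPLATE)
header_size, width, height, _planes, bits_per_pixel =
  image_header_bytes.unpack(IMAGE_HEADER_TEMPLATE)

# Consistency checks matching the arithmetic in the text.
color_table_size = 4 * (2**bits_per_pixel)            # 1024 for 8 bits per pixel
expected_offset  = 14 + header_size + color_table_size # 1078
pixel_data_size  = file_size - pixel_offset            # 262144
```

Checks like these are a cheap way to reject a truncated or unexpected file before reading any pixel data.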
Cursor position, seek and rewind
Sometimes you may want to read data from the middle of file or from the end. You can use
File#seek method. Let's read only color table from BMP file:
File.open("lena512.bmp", "rb") do |file|
  # First 54 bytes are file header + image header, so we want to skip that
  file.seek(54)

  # Read color table, which is 1024 bytes
  color_table = file.read(1024)
end
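The cursor position can also be inspected with File#pos and reset with File#rewind, and File#seek accepts a second argument that makes the offset relative to the end of the file or to the current position. The following sketch shows these on a throwaway 10-byte temporary file; the file contents and the observed hash are only for illustration:

```ruby
require "tempfile"

observed = {}

# A throwaway 10-byte file containing the byte values 0..9.
Tempfile.create("seek-demo") do |file|
  file.binmode
  file.write((0..9).to_a.pack("C*"))
  observed[:after_write] = file.pos         # 10, the cursor sits at the end

  file.rewind                               # move back to the beginning
  observed[:after_rewind] = file.pos        # 0

  file.seek(4)                              # absolute offset from the start
  observed[:middle] = file.read(2).bytes    # [4, 5]

  file.seek(-3, IO::SEEK_END)               # 3 bytes before the end
  observed[:tail] = file.read.bytes         # [7, 8, 9]

  file.seek(-2, IO::SEEK_CUR)               # 2 bytes back from the cursor
  observed[:relative] = file.read(1).bytes  # [8]
end
```

Every read moves the cursor forward by the number of bytes read, which is why the relative seek at the end starts from position 10.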
Encoding data as binary
To save something into a binary file first you need to encode it as binary stream. To do this you need to use
Array#pack method, which is an exact opposite of
String#unpack and takes the same template string as argument.
Let's say that I want to encode an array of integers as 2-byte unsigned integers, and at the beginning I want to put the number of elements in this array. The number of elements will be encoded as a 1-byte unsigned integer:
# First element 8 is number of elements that I want to store
input = [8, 557, 912, 818, 376, 887, 148, 725, 366]

# encode input array as binary string, S* means to repeat the same encoding
# for as long as there is more data
data = input.pack("CS*")
=> "\b-\x02\x90\x032\x03x\x01w\x03\x94\x00\xD5\x02n\x01"
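Since the count travels with the data, a reader can use it to decode exactly that many values. Here is a minimal sketch of both directions. The helpers encode_list and decode_list are hypothetical names, and `S<` pins the byte order to Little-Endian so that the output does not depend on the machine:

```ruby
# Encodes values as a 1-byte count followed by 2-byte unsigned integers.
def encode_list(values)
  raise ArgumentError, "at most 255 values fit a 1-byte count" if values.size > 255

  [values.size, *values].pack("C S<*")
end

# Reads the count from the first byte, then decodes that many values.
def decode_list(binary)
  count = binary.unpack1("C")
  binary.byteslice(1, count * 2).unpack("S<#{count}")
end

values  = [557, 912, 818, 376, 887, 148, 725, 366]
encoded = encode_list(values)
decoded = decode_list(encoded)   # the original eight values again
```

Prefixing a list with its length is a very common pattern in binary formats, because it tells the reader where the list ends.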
Writing data to file
It is similar to reading. You need to remember to set binary mode, or just use the File.binwrite method:
data = "\b-\x02\x90\x032\x03x\x01w\x03\x94\x00\xD5\x02n\x01"
File.binwrite("data.bin", data)

# or with opening file
File.open("data.bin", "wb") do |file|
  file.write(data)
end
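A quick way to convince yourself that nothing is lost on the way to disk is to write the data and read it straight back. This sketch uses a temporary directory so it leaves no files behind; the path and variable names are made up for the example:

```ruby
require "tmpdir"

data = [8, 557, 912, 818, 376, 887, 148, 725, 366].pack("C S<*")

written = nil
read_back = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "data.bin")
  written   = File.binwrite(path, data)  # returns the number of bytes written
  read_back = File.binread(path)         # comes back as an ASCII-8BIT string
end

round_trip_ok = read_back == data
```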
Little vs Big Endian
When you start working with binary data, you will quickly run into the concept of endianness. It describes how bytes are ordered in a stream of binary data.
For example, the value 1024 requires 2 bytes to represent it, and you can write it as 00000100 00000000 in binary format or as 04 00 in hexadecimal format. The part on the left, 00000100 in binary format or 04 in hexadecimal format, is more significant than the part on the right because it represents a bigger value. In computer science it is called the most significant byte. The part on the right is called the least significant byte. Generally, bytes on the left are more significant than those on the right. It is the same with decimal representation. In the value 1024, 1 is the most significant digit because it represents a value of 1000, and 4 is the least significant digit because it represents a value of 4. There is also a concept of the most and least significant bit: the bit on the left side within a byte is the most significant and the bit on the right side is the least significant.
Endianness says in what order more significant bytes are stored. Big-Endian means that most significant byte comes first and Little-Endian that least significant byte comes first. The value of
1024 will be stored as
04 00 in Big-Endian format and as
00 04 in Little-Endian format. Both formats are widely used. Big-Endian is more natural for people because it matches how we perceive numbers in decimal format. It is also very common in data networking. Little-Endian, however, is a popular format for storing data in microprocessors. I guess it was easier to design a microprocessor for this format. You can read more about it on the linked Wikipedia page.
Let's play a little bit with Array#pack:
# S means encode as a 2-byte unsigned integer
[1024].pack("S")
=> "\x00\x04"
Looking at the output you can see that this is Little-Endian format. Least significant byte
00 comes first and then most significant byte
04. You can of course encode the same value as Big-Endian by passing additional information to Array#pack:
# > means encode as Big-Endian
[1024].pack("S>")
=> "\x04\x00"
The important thing is that you need to know in which format the data is encoded and to use the same format when decoding. If you mix them up you will get incorrect results:
# Here you will get incorrect result
[1024].pack("S>").unpack("S")
=> [4]

# Correct result
[1024].pack("S>").unpack("S>")
=> [1024]
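The results above were produced on a Little-Endian machine, where plain S behaves like `S<`. As a small sketch, you can make the byte order explicit in both directions and print the bytes as hex to see the difference. The `H*` directive renders a string as hexadecimal digits, and the variable names are only for illustration:

```ruby
value = 1024

big    = [value].pack("S>")   # Big-Endian: most significant byte first
little = [value].pack("S<")   # Little-Endian: least significant byte first

big_hex    = big.unpack1("H*")     # "0400"
little_hex = little.unpack1("H*")  # "0004"

# Decoding with the wrong byte order silently yields a different number.
misread = big.unpack1("S<")        # 4

# Plain "S" uses the byte order of the machine running the code.
native_is_little = [1].pack("S") == [1].pack("S<")
```

Spelling out `<` or `>` in every template keeps a parser correct even if it is later run on a machine with a different native byte order.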
Signed vs Unsigned Integers
This is another aspect of encoding values as binary data. Unsigned integers can represent positive values or 0, so 1 byte can represent values from 0 to 255, 2 bytes can represent values from 0 to 65535, and so on. Signed integers can also represent negative values. To do that, the most significant bit is used to store information about the sign. This leaves one bit fewer to encode the actual value, so 1 byte can now represent values from -128 to 127 and 2 bytes can represent values from -32,768 to 32,767. Again, the important thing is to be aware of the format in which the data is encoded and to use the same format for decoding. A file format specification always states how each value is encoded.
Let's look at some examples:
# Encode value as signed 2-byte integer, notice that we are using lowercase s
# for encoding
[-1024].pack("s").unpack("S")
=> [64512]

# Use correct encoding
[-1024].pack("s").unpack("s")
=> [-1024]
The general rule with
String#unpack is that you use uppercase characters (Q, L, S and C) to indicate unsigned integers and lowercase characters (q, l, s and c) for signed integers.
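To see how the same bits give different numbers, here is a short sketch that reads identical bytes as signed and as unsigned values and checks the edges of the 1-byte and 2-byte ranges. The variable names are made up for this example:

```ruby
# The same 16 bits, read as signed and as unsigned.
bits = [-1024].pack("s")
as_signed   = bits.unpack1("s")   # -1024
as_unsigned = bits.unpack1("S")   # 64512

# For a negative 2-byte value, the unsigned reading is value + 2**16.
wrapped = -1024 + 2**16           # 64512

# Edges of the 1-byte signed range, built from raw byte values.
one_byte_min = [0x80].pack("C").unpack1("c")   # -128
one_byte_max = [0x7F].pack("C").unpack1("c")   # 127

# All 16 bits set is -1 as signed and the maximum value as unsigned.
two_byte_max_unsigned = [-1].pack("s").unpack1("S")   # 65535
```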