CS Lessons #002: Data structures

There is a very good, classic book Data structures + algorithms = programs by Niklaus Wirth. I really like this title, because it explains in simple way what are the two fundamental components of any software program. With technology that we have available today: high-level programming languages, ubiquitous libraries and frameworks, those two fundamental components are hidden from us, developers. By using those libraries and frameworks we are more productive, but we are also limited by their capabilities. If you want to write great software and don’t want to be limited you should look beyond those standard tools and learn about basics.

This article should help you with that. It puts emphasis on data structures, but you will also learn about algorithms that operates on those data structures. As Wirth wrote in his book, data structures and algorithms are highly related.

In this article I will cover what are data structures, how can you measure their performance and how can you actually implement them.

What are data structures?

Data structures are objects that organize data and implement various operations that can be performed on them. All data structures share the same set of operations like adding new elements, removing elements or finding elements in data structure. The difference lies in how these data are organized inside data structure and how well particular data structure will perform some of those operations.

To write efficient program you need to be aware of this fact and choose appropriate data structure for the job. If your program requires to do a lot of insert operations and from time to time there is a need to perform search operation, you can use data structure that is very fast at adding new elements, but can be slow at searching elements.

Data Abstraction

Some developers when thinking about data structures they think in terms of implementation. Especially in cases of data structures that are built-in in programming languages such as arrays or hash-maps. You should avoid that, because when you think about implementation it will hard to understand how those data structures and algorithms work on higher level.

I think the best way to understand that is to think in an abstract way with the help of piece of paper. You can always draw how do you think data is organized and then simulate specific algorithms by changing this data on paper.

When you have the understanding of how everything works on this higher level you can start thinking about implementation in actual computer program.

Performance of algorithms

As I mentioned earlier various data structures generally implement the same set of operations, but performance of those operations may differ. To measure the performance of such operations or actually any algorithm we can use time complexity analysis.

This is quite complex topic, that requires fairly good amount of math knowledge. But if you don’t plan on inventing your own algorithms and proving their time complexity, you can live without it. In this article I will focus only on some basic rules to follow to calculate correct time complexity.

To represent time complexity, Big O notation is used. It is basically a function that depends on the size of input. For example, if array of size n (where n can be between 0 or 1000) is an input for algorithm, the time complexity of such algorithm can be O(n) or O(n^2) etc.

O(n) means that this algorithm will run in time proportional to the size of the input. Or in other words, if input size increase 10 times, run time will also increase 10 times.

Algorithm with O(n^2) time complexity will run in time proportional to the square of the input size. So if input size increase 10 times, run time will increase 100 times.

When I write about runtime of algorithms I don’t refer to the actual time in seconds, minutes etc. Time complexity is not used for expressing what will be the exact runtime of your algorithm. It operates on abstract units of time. If you have O(n) algorithm and if you measure that it runs in 5 seconds for input of size 1000, then you can safely assume that it will run in 5000 seconds for input of size 1000000.

Because time complexity operates on abstract time units and it is used for comparing how will algorithm runtime change when input size change, we can skip any constants and write only highest power element. Instead O(13n^2 + 10n + 7), we write just O(n^2).

If you consider two algorithms, one O(n) and second O(2n), and compare their runtime for n 1000 and 10000, you will see that the comparison will be exactly the same for both algorithms. Both will run 10 times slower for 10 times larger input.

Calculating time complexity

To calculate time complexity of an algorithm, you should to count all primitive operations, then think about how this total count relates to the input size and write final time complexity using Big O notation. Variable assignments, math operations, condition checks etc. are all primitive operations and are executed in relatively constant time. Because constants don’t affect time complexity, you can completely ignore those primitives. You should focus only on iterations, mainly loops and recursive calls.

For example if you have a loop that iterates over an input array of size n, your time complexity will be O(n). If you have two loops, executed one after another it will be 2*n, but time complexity will still be O(n), because we can ignore constants. Let’s look at some code:

# input array has the size of n
def print_array(array)
  # this loop will be executed n times
  array.each do |element|
    # inside loop this operation will be execute one time in constant time
    puts element
  end
end

This program is repeating one primitive operation puts element n times, so the total number of operationsis n * 1 and time complexity is O(n).

Here is another example:

# array has the size of n
def join_array(array)
  # below we have 4 primitive operations, 3 assignments and 1 method call to get
  # array size
  output = ""
  length = array.size
  index = 1


  # this loop will be executed n times
  array.each do |element|
    # convert element to string, concatenate strings and assignment, so we have
    # 3 primitive operations
    output = output + element.to_s


    # comparison and if statement so we can say those are 2 primitive operations
    if index < length
      # string concatenation, addition and 2 assignments, so 4 operations in
      # total
      output = output + ", "
      index = index + 1
    end
  end


  # 1 primitive operation
  return output
end

This method will take input array and return string with all elements joined by comma.

join_array([1,2,3])
=> "1, 2, 3"


join_array([1])
=> "1"


join_array([])
=> ""

Putting everything together, we have 4 basic operations at the beginning of the method, then we have a loop that is repeated n times. Inside this loop we have 9 basic operations, except for the last iteration, because if condition will evaluate to false. At the end we have last basic operation. In total it is 3 + n * 9 - 4 + 1, simplified to n * 9 and time complexity is O(n).

I put all those counting for basic operations as an exercise to show you bigger picture, but later on we will skip this part and will only count iterations.

Next example:

# array has the size of n
def accumulative_sum(array)
  output = []
  length = array.size


  # we need to calculate sum for each element of input array, so we need
  # to repeat this loop n times
  (0...length).each do |i|
    sum = 0


    # for 3rd element we need to sum first 3 elements of input array,
    # for 10th element we need to sum first 10 elements of input array,
    # that’s why this loop run from 0 to i, which is index of current element
    (0..i).each do |j|
      sum = sum + array[j]
    end


    output[i] = sum
  end


  output
end

Above method takes array of numbers as input and returns another array where each elements is the sum of elements from input array up to this point. Imagine that input array represents your sales data, exactly how many items you sold on each day. Output array will tell you how many items you had sold in total on each day.

accumulative_sum([1,2,3])
=> [1, 3, 6]

accumulative_sum([1,2,3,0])
=> [1, 3, 6, 6]

Take a look at inner loop. The number of iterations will depend on the value of i. The total count will look like this: 1 + 2 + 3 + 4 + ... + (n - 1) + n. Maybe this looks familiar to you. This is an arithmetic sequence and there is a well known formula for calculating arithmetic sequence sum. It is n * (a1 + an) / 2, where a1 is the first element in the sequence, an is the last element in the sequence and n is the number of elements in the sequence. Our first element equlas 1, last equals n and number of elements is n, so the formula looks like that n * (1 + n) / 2, which is (n^2 + n) / 2. After ignoring constants, final time complexity is O(n^2).

Of course you don't have to always use such formulas to calculate time complexity. With loops you should always take time for worst case scenario. In the example above, to calculate the last element of output array we need to sum all elements of input array, which means doing n operations. So the time complexity of the inner loop is O(n), and the time complexity of whole method is O(n^2), because inner loop is repeated n times.

Please note that presented implementation of accumulative sum is not optimal. I showed it here to demonstrate algorithm with time complexity O(n^2). Here is better implementation:

def fast_accumulative_sum(array)
  sum = 0
  output = []
  length = array.size

  (0...length).each do |i|
    sum = sum + array[i]
    output[i] = sum
  end

  output
end

In this implementation we maintain sum of already processed elements and in each iteration we add one more element. This method is O(n), because there is only one loop.

Think about it for a second. Such simple change in the algorithm caused change in time complexity from O(n^2) to O(n). Look at how runtime will change If we increase input 1000 times. The first method will run proportionally 1000000 times slower (1000^2), whereas the second method will run 1000 times slower. You can try to run both methods on bigger arrays and compare results.

To sum up this section here are steps that you should follow when calculating time complexity of an algorithm:

Look at main loop and count how many times it will be executed, as a function of input size
Look at loop’s body and calculate what is the time complexity of the loop’s body. You generally need to repeat this process from point 1
Multiple number of iterations of main loop by time complexity of loop’s body
You have final time complexity

Let’s look at more examples:

# array is our input
array = [...]

# size of input array
n = array.size

# total complexity is O(n^3), because it is n * O(n^2)
# repeat n times
(0..n).each do |i|
  # body of this loop is O(n^2), because it is n * O(n)

  # repeat n times
  (0..n).each do |j|
    # body of this loop is O(n), because it is n * O(1)

    # repeat n times
    (0..n).each do |k|
      # body of this loop is O(1)

      # perform some simple, constant operations
    end
  end
end

Here is the same example but implemented with smaller methods:

# this method is O(n)
def loop3(array)
  n = array.size

  # repeat n times
  (0..n).each do |i|
    # body of this loop is O(1)

    # perform some simple, constant operations
  end
end

# this method is O(n^2)
def loop2(array)
  n = array.size

  # repeat n times
  (0..n).each do |i|
    # body of this loop is O(n), because we are calling loop3 method which is O(n)

    loop3(array)
  end
end

# this method is O(n^3)
def loop1(array)
  n = array.size

  # repeat n times
  (0..n).each do |i|
    # body of this loop is O(n^2), because we are calling loop2 method which is O(n^2)

    loop2
  end
end

# total complexity is O(n^3), because we are calling loop1 method which is O(n^3) only once
loop1(array)

Logarithms

So far we saw examples of time complexities such as O(n), O(n^2) etc. Basically it is n to the power of 1, 2, 3 etc. Of course that’s not everything. There is one common function that will appear quite often. It is log n and it means logarithm of n with base 2. Logarithm is a mathematical function and it is an inverse of power. For example log 1024 equals 10, because 2^10 is 1024. In math you can have logarithms with any base, but in computer science almost in all cases it will be logarithm with base 2.

Hidden time complexity

You should be aware of hidden complexity of some built-in methods. Look at example code:

# array of size n
length = array.size

# this is another type of loop, we repeat this code n times
length.times do
  # this looks like a primitive operation
  array.max
end

Inside this loop we have only one method call, so you may assume that this is basic operation and the whole block of code is O(n), but calculating maximum value of an array is actually a O(n) operation. So the time complexity of this code is O(n^2).

Comparing two data structures

It is time to finally implement some data structures and compare well they perform at same operations. We will implement two types of arrays, first with default order and other with sorted order. We will implement only insert and find operations.

class DefaultOrderArray
  def initialize
    @data = []
  end

  def insert(elem)
    @data.push(elem)
  end

  def find(elem)
    index = 0

    while index < @data.length
      if @data[index] == elem
        return index
      end
      index += 1
    end

    return nil
  end
end

Insert operation is trivial, it only pushes element at the end of an array. Find is more complicated, but still very simple. It starts with index 0, which points at the first element in the @data array. Algorithm then checks each element in array and compares it with element that we are looking for. If element is found algorithm will return index of this element. Otherwise it will return nil.

Insert operation is O(1), because there is no loop. Find operation is O(n), because in the worst case we need to iterate over whole array if the element is at the end or is not in the array at all.

Here is how you can use this data structure:

array = DefaultOrderArray.new
array.insert(1)
array.insert(2)
array.insert(4)

array.find(1)
=> 0
array.find(2)
=> 1
array.find(3)
=> nil

Now let’s implement our array with sorted order.

class SortedOrderArray
  def initialize
    @data = []
  end

  def insert(elem)
    index = 0

    while index < @data.length && @data[index] < elem
      index += 1
    end

    rindex = @data.length

    while rindex > index
      @data[rindex] = @data[rindex - 1]
      rindex -= 1
    end

    @data[index] = elem
  end

  def find(elem)
    start = 0
    finish = @data.length - 1

    while start <= finish
      middle = (start + finish) / 2

      if @data[middle] == elem
        return middle
      elsif @data[middle] > elem
        finish = middle - 1
      else
        start = middle + 1
      end
    end

    return nil
  end
end

This data structure is more complicated than array with default order. Because we need to maintain sorted order of elements when inserting new element we need to first find the correct place for it, put it there and then move the rest of elements one place to the right.

# We want to insert element 3 inside an array below
[1, 2, 4, 5, 6]
       * <- here is where we need to insert this element

# First we need to make a room for this element by moving all elements to the
# right

[1, 2, 4, 4, 5, 6]
       * <- here is the same place, where we insert our new element

# After moving all elements to the right, we can insert new element
[1, 2, 3, 4, 5, 6]

First algorithm looks for correct place to insert new element by iterating over array and checking if current element is smaller than new element. If element that is bigger than new element is found it means that this is the place when new element should be inserted. Looking for correct place can take O(n) time, because in worst case we need to insert new element at the end.

Then algorithm needs to move rest of elements to the right. This requires to iterate over rest of elements in the array and it can also can take O(n) time in worst case when we need to insert new element at the beginning.

In total this operation is O(n). Comparing this to array with default order inserting is much slower, so you may start thinking that find operation should be much faster. And indeed it is. To implement find operation we are using algorithm called binary search and it can only work on sorted arrays.

Imagine a game with two people. First person thinks about a number from 1 to 100, second person guesses the number and if it is not correct number, first person tells if the number is smaller or greater. The goal is to guess correct number in smaller amount of turns. So what is the first guess? It should be 50. Then depending if number is smaller or greater the second guess should 25 or 75 and so on.

That is how binary search works. It looks at the middle element of an array. If this element is smaller than the one we are looking for, it means we it should check only the right half of an array. If element is bigger, it means it should check only the left half of an array. In the next iteration it looks again at the middle element of remaining array and repeats the process.

In each iteration algorithm reduces the problem by half by narrowing search area to only half of remaining array. Variables start and finish are used to keep track of current search area inside input array. In the end algorithm will end up with search area containing only one element. If this is the element we are looking for then algorithm returns its index. If not it returns nil.

So what is the time complexity of binary search? We know that it will end if the remaining array has size of 1. Algorithm starts with size of n and in each iteration it narrows the search area to only half of an input array. So we have n, n / 2, n / 4, n / 8, ... 1. The question is how many times can you divide a number by 2 to finally reach number 1. The answer is logarithm of n with base 2. So the time complexity is O(log n).

Now let’s compare this with find operation from array with default order, which is O(n). Array with default order will require time proportional to number of elements, so for array with one million elements it is one million operations. For array with sorted order it is only 20 operations, because log 1000000 equals 20 (approximately). log 1000000000 equals 30. As you can see the difference is huge, because logarithm is a function that grows really slowly.

Let’s put final comparison of both data structures:

       | Default | Sorted
-------+---------+---------
Insert | O(1)    | O(n)
-------+---------+---------
Find   | O(n)    | O(log n)

DefaultOrderArray is extremely fast at inserting elements, but slow at finding elements. On the other hand SortedOrderArray is slow at inserting, but very fast at finding elements. Knowing that, if you have specific problem that requires one of these operations to be very efficient you will know what to choose.

And this of course works the same for any computing problem. You need to list what operations are required to solve the problem, what data structures are available, what are their time complexities and then choose what will be the most efficient.

Amortized time complexity

I wrote that inserting element at the end of Ruby array with push method is constant operation. It is not exactly right. Ruby implementation of Array is very sophisticated comparing to other programming languages. Basic difference is that Ruby arrays have unlimited size (limited only by available memory).

When you create new array in Ruby it has some initial size and when you reach this limit and you want to add another element, Ruby needs to first increase its size. It does this by creating new array with bigger size and copying all elements from current array to the new one and then replacing the references. This all happens behind the scenes, so you as a developer don't have to worry about that.

This is another example of hidden complexity. The most common growth factor is 2, so it means that when you have array with size of 100, and wants to add new elements, Ruby will allocate new array of size 200. Then you can insert another 100 elements and each insert will be in O(1) time. But when you insert additional element past 200, then Ruby will need to allocate another new array of size 400. So this particular insert operation will be more costly.

In situations like this we call it amortized time complexity. So push method from Ruby array has amortized time O(1). Even if sometimes it will take O(n), when there is a need to allocate new array, after such costly operation you can perform n times new inserts in O(1) time before the need for another allocation. So the O(n) time for allocating new array is amortized across sequence of operations.

Summary

At this point you should have a good glimpse at what data structure are, how you can implement them, how you can calculate time complexity of various operations and how can you choose the right data structure for the job. In next articles I will build on basics presented here and introduce more advanced data structures and related algorithms.