Data structures (Priority queues)

2015-06-20 21:57 by Ian

This is an entry from the Manuvr Blog that I am cross-posting here.

This will be an examination of one of the two key data structures in ManuvrOS: The Priority Queue.

Additionally, this particular priority queue is...


What is a "data structure"?

In the beginning, there were types. Which are not structures. But they are related enough to cause massive amounts of confusion. So we will dwell on them for a moment.

Programmers need to consider many different kinds of data. It must be the programmer that thinks about this, because the CPU only 'understands' a handful of native types (called primitives). Typically these are...

Not coincidentally, these are the only native types that we have in C. Any kind of data that is not in that list is a software composition of those elemental types.

Strings occupy a strange stratum in computer science where "type" bleeds into "structure", because a string might be viewed as a very high-level type or as a very simple data structure.

To your CPU, there is no such thing as a 'string'. There are only those primitives given above. Although some languages that also specify their CPU (as Java does) have CPUs that consider strings to be not only a type, but a primitive type. And Java pays steep computational penalties for that choice.

I would argue: The idea of a 'null-terminated string' is properly viewed as a simple data structure, and not a type. Types have sizes known at compile-time. Furthermore, the interpretation of the data stored in a string is a matter of convention enforced by software libraries at run-time (string.h), and not your compiler.
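To make that distinction concrete, here is a minimal sketch. The function name naive_strlen is made up for illustration; it only mimics what strlen() from string.h does. The size of a type is something the compiler can check, while the length of a string can only be discovered by walking memory at run-time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A type's size is a compile-time fact. The compiler can verify this line
// without running anything.
static_assert(sizeof(std::int32_t) == 4, "int32_t occupies four bytes");

// A string's length is a run-time fact. The only way to learn it is to scan
// for the terminating '\0', which is a convention and not a compiler rule.
// (naive_strlen is a hypothetical stand-in for the library's strlen().)
std::size_t naive_strlen(const char* s) {
  assert(s != nullptr);
  std::size_t n = 0;
  while (s[n] != '\0') {
    ++n;
  }
  return n;
}
```

Nothing in the language stops you from handing this function a buffer with no terminator. The "structure" is held together by convention alone.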

Many data structures (like strings) have static type compositions. That is, a string will only ever consist of characters. But as most of you are likely aware, programmers are engines of inductive logic, and often need to plan for things they can't know ahead of time.

We need...

Data structures with abstract types

Which data is connected to what other data is itself data. As is the manner of its connection and order of things connected. For this sort of data, programmers use an abstract data type.

When I told you above that "the only native types that we have in C" were primitives, I lied. Most languages (even C!) have at least one abstract data structure built into them that is treated as if it were an elemental type. Arrays.

The type composition of an array is...

  1. its size (an integer)
  2. its type.

From this information alone (assuming the type is defined), the compiler can derive every other relevant fact about the array (data access, offsets, sizes, and pointers) that it needs to understand your code, and tell the CPU (which only knows about those basic types) how to deal with your data.

And in this way, we comfortably lose sight of the fact that arrays are nothing more than syntactic sugar. They have no fundamental low-level representation, and they can be used to hold any data you can imagine.
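A rough sketch of that sugar, spelled out by hand. The helper element_address is a made-up name for illustration. It does with explicit arithmetic what the compiler does for you whenever you write values[i].

```cpp
#include <cassert>
#include <cstddef>

// Computes the address of element `index` the long way: start at the base
// address and step forward by index * sizeof(T) bytes. This is the same
// arithmetic the compiler emits for `base[index]`.
// (element_address is a hypothetical helper, not part of any library.)
template <typename T>
const T* element_address(const T* base, std::size_t index) {
  assert(base != nullptr);
  const unsigned char* raw = reinterpret_cast<const unsigned char*>(base);
  return reinterpret_cast<const T*>(raw + index * sizeof(T));
}
```

For any array values and any in-range i, element_address(values, i) and &values[i] are the same pointer. The element type supplies the stride, and the size tells you where to stop.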

This is the essence of an abstract data structure (ADS); a data structure used with abstract types.

Please note that an abstract data structure (or type) is not the same thing as a template (which is simply a set of high-level instructions to the compiler). Oftentimes, templates are used as a means to write an ADS (as I did with this one), but they are quite different things. See the post on templates if this is confusing.

The goal of the abstraction varies, but the key thing to note here is that the structure doesn't care about the type of its payload. It only cares about containing the data about the payload's connections.
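As a sketch of what "doesn't care about the payload" means in practice, consider a carrier node written as a template. The names PriorityNode and chain_length are assumptions made for this example, and the real ManuvrOS class may be laid out differently.

```cpp
#include <cstddef>

// Hypothetical carrier node. The structure owns the connection data (next)
// and the ordering data (priority). The payload is whatever T happens to be.
template <typename T>
struct PriorityNode {
  PriorityNode<T>* next;
  T data;
  int priority;
};

// Counts the nodes in a chain. Note that it never touches `data`:
// the structure's logic works the same for any payload type.
template <typename T>
std::size_t chain_length(const PriorityNode<T>* head) {
  std::size_t count = 0;
  for (const PriorityNode<T>* cur = head; cur != nullptr; cur = cur->next) {
    ++count;
  }
  return count;
}
```

The same chain_length works for a chain of ints, a chain of strings, or a chain of pointers to something else entirely.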

Engineering considerations for priority queues

Abstract data structures have a property that I can't name, but it pertains to Turing completeness and universal gates in that: Given a data storage problem, basically any ADS can be used to solve it. Sometimes, you will find certain kinds of problems that so perfectly match a structure's organizing principles that you can use the structure itself as the code to solve the problem (uses the same priority queue under discussion here).

So clearly, the definition of a "good ADS selection" will depend on the problem.

I happen to like the versatility of priority queues. For an extra 4 bytes (to store that priority value), you get the ability to...

This concept of priority is enforced by a combination of an integer member of the carrier class, and the order of the connection of the list elements (PriorityQueue is implemented on top of a linked list). Note that we could have gotten a priority queue without the linked list. We could have used an array instead because the integer member in the carrier class is alone sufficient to get the same outward-facing behavior.
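For the curious, the array alternative might look something like the sketch below. Item and highest_priority_index are made-up names, and this is not how PriorityQueue works. The items sit in the array in no particular order, and the integer member alone decides which one comes out next.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical carrier for the array variant: a payload plus its priority.
struct Item {
  int payload;
  int priority;
};

// Returns the index of the highest-priority item in items[0..count).
// Ties go to the lowest index (the item stored first).
// For an empty array this returns 0, which equals count.
std::size_t highest_priority_index(const Item* items, std::size_t count) {
  assert(items != nullptr || count == 0);
  std::size_t best = 0;
  for (std::size_t i = 1; i < count; ++i) {
    if (items[i].priority > items[best].priority) {
      best = i;
    }
  }
  return best;
}
```

The outward-facing behavior is the same as a priority queue, but every dequeue pays for a full scan, and the array still has to live somewhere with a fixed capacity.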

So why choose a linked list?

ManuvrOS is meant to run in places where CPU and memory bandwidth are precious. The linked list allows us to avoid copying memory if/when we exceed some initial capacity of the queue. Because data within a linked list need not be contiguous, growing or shrinking the list costs very little: nothing that is already stored has to move. And that means we don't need to care how much memory the data might occupy when we create the queue.

To implement this structure in an array would mean that we would need to face a stark trade-off each time we instantiate a new queue. We would either...

  1. Create the queue with an initial size of 1 (for lowest resting memory load), and when we grow the queue, eat the CPU cost of allocating space for another array (peak memory usage x2) and copying the data into it. Or...

  2. Throw a bunch of memory at the queue up front, in anticipation that it will grow. That is: trade memory for CPU, because the allocated memory is unusable for anything else while it is reserved for the queue.
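The two options can be put side by side in a small sketch. Everything here (GrowableArray, make_array, append, release, and the grow-by-one policy) is an assumption made for illustration. The elements_copied counter exists only to make the CPU cost visible.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical array-backed storage with a counter for copy work.
struct GrowableArray {
  int* data = nullptr;
  std::size_t size = 0;
  std::size_t capacity = 0;
  std::size_t elements_copied = 0;  // total elements moved during growth
};

GrowableArray make_array(std::size_t initial_capacity) {
  GrowableArray a;
  a.data = (initial_capacity > 0) ? new int[initial_capacity] : nullptr;
  a.capacity = initial_capacity;
  return a;
}

void append(GrowableArray& a, int value) {
  if (a.size == a.capacity) {
    // Grow by one slot (assumed policy: lowest resting memory load).
    // Until the old array is freed, both arrays exist at once.
    int* bigger = new int[a.capacity + 1];
    for (std::size_t i = 0; i < a.size; ++i) {
      bigger[i] = a.data[i];
      ++a.elements_copied;
    }
    delete[] a.data;
    a.data = bigger;
    a.capacity += 1;
  }
  assert(a.size < a.capacity);
  a.data[a.size++] = value;
}

void release(GrowableArray& a) {
  delete[] a.data;
  a = GrowableArray{};
}
```

Appending four items to an array created with make_array(1) copies 1 + 2 + 3 = 6 elements along the way. Creating it with make_array(4) copies none, but holds all four slots from the start whether they are used or not.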

By implementing on top of a linked list (versus the array), we avoid both of these problems at the cost of an extra 4 bytes per datum (on a 32-bit target) to store the "->next" pointer.

The CPU sees a benefit from this choice as well (apart from not having to copy payload data). Insertion time into the queue is, in the worst case, linear with respect to the size of the list, because we may need to seek to the end of the list to insert a new payload. But since it is a priority queue, our insertion time will be (on average) better than linear to the extent that we insert higher-priority items as the queue gets deeper. We also have to compare an int along the way, but this is a trivial cost.
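A minimal sketch of such an insertion follows. It assumes the list is kept highest-priority-first and that equal priorities stay in arrival order. Node and insert_by_priority are names invented for this example, and the returned step count is only there to show the cost.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical list node: payload, priority, and the link to the next node.
struct Node {
  Node* next;
  int payload;
  int priority;
};

// Links `fresh` into the list so that higher priorities come first and
// equal priorities keep their arrival order. No existing node is moved or
// copied; only pointers change. Returns how many nodes were stepped past.
std::size_t insert_by_priority(Node*& head, Node* fresh) {
  assert(fresh != nullptr);
  std::size_t steps = 0;
  Node** link = &head;
  while (*link != nullptr && (*link)->priority >= fresh->priority) {
    link = &(*link)->next;
    ++steps;
  }
  fresh->next = *link;
  *link = fresh;
  return steps;
}
```

Inserting the lowest-priority item walks the whole list, which is the linear worst case. Inserting an item that outranks everything already queued takes zero steps, no matter how deep the queue is.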

The worst lookup time would be incurred by client code that did an iterative retrieval on the queue using the numeric position of the desired item. In this use-case, an array would be O(1) per lookup, which is the holy grail. As it stands today, each positional lookup in PriorityQueue is O(n), so a full pass over the queue by position would be O(n^2). But this is not the problem queues are meant to solve.
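To see where the quadratic cost comes from, here is a small sketch of positional access on a linked list. ListNode, get_at, and sum_by_position are hypothetical names, not the PriorityQueue API, and the hops counter is instrumentation added for the example.

```cpp
#include <cstddef>

// Hypothetical bare-bones list node for this example.
struct ListNode {
  ListNode* next;
  int payload;
};

// Returns the node at `position` (0-based), or nullptr if the list is too
// short. Every call starts over from the head. `hops` accumulates the number
// of links followed.
const ListNode* get_at(const ListNode* head, std::size_t position,
                       std::size_t& hops) {
  const ListNode* cur = head;
  while (cur != nullptr && position > 0) {
    cur = cur->next;
    --position;
    ++hops;
  }
  return cur;
}

// The costly pattern: visit every item by its numeric position.
int sum_by_position(const ListNode* head, std::size_t count,
                    std::size_t& hops) {
  int total = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const ListNode* node = get_at(head, i, hops);
    if (node != nullptr) {
      total += node->payload;
    }
  }
  return total;
}
```

For a list of n items, this pattern follows 0 + 1 + ... + (n - 1) = n(n - 1)/2 links in total. Walking the list once by following the next pointers would follow only n - 1.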