Dan's Musings

Common Data Structures in Common Lisp

Someone asked about building data structures with examples in Common Lisp on the Lisp Discord.

I thought, how fun. I'll write a blog post about it. So here we are.

We're going to cover creating three different types of data structures in Common Lisp: linked lists, growable vectors, and hash tables.

Before we start

I'm going to assume knowledge of some Common Lisp, and knowledge of Big-O notation. I know that might be a tall glass of water, but I'm covering knowledge from my whole CS 235 class over ten years after I attended said class in one blog post in Common Lisp and trying to write it in a single Saturday morning, so I'mma need your help here.

We're going to pretend that all the nice data structures Common Lisp gives us are gone from our memory. The only things we'll remember how to do after this rather unfortunate case of amnesia will be structs and simple-arrays. I'm trying to go fast-ish, so you might want to read up on those if you're unfamiliar.

Once we create a new data structure, all of a sudden all the functions and data structures CL provides to do the same thing will come flooding back into our memory. It's like Fifty First Dates only it'll be fifty first data structures.

Linked Lists

Overview

Well, we should probably start with Lisp's most favorite structure: the linked list. It is the LISt Processing language, after all.

Linked lists have boxes that point to other boxes. They look like this:

  +----------+        +----------+           +----------+
  | (data)   |------->| (data)   |...------->| (data)   x
  +----------+        +----------+           +----------+

In practice, a linked list is composed of a series of nodes with two fields in them: one to point to data, and the other to point to more list.

In the picture, the boxes represent a pointer that points to some data. The arrows represent a pointer that points to more boxes.

   +---+---+        +---+---+           +---+---+
   | O | O |------->| O | O |...------->| O | X |
   +---+---+        +---+---+           +---+---+
     |                |                   |
     |                |                   |
     v                v                   v
   (data)           (data)              (data)

Now wait a minute.

If we think of linked lists, not as the main datastructure, but as an interesting configuration of these pointer pairs:

   +---+---+
   | O | O |------->
   +---+---+
     |
     |
     v
   (data)

Then maybe it's not the list that's special, but the pointer pair thing.

🤔

What can I cook up...

How about a pair?

  +---+---+
  | O | O |
  +---+---+
    |   |
    v   v
  (data)(data)

How about a tree?

                     +---+---+
                     | O | O |
                     +-+-+-+-+
                       |   |
             +---------+   +---------+
             |                       |
             |                       |
         +---v---+               +---v---+
         | O | O |               | O | O |
         +-+-+-+-+               +-+-+-+-+
           |   |                   |   |
       +---+   +---+           +---+   +---+
       |           |           |           |
       |           |           |           |
   +---v---+   +---v---+   +---v---+   +---v---+
   | O | O |   | O | O |   | O | O |   | O | O |
   +-+-+-+-+   +-+-+-+-+   +-+-+-+-+   +-+-+-+-+
     |   |       |   |       |   |       |   |
     |   v       |   v       |   v       |   v
     |   (data)  |   (data)  |   (data)  |   (data)
     v           v           v           v
     (data)      (data)      (data)      (data)

etc. etc.

Implementation

Lets' make it!

Wait. What do we call the pointers in the pair? We could call them left and right or a and b but that is less helpful for intuition when we're actually making a list. But we don't want to call them data and next because we could maybe not be making a list but some other complex structure.

Let's give them crazy names like "container for antecedent record" and "container of descendant record". But that's kinda long, let's use car and cdr for short.

;;; Define a node struct
(defstruct pair
  car
  cdr)

That's... it. That's all we need to define a linked list.

Now we can haz list:

;;; New List
(defparameter my-list nil)

;;; Add to a list
(setf my-list (make-pair :car 1 :cdr my-list))
(setf my-list (make-pair :car 2 :cdr my-list))
(setf my-list (make-pair :car 3 :cdr my-list))
(setf my-list (make-pair :car 4 :cdr my-list))

;;; Traverse a list
(loop for node = my-list then (pair-cdr node)
      while node
      do (format t "~A " (pair-car node)))

;;; Remove the front item from a list
(setf my-list (pair-cdr my-list))

;;; Remove a specific item (2) from a list
(if (eql (pair-car my-list) 2)
    (setf my-list (pair-cdr my-list))
    (loop for node = my-list then (pair-cdr node)
          while (pair-cdr node)
          until (eql (pair-car (pair-cdr node)) 2)
          finally (setf (pair-cdr node) (pair-cdr (pair-cdr node)))))

However, I don't like the name pair. Pair of what? Let's call it a construction, or cons for short. Yeah. I like that.

(defstruct cons
  car
  cdr)

Woops! Look what happens when we plug that into SBCL!

* (defstruct cons car cdr)

debugger invoked on a SYMBOL-PACKAGE-LOCKED-ERROR in thread
#<THREAD tid=4436 "main thread" RUNNING {1004538143}>:
  Lock on package COMMON-LISP violated when defining CONS as a structure while
  in package COMMON-LISP-USER.
See also:
  The SBCL Manual, Node "Package Locks"
  The ANSI Standard, Section 11.1.2.1.2

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [CONTINUE      ] Ignore the package lock.
  1: [IGNORE-ALL    ] Ignore all package locks in the context of this operation.
  2: [UNLOCK-PACKAGE] Unlock the package.
  3: [ABORT         ] Exit debugger, returning to top level.

(PACKAGE-LOCK-VIOLATION #<PACKAGE "COMMON-LISP"> :SYMBOL CONS :FORMAT-CONTROL "defining ~S as a structure" :FORMAT-ARGUMENTS (CONS))
0]

It's saying "what are you doing, there's already a cons in Common Lisp."

Achievement unlocked! We remember cons now!

Wait, what about doubly-linked lists?

Well, we could do that, but then it would get rid of the cool "pair of whatever" feel of conses. Also, singly linked lists have this nice property of being more "functional". You can "make a new list" without modifying the old one:

(let* ((a (cons 1 nil))
      (b (cons 2 a)))
      (format t "b is [~{~A~^, ~}] and a is [~{~A~^ ~}]" b a))
;;; => b is [2, 1] and a is [1]

That's pretty sweet. It lets us treat those lists as immutable values instead of references like in other languages. This is the essence of functional programming.

There's also this nice property that singly-linked lists are inductive.

Consider the following function:

(defun sum (list)
 (loop
  for node = list then (cdr node)
  for sum = 0 then (+ sum (car node))
  until (null node)
  finally (return sum)))

We could write our function succinctly in terms of the base case and next step. This is like mathematical induction, and it's a very nice thing.

The more likely answer for "why no doubly linked lists" might be that "Lisp is old" and so back in 1958 2 addresses per entry was probably plenty, thank you etc., but it turns out to be pretty nice anyways to just use singly linked lists.

Pros

Conses are a powerful data structure for several reasons:

  1. They are dead simple to implement/use.
  2. They can form complex structures or simple lists.
  3. They can be used to implement other data structures.
  4. Linear time (where n is the number of conses) to grow and traverse the items a list or tree.
  5. Constant time to add or remove an item from the list (assuming you have a pointer to the predecessor to the cons you want to remove).
  6. They are conducive to functional programming (see the functional induction example above).

Cons

There are of course the usual drawbacks:

  1. Disparate memory locations can slow down access. When we allocate a new cons, it could be allocated anywhere, and this can cause CPU cache misses. Locality of memory is a strong argument against conses, but this can be optimized for in practice. Still, nothing's faster than a good old fashioned array.
  2. For lists, they use double the memory that a simple array (full of useful elements) would.
  3. Linear time access to the n-th element in a list. We can usually do better than this, especially if we use arrays.
  4. Finding something in the list is O(n) time.

Vectors

Overview

We want to make a growable vector, or a growable array of one dimension. We know about the simple-vector, but we want to make things complicated. We do this because we'd like to harness the power of the CPU cache by keeping everything in one contiguous memory location. Let's start with an array of [1, 2, 3, 4].

It might look like this:

+---+---+---+---+
| 1 | 2 | 3 | 4 |
+---+---+---+---+

But how to make sure we can grow the list to [1, 2, 3, 4, 5]?

Well, we could just make the array bigger than it needs to be and then track how many elements are actually in it:

size: 8
length: 4
array:

  +---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 |   |   |   |   |
  +---+---+---+---+---+---+---+---+

Now we have room to add more elements if needed!

size: 8
length: 5
array:

  +---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 | 5 |   |   |   |
  +---+---+---+---+---+---+---+---+

This may seem silly to just make a "big array" and then hope the list doesn't get bigger than the array size, but it's actually used in tons of places. If you're writing in C, this is everywhere. That's why there's things like maximum url lengths or maximum number of characters in a Windows path.

Indeed, some places do it on purpose, such as places accepting user input. Allowing the user to give input and then allowing that input to be arbitrarily large is just asking for trouble. If there's a limit on the user, might as well make the data structure simple too (especially in a low-level language like C).

Now we can grow the list to [1, 2, 3, 4, 5, 6, 7, 8]:

size: 8
length: 8
array:

  +---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
  +---+---+---+---+---+---+---+---+

Every time we add an element, it's O(1) time since it's just putting it in a spot.

What happens if we grow the list to [1, 2, 3, 4, 5, 6, 7, 8, 9]? Now we don't have any more room! If we wish to continue growing, we'll have to make a new array and copy the old stuff over, then add it to the list.

  +---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   |   |   |   |   |   |   |
  +-v-+-v-+-v-+-v-+-v-+-v-+-v-+-v-+---+
  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
  +---+---+---+---+---+---+---+---+---+

We just lost our O(1) time for adding to the list. An expensive operation. How can we get it back?

Well, let's make the problem smaller. A list of size one.

   +---+
   | 1 |
   +-+-+

We allocated a spot, put 1 in it. We're still in O(1) time.

Now we add 2 to the list:

   +---+
   | 1 |
   +-+-+
     |
   +-v-+---+
   | 1 | 2 |
   +---+---+

So, we allocated a list of 2 items, then copied 1 over. Then we added another element. The work we did was pretty small; we allocated a new array, copied one element over, and added one element. If we had to do that every time, we'd still be in O(1) time.

Why is this? Well, we messed with (copied over) as many elements as we added. Big-O notation in data structures follows the number of elements in the data structure. We added 1 element, so it's okay if we have to copy over 1 thing to add 1 thing to the list.

So we just need to make sure we add as many elements as we copy over to spread the price of copy-over across all the new elements.

We don't know if we will continue to add elements, of course, but assuming we did (which is kind of what Big-O notation does, it assumes "n gets large") we'd want to make sure that we made space for them.

Thus, we should always make our new array twice as big when we fill the array up and then have to copy over stuff.

Let's apply this back to our array of 8 elements, and adding the number 9.

We might just be adding 9 now and maybe some more later. We don't want to have to keep copying things and we want to keep that nice O(1) time, so we'll pretend that the list is just gonna "keep getting bigger" and allocate space for as many new elements as we already have in the list, thus doubling the array size.

  +---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   |   |   |   |   |   |   |
  +-v-+-v-+-v-+-v-+-v-+-v-+-v-+-v-+---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |   |   |   |   |   |   |   |   |
  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Then we'll add our 9 into the list and update the size and length:

size: 16
length: 9
array:

  +---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   |   |   |   |   |   |   |
  +-v-+-v-+-v-+-v-+-v-+-v-+-v-+-v-+---+---+---+---+---+---+---+---+
  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |   |   |   |   |   |   |   |
  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Now, this isn't actually a O(1) operation. If I add 9, I have to copy the list and then add 9, which is more like O(n) time. It only gets to O(1) time when I keep going and add 7 more elements. This is called "amortized constant time" and it's one of my favorite tricks in Computer Science.

Removing items is trivially O(1) time, since we just have to decrement the length of the array and remove that one element. Memory reclamation on downsizing the array is possible, but is left as an exercise to the reader.

Best of all, we get to keep all the nice O(1) time operations like accessing a random element of the array quickly by index while still sorta having a fast-ish grow operation.

Only sad part is now removing or adding a random item in the list is O(n), since we'd have to all the move stuff after the thing we removed to the left by one spot.

Implementation

Let's start with the basic struct:

(defstruct growable-vector
  size
  length
  array)

Here's a constructor:

(defun growable-vector (&optional (size 16))
  (make-growable-vector :size size
                        :length 0
                        :array (make-array size)))

Now we can make an add operation:

(defun add-to-growable-vector (thing vec)
  (when (= (growable-vector-length vec) (growable-vector-size vec))
    (let ((new-array (make-array (* 2 (growable-vector-size vec)))))
      (loop for i from 0 below (growable-vector-length vec)
        (setf (aref new-array i) (aref (growable-vector-array vec) i)))
      (setf (growable-vector-array vec) new-array))
    (setf (growable-vector-size vec) (* 2 (growable-vector-size vec))))
  (setf (aref (growable-vector-array vec) (growable-vector-length vec)) thing)
  (incf (growable-vector-length vec)))

Here is the remove operation:

(defun remove-from-growable-vector (index vec)
  (unless (< index (growable-vector-length vec))
    (error "Index out of bounds"))
  (let ((thing (aref (growable-vector-array vec) index)))
    (loop for i from index below (- (growable-vector-length vec) 1)
      do
        (setf (aref (growable-vector-array vec) i)
        (aref (growable-vector-array vec) (+ i 1))))
    (decf (growable-vector-length vec))
    thing))

I'd write more, but we're on a whirlwind tour, remember?

Let's turn the key and unlock that achievement, now we have growable vectors. They look like adjustable vectors with fill pointers in Common Lisp. You make a growable vector like this:

(make-array 16 :fill-pointer 0 :adjustable t)

That makes a growable vector of size 16 and length 0. You can treat it like our "big vector" example (the one that doesn't grow but is just kinda big) and add elements to it, or treat it like a growable vector that may copy elements into a new place, via the use of vector-push and vector-push-extend, respectively. You can remove elements from the end in O(1) time via vector-pop. Interestingly, there is no "remove by index" function in Common Lisp, but you can coax delete into it. There's aref for access as per above but I prefer the use of elt for better bounds checking. Finally, if we were actually doing this the hard way and needed to resize a simple array ourselves, we could use adjust-array.

Pros

  1. We have constant time (sort of) to add an element to the end of the list.
  2. Constant time removal from the end of the list.
  3. Constant time random access.
  4. CPU cache locality is preserved, so we can go fast.

Cons

  1. Can't just make trees or whatever like with conses.
  2. Linear time removal from the middle of the list.
  3. Not really constant time to add an element.
  4. Memory usage is as bad as conses in the worst case (with all those empty cells).
  5. Finding something in the list is O(n) time.

Hash Tables

Overview

Now for my favorite one: hash tables.

We want to make a data structure that can efficiently store sets, to get rid of that "finding something in the list takes O(n) time" thing from the last two data structures. We want to make a container that makes finding stuff fast.

We have vectors now, so we should probably start with those. This is because, even if we knew exactly where the element was in the list, we'd still have to fish it out of the list. This takes O(n) time with conses -- getting to the n-th thing in a list -- but constant time with vectors, so that seems like a good fit.

So, here's our empty thing:

size: 8
count: 0

  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+

K, now we want to add something to the container in such a way as to be able to find it again quickly.

Let's add, say, the number 3 to the container. How can we be sure we'll find it quickly? Well, let's start by just putting it in the array at the index 3:

size: 8
count: 1

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |  3  |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+

Love it! Now if someone says "Look for three", we can just go to index 3 and there it is.

What if we want to add 15 to the container? We don't have 16 spots in the array. It seems a shame to make the container bigger just to have a spot for 15 when there's all those spots still empty in our current array.

What if we came up with a "lookup function" for whatever it is we're making? It would map whatever we're looking for to an index of the array.

Let's use a "lookup function" like so:

(defun lookup (thing size)
   (mod thing size))

This just takes the number modulo the size of the array. Best of all, (mod 3 8) is 3, so we don't need to move our first element, we can just add 15.

So, to add 15, that's (mod 15 8) which is 7, so we just put it in the 7th spot.

size: 8
count: 2

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |  3  |     |     |     |  15 |
+-----+-----+-----+-----+-----+-----+-----+-----+

Negative numbers can be added too now! (mod -13 8) would be a positive number less than 8, so we can add -13 to our set also!

Wait, (mod -13 8) is 3. We already have something at spot number 3. What do we do?

Let's give this problem a name. After all, it's probably going to happen a lot. Let's call this problem a collision. Like two cars trying to use the same parking lot stall.

Well, there's a few ways to handle this collision. We could simply use the next spot over, like this:

size: 8
count: 3

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |  3  | -13 |     |     |  15 |
+-----+-----+-----+-----+-----+-----+-----+-----+

Then if we can't find -13 at spot 3, we just keep scanning until we find it.

This way saves us from having to make the array bigger, but now we've just used up another spot in the array, increasing chances of collisions. Still, this is a good and very common way to do this, particularly in languages like C where allocation is expensive and annoying.

But we're not in C! We're in Lisp, and we remember conses now at this point!

We could just make each spot in the container a linked list and add stuff to the list if there's something already there.

size: 8
count: 3

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
| nil | nil | nil |  o  | nil | nil | nil |  o  |
+-----+-----+-----+--+--+-----+-----+-----+--+--+
                     |                       |
                 +---v---+    +-------+  +---v---+
                 | o | o +--->| o | x |  | o | x |
                 +-+-+-+-+    +-+-+---+  +-+-+---+
                   |            |          |
                   v            v          v
                   -13          3          15

Gnarly.

Now we can mostly find stuff in O(1) time, as long as there aren't too many collisions. If there are, we're back to O(n) time.

So if we want our plan of finding stuff quick to work, we really need to avoid collisions.

One way to do this is, again, to make the array bigger. If we had an array of size 17, for example, 15 would be at index 15, -13 would be at index 4, and 3 would be at index 3, which would make for a collision-free parking lot. However, that's like saying "money solves all problems". Yes, but that doesn't help us right now. We can just throw memory at the problem, but we want to still use all those empty spots, not to mention the cost of copying everything over.

Well, maybe another way is to be smart about our lookup function. What if we multiplied everything by 7 and then took the modulo? Well, that would give 5, 5, and 1 as the new indexes. Hmmm. Still a collision.

Turns out this is kind of a hard problem, but it's worth it. We can't just do something simple; it's got to be clever, otherwise we'll just end up with tons of collisions. It's got to really "hash up" the input. Maybe we should rename the lookup function to be a "hash function".

Achievement unlocked: lisp already has a hash function that works on basically anything already for us. It's called sxhash and I'm going to use it because hash functions are hard.

Sometimes two things just really want to collide -- the collision above still happens even with a lookup function of (mod (sxhash thing) 8) -- but mostly it can be controlled by making sure the array is big enough. Conventional wisdom says it should be about 75% full, so as to avoid collisions. If it gets "fuller" than this, we should make an array twice as big, and re-insert the stuff from the old array into the new array, sort of like we did with the growable vector, only we have to re-insert things properly because the lookup function depends on array size. If we just copied over elements, the lookup function for the newly sized array would not be able to find the elements again.

Implementation

The struct:

(defstruct table
  size
  count
  array)

The constructor:

(defun table (&optional (size 8))
  (make-table :size size
              :count 0 :array (make-array size :initial-element nil)))

The hash operation:

(defun hash (thing size)
  (mod (sxhash thing) size))

The lookup operation:

(defun table-lookup (thing table)
  (let* ((index (hash thing (table-size table)))
         (things (aref (table-array table) index)))
    (find thing things)))

Traversal:

(defun table-traverse (table)
  (loop for i below (table-size table)
        for thing = (aref (table-array table) i)
        do
        (loop for j in thing
              do
              (format t "~A " j))))

The add operation:

(defun add-to-base-array (thing array size)
  (let ((index (hash thing size)))
      (setf (aref array index)
            (cons thing (aref array index)))))

(defun add-to-table (thing table)
  (unless (table-lookup thing table)
    (when (>= (/ (+ (table-count table) 1) (table-size table)) 0.75)
      (let* ((new-size (* 2 (table-size table)))
             (new-array (make-array new-size
                                    :initial-element nil)))
          (loop for i below (table-size table)
                for thing = (aref (table-array table) i)
                do
                (loop for j in thing
                      do
                      (add-to-base-array j new-array new-size)))
          (setf (table-array table) new-array)
          (setf (table-size table) new-size)))
      (add-to-base-array thing (table-array table) (table-size table))
      (incf (table-count table))))

The remove operation:

(defun remove-from-table (thing table)
  (let* ((index (hash thing (table-size table)))
         (things (aref (table-array table) index)))
    (setf (aref (table-array table) index)
          (remove thing things)))
    (decf (table-count table))
    thing)

Pros

  1. Finding something in the container is O(1) time.
  2. Adding something to the container is O(1) time.
  3. Removing something from the container is O(1) time.
  4. Can add and remove pairs of things and only look at the hash function of the first thing in the pair, and now you've got a map.

Cons

  1. Memory usage is not perfect.
  2. Traversing the container is in a somewhat random order.
  3. Cache locality isn't necessarily preserved.

Here's what came back to us.

Conclusion

I spun us through a linked list implementation, a growable vector implementation, and a hash table implementation all in Common Lisp. I left lots and lots to the reader. What happens if I want to insert nil into any of these? What if I want to add more than one copy of the same thing to a hash table? Small-collection or other optimizations? etc. I hope the general idea is conveyed, though.