Sorted Lists in Java

This post goes through an implementation a SortedList in Java which ensures its elements are kept in order and guarantees O(log(n)) run time for all the basic operations: get, contains, remove and add. The source code, javadoc, unit tests and class files for this post are available here: scottlogic-utils-mr-1.4.zip.

Sorting is one of the most common operations applied to lists and as such Java has built in mechanisms for doing it, like the Comparable and Comparator interfaces and the Collections.sort methods. These are great when you have a static list that needs to be ordered, but sometimes you want the list to remain sorted after some altering operations have been applied to it (e.g. if you add an element to an ArrayList which you've sorted, chances are that it's no longer in the right order). For some reason, the java.util package is lacking a proper SortedList, and since they're quite handy, I thought I write my own.

AlternativesAs with all data structures, whether a SortedList the right tool for the job depends on how the data is going to be used. The java.util package contains a host of different data structures, all of which have their place, but unfortunately it (at least at the moment) is missing the SortedList. A comparison between some of the the Java's built in data structures and a SortedList is given below:

add(Object elem) remove(Object elem) get(int index) contains(Object elem) multiple equal elements
ArrayList O(1)* O(n) O(1) O(n) YES
LinkedList O(1) O(n) O(n) O(n) YES
TreeSet O(log(n)) O(log(n)) N/A O(log(n)) NO
PriorityQueue O(log(n)) O(n) N/A O(n) YES
SortedList O(log(n)) O(log(n)) O(log(n)) O(log(n)) YES
* - amortized constant time (inserting n objects takes O(n) time).

If you're not likely to change the data structure much, you might want to just use an ArrayList or a regular array and sort it when you need, which can be done relatively quickly (O(n*log(n))), using Java's Collections.sort or Arrays.sort methods respectively. So long as you don't need to sort it too often, there's no problem.

If you're not sure how many times you'll need to sort it, it's quicker to ensure that the data is always kept in order. If you don't need to store multiple equal elements and don't need random access to the data (i.e. you don't need to run get(int index)) then it's probably best to use Java's TreeSet. If you need to store multiple equal elements but don't need random access to the data, or quick contains/remove methods then Java's PriorityQueue might be the way to go. If these cases don't apply, then a SortedList might be what you want.

Implementing a custom List in JavaAll Lists in Java should implement the java.util.List interface, however, rather than implement this directly, the easiest way to start off writing your own List class is have it extend the AbstractList class instead. This class implements the List interface and provides default implementations of the methods, reducing the amount of code you need to write. Most of these default implementations just throws an UnsupportedOperationException, but some are useful. For example, the default listIterator and iterator methods provide you with working iterators once you've provided an implementation for the get(int index) method.

It's also easy to configure the iterators provided by the AbstractList to exhibit fail-fast behaviour. This means that if the list is modified after the iterator has been created, other than through the iterator itself, then calling any method other than hasNext() on the iterator will (hopefully!) cause the cause a ConcurrentModificationException to be thrown, as is standard for all of Java's non-synchronized Collection classes. To do this, you just need to increment the modCount variable whenever you add or remove an element from the list. For example, the add method in the given implementation has the following structure:

public boolean add(T object){
    boolean treeAltered;
    if(object != null){
         //add the object to the list..
         ...
         treeAltered = true;
         modCount++;
    }
    return treeAltered;
}

Here I took the decision not to allow null values to be added in the list, just to simplify the other methods and since on the vast majority of applications this is what you want. If you really need to add a null value you can always just add a special singleton instance of type T which you know represents null instead of null itself. The effect of the calling modCount++; in the add method is that the following code will now throw a ConcurrentModificationException.

SortedList<Integer> list = new SortedList<Integer>();
Iterator<Integer> itr = list.iterator();
list.add(1);
itr.next();

Another thing to consider when writing a custom List class (as with any class!) is how you are going to define its type and the constructors. Since a SortedList needs a way of comparing the elements it stores, I've decided to leave the type definition simple but only supply a constructor which takes a Comparator, this constructor has the following signature:

public SortedList(Comparator<? super T> comparator)

If you're not used to Java generics then this line might look a bit odd! The type definition, "<? super T>>" just says that the given comparator must be capable of comparing elements of type T, which is the type of element that the list is able to store.

This might seem like a weird decision, since Java's built in TreeSet doesn't enforce the same restriction; it also allows a no-argument constructor. The reason is that this no-argument constructor is used to create a TreeSet which orders elements by their natural order; i.e. the order you get if you use the compareTo method on the set's elements. The draw back of this design decision is that there is no way to say that either the elements must be Comparable, or you need to supply a Comparator, so you need to add a runtime check for this (i.e. note the ClassCastException thrown by the TreeSet's add method).

If you need this behaviour with this SortedList, you can you implement a very simple subclass of it which passes in a Comparator providing this natural ordering. An example implementation of this is given in this file: SortedListNatural.java, since it's so short I've also included all the code in it below too:

public class SortedListNatural<T extends Comparable<? super T>> extends SortedList<T> {
	public SortedListNatural(){
		super(new Comparator<T>(){
			public int compare(T one, T two){
				return one.compareTo(two);
			}
		}); 
	}
}

Note that the type definition restricts the class so that only comparable objects can be stored in it; removing the need for any runtime check.

AVL Trees as Sorted ListsIn order to obtain the logarithmic run times for the standard operations, the SortedList needs to be based on some kind of balanced tree structure. I've used an AVL tree, which is pretty much the simplest form of balanced tree you can have (so less chance of mistakes!) and since it ensures the tree remains very balanced (more so than say a Red-Black Tree), it means that that the get and contains methods always run efficiently.

An AVL tree is a binary search tree, which rebalances itself whenever the height of any node's subtree becomes at least two larger than it's other subtree. The rebalancing requires implementing just two methods - the left and the right tree rotations. I won't go into all the details here, but if you're interested, a pretty easy to follow introduction to AVL trees is available at: lecture1, lecture2 and lecture3.

When it comes to actually implementing an AVL tree, the most obvious way to do it in Java is to have an inner Node class to represent the individual positions in the tree, and then have the main class hold a reference to the root Node of the tree. The Node class needs to be defined somewhat recursively, as each Node needs to maintain references to both the children nodes and their parent node. In order to use an AVL tree as a SortedList, you need this Node class to be slightly different than in a standard implementation, the changes required are that:

  1. Allow more than one node to store values that are equal in terms of the given comparator.
  2. Nodes need to remember the total number of children they have

The first alteration is required so that the tree can be used as a List rather than as a Set and the second in order to implement the get(int index) method efficiently.

The start of the Node class in the implementation looks like this:

private int NEXT_NODE_ID = Integer.MIN_VALUE;

private class Node implements Comparable<Node> {
    final int id = NEXT_NODE_ID++; //get the id and increment it..
    T value; //the data value being stored at this node 
      
    Node leftChild;
    Node rightChild;
    Node parent;

    //The "cached" values used to speed up methods..
    int height;
    int numChildren; 

    //Compares the t values using the comparator, if they are equal it uses the
    //node it - older nodes considered to be smaller..
    public int compareTo(Node other){
        int comparison = comparator.compare(value, other.value);
        return (comparison == 0) ? (id-other.id) : comparison;
    }
    ...

As the Node class is an inner class there is no need to "parameterise" the type definition, it automatically inherits the definition of the T from the SortedList parent class. The list allows multiple values to be stored by giving each node a unique id, which is incremented as each element is added. The Node's compareTo method then uses this when comparing values, so that nodes with the same value according to the comparator, are distinguished by their unique id. The height and numChildren fields are really just cached values, since their values could be obtained by examining the child nodes. It's up to the implementation to ensure that these values are maintained as changes are made to the tree. In the given implementation, this is all done in the updateCachedValues method of the Node class:

private void updateCachedValues(){
	Node current = this;
	while(current != null){
		if(current.isLeaf()){
			current.height = 0;
			current.numChildren = 0;
		} else {
			//deal with the height..
			int leftTreeHeight = (current.leftChild == null) ? 0 : current.leftChild.height;
			int rightTreeHeight = (current.rightChild == null) ? 0 : current.rightChild.height;
			current.height = 1 + Math.max(leftTreeHeight, rightTreeHeight);
			
			//deal with the number of children..
			int leftTreeSize = (current.leftChild == null) ? 0 : current.leftChild.sizeOfSubTree();
			int rightTreeSize = (current.rightChild == null) ? 0 : current.rightChild.sizeOfSubTree();                   
			current.numChildren = leftTreeSize + rightTreeSize;
		}
	   //propagate up the tree.. 
	   current = current.parent;
	}
}

So long as this method is called on the appropriate node each time the tree is structurally altered, the values will remain correct. It's not always obvious which node constitutes the "appropriate one", but it should always be the node which was altered with the lowest height in the resulting tree (it's always the case that there is one such node).

The only key method that is missing from a standard AVL tree implementation that is required to make it work as a List is the get(int index) method. As I mentioned before, this method is going to make use of the numChildren field of the Node class to so that it can be implemented efficiently. Once this field is in place, it's not difficult to implement - the method just needs to traverse the tree, making sure that it remembers how many smaller values there are than those at the current node; this effectively tells you the index of the first value stored at the current node. In the provided implementation, the code look like this:

@Override
public T get(int index){
	return findNodeAtIndex(index).value;
}

private Node findNodeAtIndex(int index){
	if(index < 0 || index >= size()){ 
		throw new IllegalArgumentException(index + " is not valid index.");
	}
	Node current = root;
	//the the number of smaller elements of the current node as we traverse the tree..
	int totalSmallerElements = (current.leftChild == null) ? 0 : current.leftChild.sizeOfSubTree();
	while(current!= null){  //should always break, due to constraint above..
		if(totalSmallerElements == index){
			break;
		}
		if(totalSmallerElements > index){ //go left..
			current = current.leftChild;
			totalSmallerElements--;
			totalSmallerElements -= (current.rightChild == null) ? 0 : current.rightChild.sizeOfSubTree();
		} else { //go right.. 
			totalSmallerElements++;
			current = current.rightChild;
			totalSmallerElements += (current.leftChild == null) ? 0 : current.leftChild.sizeOfSubTree();
		}
	}
	return current;
}

Here the sizeOfSubTree method just returns one plus the number of children values of the node. The totalSmallerElements variable effectively stores the index of the current node is maintained in lines as the tree is traversed.

Doing without recursionYou might have noticed that the code so far has been iterative rather than recursive. Generally, most operations involving trees are written using recursion, but since iterative solutions tend to be quicker, I've stuck to using iteration throughout the implementation (the only exception is with the structurallyEqualTo method which is just there for testing). For methods where you just need to traverse the tree, like the get or contains methods, turning it from a recursive method to a iterative one is just a case of adding a while loop and keeping a reference to the current Node that you're looking at. For example, you go from something like:

void method(){
    if(/*some condition holds for this node*/){
       return this;
    } else if(/*need to traverse left*/){
        return leftChild.method();
    } else {  //need to traverse right..
        return rightChild.method();
    }
}

to something like:

void method(){
    Node currentPosition = this; 
    while(currentPosition != null){
        if(/*some condition holds for currentPosition*/){
            return currentPosition;
        } else if(/*need to traverse left*/){
            currentPosition = leftChild;
        } else {  //need to traverse right..
            currentPosition = rightChild;
        }
}

The only difficulty comes when the method needs to go back to nodes that have previously been visited, (i.e. those that can't be written with just simple tail recursion). For instance, if you want to print all the elements in the tree in order; with recursion this is just a few lines:

void printAll(){
    if(leftChild != null){
        leftChild.printAll();
    }
    printValues(); //prints the values at this node..
    if(rightChild != null){
        rightChild.printAll();
    }
}

This could then be invoked on the root node to print the whole tree. It's really not obvious how to do this without using recursion! To overcome this, the Node class in the implementation defines a couple of handy iterative methods - the smallestNodeInSubTree, which finds the smallest node in the sub-tree rooted at the node and successor, which find the next largest node in the tree (so returns null for the node storing the largest values in the tree). They are defined like this:

public Node smallestNodeInSubTree(){ 
     Node current = this;
     while(current != null){
         if(current.leftChild == null){
             break;
         } else {
            current = current.leftChild;
        }
    }
    return current;
}
 
public Node successor(){
   Node successor = null;
   if(rightChild != null){
	   successor = rightChild.smallestNodeInSubTree();
   } else if(parent != null){
	   Node current = this;
	   while(current != null && current.isRightChildOfParent()){
		   current = current.parent;
	   }
	   successor = current.parent;
   }
   return successor;
}

With these in place, you could write an iterative version of the printAll method like this:

void printAll(){
    Node current = this.smallestNodeInSubTree();
    while(current != null){
        current.printValues(); //prints the values at the current node..
        current = current.successor();
    }
}

Unit TestsI always find that when writing code like this, it's really easy to make a mistake, so to build up some confidence in it I wrote some junit tests for the SortedList class which are included in the download. If you find a problem with it that's not covered by them please let me know.

Mark Rhodes

blog comments powered by Disqus