Sunday, January 3, 2016

Computational complexity of a .size() method of set and map views in Java

Even though I’ve been programming in Java for many years, there are still some small bits that are new to me. Today I will write about the time complexity of a .size() method of set and map views in Java (based on OpenJDK implementation, which, in fact is a reference implementation).

In Java SE, we have an interface called SortedSet which according to the documentation is:
“A Set that further provides a total ordering on its elements. (...) Several additional operations are provided to take advantage of the ordering.”
Those additional operations let us get a view of some subset of the current set. There are 3 such methods available which names are self explanatory:
SortedSet headSet(E toElement)
SortedSet subSet(E fromElement, E toElement)
SortedSet tailSet(E fromElement)
An important part here is that the returned set is backed by the original set so changes to either of them are reflected in the other one as well.

So what about the time complexity of the .size() method of such returned set? Well, we may think it is the same as the .size() of a TreeSet, one implementation of the SortedSet interface, which is constant (O(1)):
public int size() {
       return size;
}
Source 

It turns out that in OpenJDK implementation that’s not the case. The time complexity of .size() method in such view set is linear to the number of elements in it (O(n)). The details can be found in the EntrySetView class:
abstract class EntrySetView extends AbstractSet> {
           private transient int size = -1, sizeModCount;

           public int size() {
               if (fromStart && toEnd)
                   return m.size();
               if (size == -1 || sizeModCount != m.modCount) {
                   sizeModCount = m.modCount;
                   size = 0;
                   Iterator i = iterator();
                   while (i.hasNext()) {
                       size++;
                       i.next();
                   }
               }
               return size;
           }
...
}
Source

So what is happening here? First condition checks if a view spans across entire original set and if true  delegates the call to its .size().
Later we check whether the size of the view hasn’t been calculated before or if the number of structural modifications in a view tree is different than in the original tree. If either of these conditions is true then we iterate over all the set elements and count them. Obviously that operation takes linear time to the number of elements.

How this could be solved? One possible solution would be to keep views and original sets linked to each other via references. I.e. a view set would maintain a reference to the set from which it has been derived and the original set would maintain references to all its views. Then a change in it or in any of the views could be propagated.
That would have solved the problem of the .size() time complexity but it would introduce a memory footprint and also affect the computational complexity of add/remove operations. That’s because we can have more than O(log n) unique views of a given set - in fact we can have n such views so now the add/remove operations would be O(n) as the size of all the views would have to be updated.

All in all, different approaches have different trade offs and the one chosen in OpenJDK seems sensible as the add/remove might be more common operation that .size() on a view and has smaller memory footprint than the solution depicted above but it’s still good to be aware of such details when developing high throughput and low latency applications.

UPDATE

After a while, I got curious about this case in different languages. I have decided to check C# as it has a class with similar functionality and in addition .NET Core has been open sourced so we can investigate the official code.
It turns out that they took a slightly different approach at solving this problem and ensure that getting a size of the view is a constant time operation when modifying it (but not when modifying the original set).

In C# we have a class called SortedSet which has a following method:
public virtual SortedSet GetViewBetween(
 T lowerValue,
 T upperValue
)
which corresponds to the subSet in Java SortedSet interface.

Source

In order to get a size of such SortedSet in C# we need to access a property, called "Count". According to the documentation retrievial of this value is an O(1) operation.

Source

If we dive into the code of SortedSet class we will quickly see that GetViewBetween() returns a subclass of SortedSet called TreeSubSet which only modifies the "count" variable and the "Count" property is unchanged from its super class SortedSet:
        public int Count {
            get {
                VersionCheck();
                return count;
            }
        }
Source

Now we should have a look at how the VersionCheck() is implemented in TreeSubSet:
            internal override void VersionCheck() {
                VersionCheckImpl();
            }
 
            private void VersionCheckImpl() {
                Debug.Assert(underlying != null, "Underlying set no longer exists");
                if (this.version != underlying.version) {
                    this.root = underlying.FindRange(min, max, lBoundActive, uBoundActive);
                    this.version = underlying.version;
                    count = 0;
                    InOrderTreeWalk(delegate(Node n) { count++; return true; });
                }
            }
Source

The key point here is the check if the version of a subset is different than the underlying set. The version of a set changes on any structural mutation of a set (add/remove operations). If the versions are different then this operation gets more expensive. First we find a new subset with FindRange which is O(log n) operation where n is the number of elements in the original set - thanks to Red-Black tree implementation - and then we perform an In-Order traversal which is O(n) operation where n is the size of the subset.

When the versions will be different? The modifying operations of TreeSubSet call already VersionCheck() therefore the versions will be the same so a consecutive call to Count will be a constant time operation.

Here is an example for add operation:
            internal override bool AddIfNotPresent(T item) {
 
                if (!IsWithinRange(item)) {
                    ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.collection);
                }
 
                bool ret = underlying.AddIfNotPresent(item);
                VersionCheck();
#if DEBUG
                Debug.Assert(this.versionUpToDate() && this.root == this.underlying.FindRange(min, max));
#endif
 
                return ret;
            }
Source

There is a possibility to optimize this method - to avoid O(n) (n - number of elements in the subset) counting in VersionCheck() we can update the count variable here right away - if ret is true we simply increase it by 1 as we already checked if a new value is in the subset range. Similar implementation is for a removal operation. Maybe I will submit a push request on github for that :)

Now, if we modify the original set, its version will also change. Then a consecutive call to its view's Count property will detect a version change so its expensive branch will be executed.

To summarize, the Java and C# implementations of subset size computation are very similar in terms of time complexity. Even though, in C# when modifying a subset we get a constant size operation, then the add/remove operations get more expensive. So it is all about shifting the additional cost of counting the elements from one method to another.
The shift in Java though looks more reasonable to me, especially for a case of massive modifications of a view, without a need to know its size in between. In C#, the size will be recalculated after each modification.