Sunday, March 10, 2013

Peterson’s locking algorithm in Java

Recently I started reading a new book - “The Art of Multiprocessor Programming”. At the very beginning the author brings up the topic of lock implementation and presents Peterson’s algorithm - an elegant two-thread mutual exclusion algorithm that can be generalized to n threads, but let’s start slowly. After a short description, which can also be found on Wikipedia, an implementation follows:
class Peterson implements Lock {
 private boolean[] flag = new boolean[2];
 private int victim;

 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (flag[j] && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
 }
}
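The ThreadID.get() helper used above comes from the book’s companion code rather than the JDK. A minimal sketch of such a helper (my own assumption of how it might work: ids 0, 1, ... handed out on a first-come, first-served basis and cached per thread) could be:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the book's ThreadID helper: assigns each
// calling thread a small id (0, 1, ...) on first use and caches it in
// a ThreadLocal, so repeated calls from the same thread agree.
class ThreadID {
    private static final AtomicInteger nextId = new AtomicInteger(0);
    private static final ThreadLocal<Integer> id = new ThreadLocal<Integer>() {
        @Override
        protected Integer initialValue() {
            return nextId.getAndIncrement();
        }
    };

    public static int get() {
        return id.get();
    }
}
```

For the two-thread Peterson lock the important property is only that the two competing threads end up with ids 0 and 1.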
Now, if we take that as a sample Java implementation, it is seriously flawed. We have two problems here: first, a data race (mentioned in my previous post); second, the possibility of instruction reordering, since no happens-before relationship is established between the threads in this case.

How can we solve it? There are two possible ways:

1) Mark each shared variable as volatile. But that is actually insufficient in the case of the ‘flag’ array, since declaring it volatile only ensures that the reference value will be visible to other threads. The values of the array elements may still land only in a CPU’s cache and never be flushed to main memory. To fix that we have two options: we can use either an AtomicBoolean[] or an AtomicIntegerArray (as there is no AtomicBooleanArray), assuming 0/1 for false/true. In the former case we need an object per array element; in the latter, only the AtomicIntegerArray object and its backing array.
Here is the code using AtomicIntegerArray:
class Peterson1 implements Lock {
 private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag.set(i, 1);
  victim = i;
  while(flag.get(j) == 1 && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag.set(i, 0);
 }
}
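For comparison, the AtomicBoolean[] variant mentioned above might look like this. To keep the example self-contained, the thread’s index (0 or 1) is passed in explicitly instead of coming from the book’s ThreadID helper:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the AtomicBoolean[] variant: one AtomicBoolean object per
// array element. The thread index i (0 or 1) is passed in explicitly
// here, a deviation from the book's Lock interface.
class PetersonAtomicBoolean {
    private final AtomicBoolean[] flag =
            { new AtomicBoolean(false), new AtomicBoolean(false) };
    private volatile int victim;

    public void lock(int i) {
        int j = 1 - i;
        flag[i].set(true);   // volatile-write semantics: visible to the other thread
        victim = i;
        while (flag[j].get() && victim == i) {} // spin until it is our turn
    }

    public void unlock(int i) {
        flag[i].set(false);  // volatile-write semantics: releases the lock
    }
}
```

Here every read and write of a flag element goes through AtomicBoolean, so the visibility problem of the plain boolean[] array disappears, at the cost of two extra objects.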
2) Another approach is to use the technique that I have described in my article "Flushing with a volatile". In the code above (the Peterson class) we can declare ‘victim’ as volatile. That makes writes to the victim variable visible to other threads. Now what about visibility of writes to the ‘flag’ variable? If we ensure that each write to ‘flag’ is followed by a write to the volatile ‘victim’, then when a thread sees an up-to-date ‘victim’ value, it will also see an up-to-date ‘flag’ value.
There are two places where a write to ‘flag’ happens:
In lock() method:
flag[i] = true;
victim = i;
Here the write to ‘flag’ is followed by a write to the volatile ‘victim’, so all is good. Then in the unlock() method we have:
int i = ThreadID.get();
flag[i] = false;
If we leave the unlock() method as it is, the write of a false value to the ‘flag’ variable may not be seen by the other thread. To ensure the visibility, we should add a write to the ‘victim’ variable:
victim = 1 - i; // (this is the current victim value anyway)

But that’s not all. What about reading the ‘flag’ variable? It must be preceded by a read from a volatile field. In the lock() method we have the only read from the ‘flag’ variable:
while (flag[j] && victim == i) {};
To take advantage of the Java memory model and make the ‘flag’ value visible, we have to reorder the above condition to:
while (victim == i && flag[j]) {};
Here is the whole code:
class Peterson2 implements Lock {
 private final boolean[] flag = new boolean[2];
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (victim == i && flag[j]) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
  victim = 1 - i;
 }
} 
What about a microbenchmark? I have measured the execution of the following task:
class IncrementingTask implements Runnable {
 Lock lock;
 int a;
 IncrementingTask(Lock lock) {
  this.lock = lock;
 }
 @Override
 public void run() {
  for(int i = 0; i < 50000000; i++) { // 5*10^7 iterations when 2 threads execute this task, or 10^8 for 1 thread, so that the total amount of work and the end result are the same: the ‘a’ field must end up equal to 10^8
   lock.lock();
   a++;
   lock.unlock();
  }
 }
} 
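The measurement code itself is not shown; a minimal harness for timing such a task might look like this (class and method names are my own):

```java
// Hypothetical timing harness: runs the same Runnable on nThreads
// threads and returns the wall-clock time in nanoseconds. This is a
// single timed run with no warm-up; a serious microbenchmark would
// also account for JIT warm-up (e.g. by using a harness like JMH).
class BenchmarkHarness {
    static long time(Runnable task, int nThreads) throws InterruptedException {
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(task);
        }
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        return System.nanoTime() - start;
    }
}
```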
with the Peterson1 and Peterson2 implementations, each with one and with two threads executing it, on:

$ cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l
12
$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
$ cat /proc/version
Linux version 2.6.32-33-server (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #72-Ubuntu SMP Fri Jul 29 21:21:55 UTC 2011

Below are the average times (ns) for each case from 10 runs:
             1 thread      2 threads
Peterson1    1840693030    42213715991
Peterson2    1836106908    35781376968
I find the results quite interesting. First off, when there is no contention (the “1 thread” column) the performance of both implementations is very similar, with Peterson2 slightly ahead. When we introduce the second thread and thus increase the contention, Peterson2 is significantly (about 15%) faster. But what surprises me the most is the difference between executing the task on 1 versus 2 threads! One thread does the job more than 20 times faster than two threads! We can see clearly how expensive it is to coordinate 2 threads when they want to access a shared variable. And what if we use the Java synchronized mechanism and an intrinsic object lock? Then the average time (again over 10 runs) for 2 threads is 2678088766 ns. That is way better than our lock implementations, but still slower than 1 thread with the Peterson lock acquisition overhead.
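The post does not show how the intrinsic lock was plugged into the benchmark; a sketch of a synchronized variant of IncrementingTask might look as follows (the iteration count is made a constructor parameter here, my change, so the class also works for smaller runs):

```java
// Sketch of a synchronized variant of IncrementingTask: the increment
// is guarded by an intrinsic object lock instead of a Peterson lock.
// The iteration count is a constructor parameter (the original task
// hard-codes 50000000).
class SyncIncrementingTask implements Runnable {
    private final Object monitor = new Object();
    private final int iterations;
    int a;

    SyncIncrementingTask(int iterations) {
        this.iterations = iterations;
    }

    @Override
    public void run() {
        for (int i = 0; i < iterations; i++) {
            synchronized (monitor) { // intrinsic lock guards the shared counter
                a++;
            }
        }
    }
}
```

Unlike the Peterson implementations, this version is not limited to two threads; any number of threads can share one task instance.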

Ok, so that’s it for today. Hope you enjoyed the article and feel free to leave a comment!