Sunday, March 10, 2013

Peterson’s locking algorithm in Java

Recently I started reading a new book - “The Art of Multiprocessor Programming”. At the very beginning the author brings up a topic of locks’ implementation and presents Peterson’s algorithm - an elegant two-thread mutual exclusion algorithm which can be generalized for n threads but lets start slow. After a short description, that can be found on the wikipedia as well, an implementation follows:
class Peterson implements Lock {
 private boolean[] flag = new boolean[2];
 private int victim;

 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (flag[j] && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
 }
}
Now, if we take that as a sample Java implementation then it is seriously flawed. We have 2 problems here. First is a data race (mentioned in my previous post) and a possibility of instructions reordering as there is no happens-before relationship established between different threads in this case.

How we can solve it? We have two possible ways:

1) Mark each shared variable as a volatile. But that is actually insufficient in the case of the ‘flag’ array since declaring it as a volatile will only ensure that the reference value will be visible to other threads. The values of the array may still land only in the CPUs cache and not be flushed to the main memory. To fix that we have 2 options. We can use either AtomicBoolean[] or AtomicIntegerArray (as there is no AtomicBooleanArray) and assume 0/1 for false/true. In the former case we will need an object per each array element and in the latter case only an object for AtomicIntegerArray and array object.
Here is a code using AtomicIntegerArray:
class Peterson1 implements Lock {
 private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag.set(i, 1);
  victim = i;
  while(flag.get(j) == 1 && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag.set(i, 0);
 }
}
2) Another approach is to use the technique that I have described in my article "Flushing with a volatile". In the code above (Peterson class) we can declare ‘victim’ as a volatile. It makes writes to the victim variable visible to other threads. Now what about visibility of writes to a ‘flag’ variable? If we ensure that each write to to the ‘flag’ is followed by a write to the volatile ‘victim’, then when a thread sees an up to date ‘victim’ value, it will also see an up to date ‘flag’ value.
There are 2 places where a write to a 'flag' happens:
In lock() method:
flag[i] = true;
victim = i;
Here a write to a flag is followed by a write to a volatile victim so all is good. Then in unlock() method we have:
int i = ThreadID.get();
flag[i] = false;
If we let the unlock() method as it is, the write of a false value to the ‘flag’ variable may not be seen by the other thread. To ensure the visibility, we should add a write to a ‘victim’ variable:
victim = 1 - i; // (this is the current victim value anyway)

But that’s not all. What about reading the ‘flag’ variable? It must be preceded by a read from a volatile field. In the lock() method we have the only read from the ‘flag’ variable:
while (flag[j] && victim == i) {};
To take the advantage of the Java memory model and to make the ‘flag’ value visible, we have to reorder the above condition to:
while (victim == i && flag[j]) {};
Here is the whole code:
class Peterson2 implements Lock {
 private final boolean[] flag = new boolean[2];
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (victim == i && flag[j]) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
  victim = 1 - i;
 }
} 
What about a microbenchmark? I have measured the execution of such task:
class IncrementingTask implements Runnable {
 Lock lock;
 int a;
 IncrementingTask(Lock lock) {
  this.lock = lock;
 }
 @Override
 public void run() {
  for(int i = 0; i < 50000000; i++) { // when 2 threads execute this task or 10^8 for 1 thread so that the amount of work and the end result are the same. The ‘a’ field must be equal to 10^8.
   lock.lock();
   a++;
   lock.unlock();
  }
 }
} 
with Peterson1 and Peterson2 implementations, both with only one and two threads executing it, on:

$ cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l
12
$ java -version java version "1.6.0_26" Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
$ cat /proc/version Linux version 2.6.32-33-server (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #72-Ubuntu SMP Fri Jul 29 21:21:55 UTC 2011 

Below are the average times (ns) for each case from 10 runs:
                   1 thread    2 threads
Peterson1    1840693030    42213715991
Peterson2    1836106908    35781376968
I find the results quiet interesting. First off, when there is no contention (column “1 thread”) the performance of both implementations is very similar with the Peterson2 being slightly ahead. When we introduce the second thread and thus increase the contention, the Peterson2 is significantly (15%) faster. But what surprises the most is the difference between the execution of the task on 1 versus 2 threads! 1 thread does the job more than 20 times faster than 2 threads! We can see clearly how expensive it is to coordinate 2 threads when they want to access the shared variable. And what if we use a Java synchronized mechanism and an intrinsic object lock? Then the average time (again from 10 runs) for 2 threads will be 2678088766 ns. It is way better than our lock implemenation but still it is slower than 1 thread with the Peterson lock acquisition overhead.

Ok, so that’s it for today. Hope you enjoyed the article and feel free to leave a comment!

Thursday, February 28, 2013

Flushing with a volatile!

Core Java provides different ways of ensuring visibility of actions on memory performed by one thread to other threads. You probably can think of synchronization, marking a variable volatile or using a thread safe collection from java.util.concurrent package.

Today we will explore another, less obvious approach. We will use a volatile variable to ensure visibility of another non-volatile variable. Lets start with a simple but flawed example:
class BadTask implements Runnable {
 boolean keepRunning = true;
 
 @Override
 public void run() {
  while(keepRunning) {
  }
  System.out.println("Done.");
 }
}

public class VolatileExp {
 public static void main(String[] args) throws InterruptedException {
  BadTask r = new BadTask();
  new Thread(r).start();
  Thread.sleep(1000);
  r.keepRunning = false;
  System.out.println("keepRunning is false");
 }
}
The intention of this code is to let BadTask run for 1 second and after that to stop it by setting keepRunning boolean to false. As simple as it may look this code is doomed to fail - the BadTask won’t stop after 1 second and will run until you terminate the program manually. If it works fine in your environment try a different one. In my case the above code fails constantly on the following:
$ cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l
12
$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
$ cat /proc/version
Linux version 2.6.32-33-server (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #72-Ubuntu SMP Fri Jul 29 21:21:55 UTC 2011
If the program does not stop you may wonder what happend. In short - the main thread and the thread running BadTask have been executed on different cores. Each core has its own set of registers and caches. The new value of keepRunning has been written to one of these without being flushed to the main memory. Thus it is not visible to the code running on a different core.

Ok, how we can fix it? The simplest and the most correct way is to mark this variable volatile. Another approach would be to acquire a common lock when accessing it but that would be definetly an overkill.

So what we will do today? We will introduce another variable marked with a volatile keyword! In the above code it does not make much sense and is only for demonstrating some aspects of Java memory model. But think about a scenario where there are more variables of keepRunning nature. Have a look at the below code that does not have visibility problem anymore:
class BadTask implements Runnable {
        boolean keepRunning = true;
        volatile boolean flush = true;

        @Override
        public void run() {
                while(keepRunning) {
                        if(flush);
                }
                System.out.println("BadTask is done.");
        }
}

public class VolatileExp extends Thread {

        public static void main(String[] args) throws InterruptedException {
                BadTask r = new BadTask();
                new Thread(r).start();
                Thread.sleep(1000);
                r.keepRunning = false;
                r.flush = false;
                System.out.println("keepRunning is false");
        }
}
So as already mentioned we have introduced a new volatile variable “flush”. We do two things with it. First, we do a write operation in the main thread, right after modifying a non-volatile keepRunning variable. Second, in the thread running BadTask, we do a read operation on it.
Now, how come the value of keepRunning is flushed to the main memory? This is guaranteded by the current Java memory model. According to JSR133 “writing to a volatile field has the same memory effect as a monitor release, and reading from a volatile field has the same memory effect as a monitor acquire”. Thus, actions on memory done by one thread before writing to a volatile variable will be visible to another thread after reading that variable.

This is an advanced technique which should be used sparingly only when the performance has the highest priority. If you are looking for a real life adaptation of it, you can have a look at the ConcurrentHashMap from java.util.concurrent package.