Sunday, March 10, 2013

Peterson’s locking algorithm in Java

Recently I started reading a new book, “The Art of Multiprocessor Programming”. At the very beginning the authors bring up the topic of lock implementation and present Peterson’s algorithm, an elegant two-thread mutual exclusion algorithm which can be generalized to n threads, but let’s start slow. After a short description, which can also be found on Wikipedia, an implementation follows:
class Peterson implements Lock {
 private boolean[] flag = new boolean[2];
 private int victim;

 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (flag[j] && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
 }
}
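The Lock interface and the ThreadID helper come from the book and are not part of the JDK; ThreadID assigns each of the two competing threads an index, 0 or 1. A minimal sketch of it (my own, using a ThreadLocal backed by a counter) could look like this:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical ThreadID sketch (not the book's exact code): hands out
// consecutive indices, so the first two threads to call get() see 0 and 1.
class ThreadID {
 private static final AtomicInteger nextId = new AtomicInteger(0);
 private static final ThreadLocal<Integer> id = new ThreadLocal<Integer>() {
  @Override
  protected Integer initialValue() {
   return nextId.getAndIncrement();
  }
 };

 public static int get() {
  return id.get();
 }
}
```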
Now, if we take that as a sample Java implementation, it is seriously flawed. We have two problems here: a data race (mentioned in my previous post) and the possibility of instruction reordering, as no happens-before relationship is established between the threads.

How can we solve it? We have two possible ways:

1) Mark each shared variable as volatile. That alone is insufficient in the case of the ‘flag’ array, since declaring the array volatile only ensures that the reference value is visible to other threads; the array elements themselves are still accessed with plain, non-volatile semantics, so a write to an element may never be guaranteed visible to the other thread. To fix that we have two options: we can use either AtomicBoolean[] or AtomicIntegerArray (as there is no AtomicBooleanArray) and assume 0/1 for false/true. In the former case we need an object per array element; in the latter, only the AtomicIntegerArray object and its backing array.
Here is a code using AtomicIntegerArray:
class Peterson1 implements Lock {
 private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag.set(i, 1);
  victim = i;
  while(flag.get(j) == 1 && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag.set(i, 0);
 }
}
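For comparison, the AtomicBoolean[] variant mentioned above (one extra object per array element) could look like the following sketch; to keep it self-contained I pass the thread index (0 or 1) in explicitly instead of calling the book's ThreadID.get():

```java
import java.util.concurrent.atomic.AtomicBoolean;

class PetersonAB {
 // One AtomicBoolean per slot: each element is then read and written
 // with volatile semantics, unlike the elements of a plain boolean[].
 private final AtomicBoolean[] flag =
   new AtomicBoolean[] { new AtomicBoolean(false), new AtomicBoolean(false) };
 private volatile int victim;

 public void lock(int i) {
  int j = 1 - i;
  flag[i].set(true);
  victim = i;
  while (flag[j].get() && victim == i) {} // spin until it is our turn
 }

 public void unlock(int i) {
  flag[i].set(false);
 }
}
```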
2) Another approach is to use the technique that I described in my article "Flushing with a volatile". In the code above (the Peterson class) we can declare ‘victim’ volatile, which makes writes to it visible to other threads. Now, what about the visibility of writes to the ‘flag’ variable? If we ensure that each write to ‘flag’ is followed by a write to the volatile ‘victim’, then when a thread sees an up-to-date ‘victim’ value, it will also see an up-to-date ‘flag’ value.
There are 2 places where a write to a 'flag' happens:
In lock() method:
flag[i] = true;
victim = i;
Here a write to the flag is followed by a write to the volatile victim, so all is good. Then in the unlock() method we have:
int i = ThreadID.get();
flag[i] = false;
If we leave the unlock() method as it is, the write of the false value to ‘flag’ may not be seen by the other thread. To ensure visibility, we should add a write to the ‘victim’ variable:
victim = 1 - i; // (this is the current victim value anyway)

But that’s not all. What about reading the ‘flag’ variable? It must be preceded by a read of a volatile field. In the lock() method we have the only read of the ‘flag’ variable:
while (flag[j] && victim == i) {};
To take advantage of the Java memory model and make the ‘flag’ value visible, we have to reorder the above condition to:
while (victim == i && flag[j]) {};
Here is the whole code:
class Peterson2 implements Lock {
 private final boolean[] flag = new boolean[2];
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (victim == i && flag[j]) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
  victim = 1 - i;
 }
} 
What about a microbenchmark? I have measured the execution of the following task:
class IncrementingTask implements Runnable {
 Lock lock;
 int a;
 IncrementingTask(Lock lock) {
  this.lock = lock;
 }
 @Override
 public void run() {
  for(int i = 0; i < 50000000; i++) { // 5*10^7 iterations when 2 threads execute this task, or 10^8 for 1 thread, so that the total amount of work and the end result are the same: the ‘a’ field must end up equal to 10^8.
   lock.lock();
   a++;
   lock.unlock();
  }
 }
} 
with the Peterson1 and Peterson2 implementations, each with one and with two threads executing it, on the following machine:

$ cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l
12
$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
$ cat /proc/version
Linux version 2.6.32-33-server (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #72-Ubuntu SMP Fri Jul 29 21:21:55 UTC 2011 
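The measurement harness itself is not shown above; a hypothetical driver along these lines (the class and method names are my own) would produce comparable numbers:

```java
// Hypothetical benchmark driver, not my exact harness: times the
// incrementing task from the post on a given number of threads.
class Harness {
 interface Lock { void lock(); void unlock(); } // the post's Lock interface

 static int a; // shared counter, guarded by the lock under test

 static long timeNs(final Lock lock, int threads, final int itersPerThread) {
  a = 0;
  Runnable task = new Runnable() {
   public void run() {
    for (int k = 0; k < itersPerThread; k++) {
     lock.lock();
     a++;
     lock.unlock();
    }
   }
  };
  Thread[] ts = new Thread[threads];
  for (int i = 0; i < threads; i++) ts[i] = new Thread(task);
  long start = System.nanoTime();
  for (Thread t : ts) t.start();
  try {
   for (Thread t : ts) t.join(); // join also makes 'a' safely readable here
  } catch (InterruptedException e) {
   throw new RuntimeException(e);
  }
  return System.nanoTime() - start;
 }
}
```

For the two-thread Peterson1 case this would be called as timeNs(lock, 2, 50000000) and the elapsed times averaged over 10 runs.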

Below are the average times (ns) for each case from 10 runs:
                   1 thread    2 threads
Peterson1    1840693030    42213715991
Peterson2    1836106908    35781376968
I find the results quite interesting. First off, when there is no contention (column “1 thread”) the performance of both implementations is very similar, with Peterson2 slightly ahead. When we introduce the second thread and thus increase the contention, Peterson2 is significantly (about 15%) faster. But what surprises me most is the difference between executing the task on 1 versus 2 threads: 1 thread does the job about 20 times faster than 2 threads! We can see clearly how expensive it is to coordinate 2 threads when they want to access a shared variable. And what if we use the Java synchronized mechanism and an intrinsic object lock? Then the average time (again from 10 runs) for 2 threads is 2678088766 ns. That is way better than our lock implementations, but still slower than 1 thread paying the Peterson lock acquisition overhead.
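The synchronized variant is not shown above; one plausible way to put an intrinsic object lock behind the same lock()/unlock() shape (my own sketch, using wait/notify so that acquire and release can live in separate methods) is:

```java
// Sketch of a lock backed by an intrinsic monitor; a guess at the shape
// of the synchronized variant used for the comparison, not the exact code.
class MonitorLock {
 private boolean held = false;

 public synchronized void lock() {
  while (held) {
   try {
    wait(); // block until the current holder releases
   } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
   }
  }
  held = true;
 }

 public synchronized void unlock() {
  held = false;
  notify(); // wake one waiting thread, if any
 }
}
```

Unlike the spinning Peterson locks, a blocked thread here sleeps inside wait() instead of burning CPU, which is part of why it fares better under contention.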

Ok, so that’s it for today. Hope you enjoyed the article and feel free to leave a comment!

8 comments:

  1. This code is dangerously broken and should never be used.

    Volatile reads and writes in Java are equivalent to monitor acquire and monitor release, also known as read acquire and write release.

    These semantics, which are roughly equivalent to preventing reads from passing reads, and writes from passing writes, are not sufficient for Dekker's or Peterson's algorithm. To make these algorithms robust you need a full barrier to prevent reads from passing earlier writes. Such a barrier is more expensive, which is why it has to be specifically requested.

    Lockless programming is very dangerous. Any discussion of lockless programming which doesn't discuss CPU reordering and the necessary barriers to prevent it should be assumed to be incorrect.

    See these resources for more information:
    http://preshing.com/20120515/memory-reordering-caught-in-the-act

    http://msdn.microsoft.com/en-us/library/windows/desktop/ee418650(v=vs.85).aspx

    ReplyDelete
    Replies
    1. Hi Bruce! Thanks for the comment. I assume you are talking about the Peterson2 class. Could you point out what exactly is wrong there? I see you are concerned about the instructions reordering. That could have been a problem prior to Java 1.5 but JSR 133 fixes it (http://www.jcp.org/en/jsr/detail?id=133). It is no longer possible to reorder normal field accesses around volatile fields. Here is a bit of information about it: http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html

      Delete
    2. It's not instruction reordering that is the problem, but read/write reordering. You can have read/write reordering without instruction reordering (such as on the Xbox 360 CPU). Yes, this is confusing, but multi-core CPUs do crazy stuff under the hood.

      Dekker's and Peterson's algorithms require sequential consistency to operate correctly which no multi-core processor gives by default. Therefore it requires a full barrier in the lock() function after the writes and before the reads. Otherwise if two threads on two separate cores try to acquire the lock simultaneously then both writes from both threads will be in progress and the threads will then (potentially) read from their caches and get stale values.

      If volatile memory accesses guarantee sequential consistency then the code is correct. I don't know the Java memory model well enough to say whether this is guaranteed, and you didn't say anything which gave me confidence. A quick read of some Java memory model documentation leaves me no more certain of the behavior.

      The ability to move reads past writes (to read from the cache immediately after a write) is vital to performance which is why it is only disabled when you specifically request a full (expensive) barrier.

      A read barrier or a write barrier or both is not sufficient.

      Either way, hand written locks are a bad idea. Among other things your code could spin for arbitrarily long periods of time which is a bad idea:
      http://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/

      Delete
  2. Hey Jan, I was wrong. Sorry about that.

    I tend to have a very visceral reaction when I see lockless programming implementations created casually, without careful discussion of the necessary fences, but I was not knowledgeable enough about Java to be able to jump in. I found this article which makes it clear that I was wrong and that Java's 'volatile' implies full memory barriers:

    http://bartoszmilewski.com/2008/11/11/who-ordered-sequential-consistency/

    He does say that *all* shared variables must be declared volatile, but I'm not sure if that applies here.

    I'm still not a fan of busy waits -- I found and fixed multiple busy wait performance problems last year -- but that's a separate topic.

    ReplyDelete
    Replies
    1. No worries at all. I'm glad we have both learnt something. For me it was read/write reordering without instruction reordering (as I read, it can happen because a CPU can buffer its writes and allow a subsequent read to bypass the write sitting in its write buffer).

      The article you found is indeed really good. I think the important part is that he says "out of the box", meaning that you don't need to employ any other techniques to make it work. But it still depends on the implementation details. In the Peterson1 class declaring 'flag' volatile is not enough, as I wrote in the post, but if you break that 2-element array into two separate volatile variables then you would be fine.

      What makes the Peterson2 class work are the memory barriers that are created: StoreStore (in one case we were fine, in the other we added an extra write to a volatile) and LoadLoad (we had to reorder the while condition to get it).

      Delete
  3. Please, I have a problem with implementing Peterson's algorithm. Can someone help me?

    ReplyDelete
  4. /*
    * To change this license header, choose License Headers in Project Properties.
    * To change this template file, choose Tools | Templates
    * and open the template in the editor.
    */

    package peterson;

    import java.util.Random;
    import java.util.Scanner;

    /**
    *
    * @author special
    */
    public class peterson {

    /**
    * @param args the command line arguments
    */ static int turn=0;
    public static void main(String[] args) {


    int n=1000;
    int num []=new int[n];
    Random rn = new Random();
    for (int i = 0; i < 1000; i++) {
    num[i]=rn.nextInt(1000);
    }
    boolean flag[] = new boolean[2];
    flag[0]= false;
    //flag[1]=true;




    Thread t2= new Thread(){
    public void run(){
    flag[0] = true;
    turn = 1;
    while (flag[1] && turn == 1)
    {

    }
    int temp;
    for(int i=0;i<num.length;i++)
    {
    for(int j=0;j<num.length;j++)
    {
    if(num[i]>num[j])
    {
    temp=num[i];
    num[i]=num[j];
    num[j]=temp;
    }
    }
    }
    System.out.println("\n\n Sorted array Decending");
    for (int i = 0; i < num.length; i++) {
    System.out.println(num[i]);

    }
    flag[0] = false;
    turn=0;

    }
    };

    Thread t1= new Thread(){
    public void run(){
    //flag[1] = true;
    //turn = 0;
    while (flag[0] && turn == 1)
    {
    // busy wait
    }


    int temp;
    for(int i=0;i<num.length;i++)
    {
    for(int j=0;j<num.length;j++)
    {
    if(num[i]<num[j])
    {
    temp=num[i];
    num[i]=num[j];
    num[j]=temp;
    }
    }
    }
    System.out.println("Sorted array accending");
    for (int i = 0; i < num.length; i++) {
    System.out.println(num[i]);

    }
    flag[1] = false;
    turn=1;

    }

    };




    //Thread t1= new p0(num,flag,turn);
    //Thread t2= new p1(num,flag,turn);
    t2.start();
    t1.start();

    }

    }

    ReplyDelete
  5. I think that your implementation of Peterson2 is not correct.

    This is the class:

    class Peterson2 implements Lock {
    private final boolean[] flag = new boolean[2];
    private volatile int victim;

    public void lock() {
    int i = ThreadID.get();
    int j = 1 - i;
    flag[i] = true;
    victim = i;
    while (victim == i && flag[j]) {};
    }

    public void unlock() {
    int i = ThreadID.get();
    flag[i] = false;
    victim = 1 - i;
    }
    }

    Here is example code:

    // global variables
    int a = 0;
    Petersion2 lock;

    void threadCode() {
    lock.lock()
    a++
    lock.unlock()
    }

    // we start threadCode in 2 threads

    Let's say that the compiler inlines the unlock method; now we have:

    void threadCode() {
    lock.lock()
    a++

    int i = ThreadID.get();
    lock.flag[i] = false;
    lock.victim = 1 - i;
    }

    Now let's say the compiler reorders:

    void threadCode() {
    lock.lock()

    int i = ThreadID.get();
    lock.flag[i] = false;

    // suspension point

    a++

    lock.victim = 1 - i;
    }

    Now let's say the first thread runs right up to the suspension point comment and is suspended there, and then the second thread starts working. It will attempt to read flag[0], which was set to false by thread 1. Let's say that thread 2 sees this change (which depends on how the hardware acts).

    Now the second thread also passes the lock, and they can both run a++.

    So the root of the issue is that (coming back to the non-inlined code)

    public void unlock() {
    int i = ThreadID.get();
    flag[i] = false;
    // suspension point
    victim = 1 - i;
    }

    If the first thread runs right up to the suspension point and then the second thread starts, it (the second thread) will read flag[0] even though there is no "happens before" relation with the "flag[0] = false" write from the first thread.

    I am basing this on this website http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html
    where author writes
    "
    When we talk about incorrectly synchronized code in the context of the Java Memory Model, we mean any code where

    1) there is a write of a variable by one thread,
    2) there is a read of the same variable by another thread and
    3) the write and read are not ordered by synchronization
    "
    ordered by synchronization here means ordered by "happens before" relation.

    ReplyDelete