Sunday, March 10, 2013

Peterson’s locking algorithm in Java

Recently I started reading a new book, “The Art of Multiprocessor Programming”. At the very beginning the authors bring up the topic of lock implementation and present Peterson’s algorithm, an elegant two-thread mutual exclusion algorithm which can be generalized to n threads, but let’s start slow. After a short description, which can also be found on Wikipedia, an implementation follows:
class Peterson implements Lock {
 private boolean[] flag = new boolean[2];
 private int victim;

 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (flag[j] && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
 }
}
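The Lock interface and the ThreadID helper come from the book and are not part of the JDK; ThreadID assigns each of the two competing threads an index, 0 or 1. A minimal sketch of it (my own, using a ThreadLocal backed by a counter) could look like this:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical ThreadID sketch (not the book's exact code): hands out
// consecutive indices, so the first two threads to call get() see 0 and 1.
class ThreadID {
 private static final AtomicInteger nextId = new AtomicInteger(0);
 private static final ThreadLocal<Integer> id = new ThreadLocal<Integer>() {
  @Override
  protected Integer initialValue() {
   return nextId.getAndIncrement();
  }
 };

 public static int get() {
  return id.get();
 }
}
```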
Now, if we take that as a sample Java implementation, it is seriously flawed. We have two problems here: a data race (mentioned in my previous post) and the possibility of instruction reordering, as no happens-before relationship is established between the threads.

How can we solve it? We have two possible ways:

1) Mark each shared variable as volatile. That alone is insufficient in the case of the ‘flag’ array, since declaring the array volatile only ensures that the reference value is visible to other threads; the array elements themselves are still accessed with plain, non-volatile semantics, so a write to an element may never be guaranteed visible to the other thread. To fix that we have two options: we can use either AtomicBoolean[] or AtomicIntegerArray (as there is no AtomicBooleanArray) and assume 0/1 for false/true. In the former case we need an object per array element; in the latter, only the AtomicIntegerArray object and its backing array.
Here is a code using AtomicIntegerArray:
class Peterson1 implements Lock {
 private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag.set(i, 1);
  victim = i;
  while(flag.get(j) == 1 && victim == i) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag.set(i, 0);
 }
}
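For comparison, the AtomicBoolean[] variant mentioned above (one extra object per array element) could look like the following sketch; to keep it self-contained I pass the thread index (0 or 1) in explicitly instead of calling the book's ThreadID.get():

```java
import java.util.concurrent.atomic.AtomicBoolean;

class PetersonAB {
 // One AtomicBoolean per slot: each element is then read and written
 // with volatile semantics, unlike the elements of a plain boolean[].
 private final AtomicBoolean[] flag =
   new AtomicBoolean[] { new AtomicBoolean(false), new AtomicBoolean(false) };
 private volatile int victim;

 public void lock(int i) {
  int j = 1 - i;
  flag[i].set(true);
  victim = i;
  while (flag[j].get() && victim == i) {} // spin until it is our turn
 }

 public void unlock(int i) {
  flag[i].set(false);
 }
}
```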
2) Another approach is to use the technique that I described in my article "Flushing with a volatile". In the code above (the Peterson class) we can declare ‘victim’ volatile, which makes writes to it visible to other threads. Now, what about the visibility of writes to the ‘flag’ variable? If we ensure that each write to ‘flag’ is followed by a write to the volatile ‘victim’, then when a thread sees an up-to-date ‘victim’ value, it will also see an up-to-date ‘flag’ value.
There are 2 places where a write to a 'flag' happens:
In lock() method:
flag[i] = true;
victim = i;
Here a write to the flag is followed by a write to the volatile victim, so all is good. Then in the unlock() method we have:
int i = ThreadID.get();
flag[i] = false;
If we leave the unlock() method as it is, the write of the false value to ‘flag’ may not be seen by the other thread. To ensure visibility, we should add a write to the ‘victim’ variable:
victim = 1 - i; // (this is the current victim value anyway)

But that’s not all. What about reading the ‘flag’ variable? It must be preceded by a read of a volatile field. In the lock() method we have the only read of the ‘flag’ variable:
while (flag[j] && victim == i) {};
To take advantage of the Java memory model and make the ‘flag’ value visible, we have to reorder the above condition to:
while (victim == i && flag[j]) {};
Here is the whole code:
class Peterson2 implements Lock {
 private final boolean[] flag = new boolean[2];
 private volatile int victim;
 
 public void lock() {
  int i = ThreadID.get();
  int j = 1 - i;
  flag[i] = true;
  victim = i;
  while (victim == i && flag[j]) {};
 }

 public void unlock() {
  int i = ThreadID.get();
  flag[i] = false;
  victim = 1 - i;
 }
} 
What about a microbenchmark? I have measured the execution of the following task:
class IncrementingTask implements Runnable {
 Lock lock;
 int a;
 IncrementingTask(Lock lock) {
  this.lock = lock;
 }
 @Override
 public void run() {
  for(int i = 0; i < 50000000; i++) { // 5*10^7 iterations when 2 threads execute this task, or 10^8 for 1 thread, so that the total amount of work and the end result are the same: the ‘a’ field must end up equal to 10^8.
   lock.lock();
   a++;
   lock.unlock();
  }
 }
} 
with the Peterson1 and Peterson2 implementations, each with one and with two threads executing it, on the following machine:

$ cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l
12
$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
$ cat /proc/version
Linux version 2.6.32-33-server (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #72-Ubuntu SMP Fri Jul 29 21:21:55 UTC 2011 
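The measurement harness itself is not shown above; a hypothetical driver along these lines (the class and method names are my own) would produce comparable numbers:

```java
// Hypothetical benchmark driver, not my exact harness: times the
// incrementing task from the post on a given number of threads.
class Harness {
 interface Lock { void lock(); void unlock(); } // the post's Lock interface

 static int a; // shared counter, guarded by the lock under test

 static long timeNs(final Lock lock, int threads, final int itersPerThread) {
  a = 0;
  Runnable task = new Runnable() {
   public void run() {
    for (int k = 0; k < itersPerThread; k++) {
     lock.lock();
     a++;
     lock.unlock();
    }
   }
  };
  Thread[] ts = new Thread[threads];
  for (int i = 0; i < threads; i++) ts[i] = new Thread(task);
  long start = System.nanoTime();
  for (Thread t : ts) t.start();
  try {
   for (Thread t : ts) t.join(); // join also makes 'a' safely readable here
  } catch (InterruptedException e) {
   throw new RuntimeException(e);
  }
  return System.nanoTime() - start;
 }
}
```

For the two-thread Peterson1 case this would be called as timeNs(lock, 2, 50000000) and the elapsed times averaged over 10 runs.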

Below are the average times (ns) for each case from 10 runs:
                   1 thread    2 threads
Peterson1    1840693030    42213715991
Peterson2    1836106908    35781376968
I find the results quite interesting. First off, when there is no contention (column “1 thread”) the performance of both implementations is very similar, with Peterson2 slightly ahead. When we introduce the second thread and thus increase the contention, Peterson2 is significantly (about 15%) faster. But what surprises me most is the difference between executing the task on 1 versus 2 threads: 1 thread does the job about 20 times faster than 2 threads! We can see clearly how expensive it is to coordinate 2 threads when they want to access a shared variable. And what if we use the Java synchronized mechanism and an intrinsic object lock? Then the average time (again from 10 runs) for 2 threads is 2678088766 ns. That is way better than our lock implementations, but still slower than 1 thread paying the Peterson lock acquisition overhead.
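The synchronized variant is not shown above; one plausible way to put an intrinsic object lock behind the same lock()/unlock() shape (my own sketch, using wait/notify so that acquire and release can live in separate methods) is:

```java
// Sketch of a lock backed by an intrinsic monitor; a guess at the shape
// of the synchronized variant used for the comparison, not the exact code.
class MonitorLock {
 private boolean held = false;

 public synchronized void lock() {
  while (held) {
   try {
    wait(); // block until the current holder releases
   } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
   }
  }
  held = true;
 }

 public synchronized void unlock() {
  held = false;
  notify(); // wake one waiting thread, if any
 }
}
```

Unlike the spinning Peterson locks, a blocked thread here sleeps inside wait() instead of burning CPU, which is part of why it fares better under contention.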

Ok, so that’s it for today. Hope you enjoyed the article and feel free to leave a comment!

8 comments:

  1. This code is dangerously broken and should never be used.

    Volatile reads and writes in Java are equivalent to monitor acquire and monitor release, also known as read acquire and write release.

    These semantics, which are roughly equivalent to preventing reads from passing reads, and writes from passing writes, are not sufficient for Dekker's or Peterson's algorithm. To make these algorithms robust you need a full barrier to prevent reads from passing earlier writes. Such a barrier is more expensive, which is why it has to be specifically requested.

    Lockless programming is very dangerous. Any discussion of lockless programming which doesn't discuss CPU reordering and the necessary barriers to prevent it should be assumed to be incorrect.

    See these resources for more information:
    http://preshing.com/20120515/memory-reordering-caught-in-the-act

    http://msdn.microsoft.com/en-us/library/windows/desktop/ee418650(v=vs.85).aspx

    ReplyDelete
    Replies
    1. Hi Bruce! Thanks for the comment. I assume you are talking about the Peterson2 class. Could you point out what exactly is wrong there? I see you are concerned about the instructions reordering. That could have been a problem prior to Java 1.5 but JSR 133 fixes it (http://www.jcp.org/en/jsr/detail?id=133). It is no longer possible to reorder normal field accesses around volatile fields. Here is a bit of information about it: http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html

      Delete
    2. It's not instruction reordering that is the problem, but read/write reordering. You can have read/write reordering without instruction reordering (such as on the Xbox 360 CPU). Yes, this is confusing, but multi-core CPUs do crazy stuff under the hood.

      Dekker's and Peterson's algorithms require sequential consistency to operate correctly which no multi-core processor gives by default. Therefore it requires a full barrier in the lock() function after the writes and before the reads. Otherwise if two threads on two separate cores try to acquire the lock simultaneously then both writes from both threads will be in progress and the threads will then (potentially) read from their caches and get stale values.

      If volatile memory accesses guarantee sequential consistency then the code is correct. I don't know the Java memory model well enough to say whether this is guaranteed, and you didn't say anything which gave me confidence. A quick read of some Java memory model documentation leaves me no more certain of the behavior.

      The ability to move reads past writes (to read from the cache immediately after a write) is vital to performance which is why it is only disabled when you specifically request a full (expensive) barrier.

      A read barrier or a write barrier or both is not sufficient.

      Either way, hand written locks are a bad idea. Among other things your code could spin for arbitrarily long periods of time which is a bad idea:
      http://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/

      Delete
  2. Hey Jan, I was wrong. Sorry about that.

    I tend to have a very visceral reaction when I see lockless programming implementations created casually, without careful discussion of the necessary fences, but I was not knowledgeable enough about Java to be able to jump in. I found this article which makes it clear that I was wrong and that Java's 'volatile' implies full memory barriers:

    http://bartoszmilewski.com/2008/11/11/who-ordered-sequential-consistency/

    He does say that *all* shared variables must be declared volatile, but I'm not sure if that applies here.

    I'm still not a fan of busy waits -- I found and fixed multiple busy wait performance problems last year -- but that's a separate topic.

    ReplyDelete
    Replies
    1. No worries at all. I'm glad we have both learnt something. For me it was read/write reordering without instruction reordering (as I read, it can happen because a CPU can buffer its writes and allow a subsequent read to bypass the write sitting in its write buffer).

      The article you found is indeed really good. I think the important part is that he says "out of the box", meaning that you don't need to employ any other techniques to make it work. But it still depends on the implementation details. In the Peterson1 class declaring 'flag' volatile is not enough, as I wrote in the post, but if you break that 2-element array into two separate volatile variables then you would be fine.

      What makes the Peterson2 class work are the memory barriers that are created: StoreStore (in one case we were fine, in the other we added an extra write to a volatile) and LoadLoad (we had to reorder the while condition to get it).

      Delete
  3. Please, I have a problem with implementing Peterson's algorithm. Can someone help me?

    ReplyDelete
  4. /*
    * To change this license header, choose License Headers in Project Properties.
    * To change this template file, choose Tools | Templates
    * and open the template in the editor.
    */

    package peterson;

    import java.util.Random;
    import java.util.Scanner;

    /**
    *
    * @author special
    */
    public class peterson {

    /**
    * @param args the command line arguments
    */ static int turn=0;
    public static void main(String[] args) {


    int n=1000;
    int num []=new int[n];
    Random rn = new Random();
    for (int i = 0; i < 1000; i++) {
    num[i]=rn.nextInt(1000);
    }
    boolean flag[] = new boolean[2];
    flag[0]= false;
    //flag[1]=true;




    Thread t2= new Thread(){
    public void run(){
    flag[0] = true;
    turn = 1;
    while (flag[1] && turn == 1)
    {

    }
    int temp;
    for(int i=0;i<num.length;i++)
    {
    for(int j=0;j<num.length;j++)
    {
    if(num[i]>num[j])
    {
    temp=num[i];
    num[i]=num[j];
    num[j]=temp;
    }
    }
    }
    System.out.println("\n\n Sorted array Decending");
    for (int i = 0; i < num.length; i++) {
    System.out.println(num[i]);

    }
    flag[0] = false;
    turn=0;

    }
    };

    Thread t1= new Thread(){
    public void run(){
    //flag[1] = true;
    //turn = 0;
    while (flag[0] && turn == 1)
    {
    // busy wait
    }


    int temp;
    for(int i=0;i<num.length;i++)
    {
    for(int j=0;j<num.length;j++)
    {
    if(num[i]<num[j])
    {
    temp=num[i];
    num[i]=num[j];
    num[j]=temp;
    }
    }
    }
    System.out.println("Sorted array accending");
    for (int i = 0; i < num.length; i++) {
    System.out.println(num[i]);

    }
    flag[1] = false;
    turn=1;

    }

    };




    //Thread t1= new p0(num,flag,turn);
    //Thread t2= new p1(num,flag,turn);
    t2.start();
    t1.start();

    }

    }

    ReplyDelete
  5. I think that your implementation of Peterson2 is not correct.

    This is the class:

    class Peterson2 implements Lock {
    private final boolean[] flag = new boolean[2];
    private volatile int victim;

    public void lock() {
    int i = ThreadID.get();
    int j = 1 - i;
    flag[i] = true;
    victim = i;
    while (victim == i && flag[j]) {};
    }

    public void unlock() {
    int i = ThreadID.get();
    flag[i] = false;
    victim = 1 - i;
    }
    }

    Here is example code:

    // global variables
    int a = 0;
    Petersion2 lock;

    void threadCode() {
    lock.lock()
    a++
    lock.unlock()
    }

    // we start threadCode in 2 threads

    Let's say that the compiler inlines the unlock method; now we have:

    void threadCode() {
    lock.lock()
    a++

    int i = ThreadID.get();
    lock.flag[i] = false;
    lock.victim = 1 - i;
    }

    Now let's say the compiler reorders:

    void threadCode() {
    lock.lock()

    int i = ThreadID.get();
    lock.flag[i] = false;

    // suspension point

    a++

    lock.victim = 1 - i;
    }

    Now let's say the first thread runs right up to the suspension point comment and is suspended there, and then the second thread starts working. It will attempt to read flag[0], which was set to false by thread 1. Let's say that thread 2 sees this change (which depends on how the hardware acts).

    Now the second thread also passes the lock, and they can both run a++.

    So the root of the issue is that (coming back to the non-inlined code)

    public void unlock() {
    int i = ThreadID.get();
    flag[i] = false;
    // suspension point
    victim = 1 - i;
    }

    If the first thread runs right up to the suspension point and then the second thread starts, it (the second thread) will read flag[0] even though there is no "happens before" relation with the "flag[0] = false" write from the first thread.

    I am basing this on this website http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html
    where author writes
    "
    When we talk about incorrectly synchronized code in the context of the Java Memory Model, we mean any code where

    1) there is a write of a variable by one thread,
    2) there is a read of the same variable by another thread and
    3) the write and read are not ordered by synchronization
    "
    ordered by synchronization here means ordered by "happens before" relation.

    ReplyDelete