Queued Spinlocks


Queued spinlocks are designed as a replacement for standard spinlocks as the main synchronization mechanism in the kernel mode of an OS. They provide better performance on multiprocessor systems by reducing contention on the interconnect during busy waiting, while also providing fairness and locality of reference. Standard spinlocks suffer from interconnect contention because all processors spin on a single shared memory location; the cache coherency protocol turns every update of that location into a significant burst of traffic between processors. With queued spinlocks, on the other hand, each processor spins on its own memory location, which avoids the sharing and eases the traffic between processors; in addition, the memory used for spinning is located on the stack currently used by the processor, which further improves performance. Queued spinlocks also guarantee that processors enter the guarded critical section in the same order in which they arrive, which is not the case with standard spinlocks.
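
For contrast, here is a minimal sketch of a standard test-and-set spinlock, written with C++ std::atomic for brevity (the names are illustrative, not part of the queued-spinlock code below). Every waiter spins on the same shared flag, so each release bounces that flag's cache line across all waiting processors:

#include <atomic>
#include <immintrin.h> // _mm_pause

// a classic test-and-set spinlock: all waiters share one flag
struct tas_lock
{
    std::atomic<bool> locked{ false };

    void lock()
    {
        // every spinning CPU reads and writes the same location
        while (locked.exchange(true, std::memory_order_acquire))
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
    }

    void unlock()
    {
        locked.store(false, std::memory_order_release);
    }
};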

Implementation

// represents a processor in the wait queue of the spinlock
struct qsl_entry
{

    // next processor in the queue that is waiting to enter the section
    qsl_entry* next;

    // indicates whether access to the section
    // has been granted to the processor
    int state;

};

// queued spinlock
struct qsl
{

    // the last processor in the queue
    // that is waiting to enter the section
    qsl_entry* tail;

};

// requests access to the critical section guarded by the spinlock;
// if the section is already taken, it makes the processor wait
// and inserts it into the queue
// lck - queued lock used to guard the section
// ent - entry that represents the processor in the spinlock's queue
void lock_qsl(qsl* lck, qsl_entry* ent)
{
    __asm
    {
        mov eax, ent;
        mov ebx, lck;

        // prepare queue entry
        mov dword ptr [eax], 0;
        mov edx, eax;
        mov [eax]qsl_entry.state, 1;

        // store it as the last entry of the queue
        lock xchg [ebx], eax;

        // if the section was available, access is granted immediately
        test eax, eax;
        jz enter_section;
            // link new entry with the rest of queue
            mov [eax], edx;

            // wait for processor's turn
            wait1:
                pause;
                cmp [edx]qsl_entry.state, 1;
                je wait1;

        enter_section:
    }
}

// releases access to the critical section guarded by the spinlock;
// it also grants access to the next processor in the queue, if there is one
// lck - queued lock used to guard the section
// ent - entry that represents the processor in the spinlock's queue
void unlock_qsl(qsl* lck, qsl_entry* ent)
{
    __asm
    {
        mov eax, ent;
        mov ebx, lck;
        xor ecx, ecx;
        mov edx, eax;

        // release the section if no one is waiting; otherwise
        // detect that another processor is queued behind us
        lock cmpxchg [ebx], ecx;

        // was this processor the last one in the queue?
        je exit_section;

            // wait for the next processor to finish
            // inserting itself into the queue
            wait2:
                pause;
                cmp [edx], ecx;
                je wait2;

            // grant access to next processor in the queue
            mov eax, [edx];
            mov [eax]qsl_entry.state, ecx;

        exit_section:
    }
}

If you need to use this in a 64-bit environment, you can use the 64-bit registers (rax, rbx, …) instead of the 32-bit ones (eax, ebx, …) and change the type of the state field in the qsl_entry structure from int to long long. Keep in mind, though, that the 64-bit MSVC compiler does not support __asm blocks, so the two routines would have to be moved into a separate assembly file (built with ml64, for example).
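
Alternatively, the algorithm can be expressed portably. Below is a minimal sketch of the same queued (MCS-style) lock using C++ std::atomic; the 2-suffixed names are mine, chosen only to keep it apart from the assembly version, and the sketch is an illustration rather than a drop-in replacement:

#include <atomic>
#include <immintrin.h> // _mm_pause

// illustrative portable version of the queued spinlock
struct qsl_entry2
{
    std::atomic<qsl_entry2*> next;   // next waiter in the queue
    std::atomic<int>         state;  // 1 = still waiting, 0 = lock granted
};

struct qsl2
{
    std::atomic<qsl_entry2*> tail;   // last waiter, null when lock is free
};

void lock_qsl2(qsl2* lck, qsl_entry2* ent)
{
    ent->next.store(nullptr, std::memory_order_relaxed);
    ent->state.store(1, std::memory_order_relaxed);

    // the lock xchg step: publish ourselves as the new tail
    qsl_entry2* prev = lck->tail.exchange(ent, std::memory_order_acq_rel);
    if (prev == nullptr)
        return;                      // the lock was free, we own it now

    // link behind the previous tail, then spin on our own state field
    prev->next.store(ent, std::memory_order_release);
    while (ent->state.load(std::memory_order_acquire) == 1)
        _mm_pause();
}

void unlock_qsl2(qsl2* lck, qsl_entry2* ent)
{
    // the lock cmpxchg step: if we are still the tail, nobody is waiting
    qsl_entry2* expected = ent;
    if (lck->tail.compare_exchange_strong(expected, nullptr,
                                          std::memory_order_acq_rel))
        return;

    // a waiter is enqueueing itself: wait for it to set our next field
    qsl_entry2* nxt;
    while ((nxt = ent->next.load(std::memory_order_acquire)) == nullptr)
        _mm_pause();

    // hand the lock over to the next processor in the queue
    nxt->state.store(0, std::memory_order_release);
}

The memory orders here approximate what the lock prefix and program order already give the assembly version on x86.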

Explanation

Let's examine the function that acquires the lock more closely. Three shorthand names are used in the discussion:

P-CPU – the last CPU that arrived at the lock. It is either holding the lock or trying to acquire it.
N-CPU – the CPU that should get the lock next, once the CPU that currently holds it releases it.
C-CPU – the current CPU, the one executing the instruction stream being discussed.

Before the execution of the lock xchg [ebx], eax instruction, [ebx] refers to lck->tail, which holds the address of the qsl_entry used by the P-CPU, while eax holds the address of the qsl_entry used by the C-CPU. After the execution, eax holds the address of the qsl_entry whose next field must be made to point to the C-CPU's qsl_entry, and lck->tail holds the address of the qsl_entry used by the C-CPU. Needless to say, the lock prefix ensures all of this happens atomically even when multiple CPUs hit the instruction simultaneously.
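
In terms of the portable std::atomic sketch above, this whole step is a single exchange call (variable names as in that sketch):

// atomically publish our entry as the new tail and receive
// the previous tail (null if the lock was free)
qsl_entry2* prev = lck->tail.exchange(ent, std::memory_order_acq_rel);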

If lck->tail was null, the section was not occupied by any CPU. Since eax now holds that old value, a simple test is enough to see whether the section was available: test eax, eax followed by jz enter_section. If the test fails, the section is already locked. In that case we instruct the P-CPU to notify the C-CPU that the section is free when the P-CPU exits it. We do this by writing the address of the qsl_entry used by the C-CPU into the next field of the qsl_entry used by the P-CPU, or in code: mov [eax], edx.
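
In the portable sketch, these two outcomes are the null test on the returned previous tail and, if it fails, the link that follows:

if (prev == nullptr)
    return; // the section was free, we own the lock

// otherwise link behind the previous tail (the mov [eax], edx step)
prev->next.store(ent, std::memory_order_release);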

Now we just need to wait for the P-CPU to set the state field of the C-CPU's qsl_entry to 0, which is implemented in the wait1 loop. Note that cmp [edx]qsl_entry.state, 1 does not need a lock prefix: an instruction that performs only a read or only a write of a byte, word or dword (or a qword when the CPU runs in 64-bit mode) is guaranteed to be atomic as long as the operand is properly aligned.
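
Expressed with the portable sketch's types, the same spin is an ordinary load in a loop; no locked instruction is involved:

// a plain aligned read is atomic by itself, so the spin needs
// no lock prefix, only the pause hint between iterations
while (ent->state.load(std::memory_order_acquire) == 1)
    _mm_pause();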

Now let’s see how freeing the acquired lock works:

Before the execution of the lock cmpxchg [ebx], ecx instruction, eax holds the address of the qsl_entry used by the C-CPU, ebx holds the address of lck->tail and ecx is 0. After the execution, if no other CPU was waiting for the lock, then the qsl_entry of the C-CPU and lck->tail were equal, so cmpxchg succeeded ([ebx] == eax) and set lck->tail to null, which marks the section as free again. Otherwise the instruction performed no action apart from setting the C-CPU's flags, which means the C-CPU must notify the N-CPU that the lock is free for it to acquire; that N-CPU has already replaced lck->tail with its own qsl_entry using the lock xchg instruction while trying to acquire the lock.
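
Again in terms of the portable sketch, this conditional release is a single compare-and-swap:

// if lck->tail still points to our entry, nobody is queued behind us:
// reset the tail to null so the section becomes free again
qsl_entry2* expected = ent;
bool was_last = lck->tail.compare_exchange_strong(expected, nullptr,
                                                  std::memory_order_acq_rel);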

Since cmpxchg sets the CPU's zero flag with the result of the comparison, we can use the je exit_section instruction to skip the notification when no CPUs are waiting.

When there are CPUs waiting, we must make sure the N-CPU has finished putting itself into the queue before we notify it. Since the C-CPU's qsl_entry is no longer lck->tail, we know that some CPU has enqueued itself behind us, but the next field of the C-CPU's qsl_entry may still be null: that is the window between the execution of lock xchg [ebx], eax and mov [eax], edx in the function that acquires the lock. So we wait for the N-CPU to set next, and then we are good to go; this is implemented by the wait2 loop. Finally, when the loop exits, we notify the N-CPU that it is free to enter the section by setting the state field of its qsl_entry to 0.
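
In the portable sketch, this wait-then-notify pair looks like this:

// spin until the N-CPU has linked itself behind us,
// then clear its state so it may enter the section
qsl_entry2* nxt;
while ((nxt = ent->next.load(std::memory_order_acquire)) == nullptr)
    _mm_pause();
nxt->state.store(0, std::memory_order_release);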

Using the Code

This is a simple example:

#include <windows.h>
#include <stdio.h>

qsl lock;

DWORD WINAPI thread_1( LPVOID p )
{
    qsl_entry entry;
    while( 1 )
    {
        lock_qsl( &lock, &entry );
        printf( "thread " );
        printf( "%d", (int)(INT_PTR)p );
        printf( "\n" );
        unlock_qsl( &lock, &entry );
    }
    return 0;
}

int main()
{
    CreateThread( NULL, 0, thread_1, (void*)1, 0, NULL );
    CreateThread( NULL, 0, thread_1, (void*)2, 0, NULL );
    getchar();
    return 0;
}