理解 Memory barrier（內存屏障） | Name5566

老匹夫 2014-09-12

展開全文

參考文獻列表：
http://en./wiki/Memory_barrier
http://en./wiki/Out-of-order_execution
https://www./doc/Documentation/memory-barriers.txt

本文例子均在 Linux（g++）下驗證通過，CPU 為 X86-64 處理器架構。所有羅列的 Linux 內核代碼也均在（或只在）X86-64 下有效。

本文首先通過范例（以及內核代碼）來解釋 Memory barrier，然后介紹一個利用 Memory barrier 實現的無鎖環形緩沖區。

Memory barrier 簡介

程序在運行時內存實際的訪問順序和程序代碼編寫的訪問順序不一定一致，這就是內存亂序訪問。內存亂序訪問行為出現的理由是為了提升程序運行時的性能。內存亂序訪問主要發生在兩個階段：

編譯時，編譯器優化導致內存亂序訪問（指令重排）
運行時，多 CPU 間交互引起內存亂序訪問

Memory barrier 能夠讓 CPU 或編譯器在內存訪問上有序。一個 Memory barrier 之前的內存訪問操作必定先于其之后的完成。Memory barrier 包括兩類：

編譯器 barrier
CPU Memory barrier

很多時候，編譯器和 CPU 引起內存亂序訪問不會帶來什么問題，但一些特殊情況下，程序邏輯的正確性依賴于內存訪問順序，這時候內存亂序訪問會帶來邏輯上的錯誤，例如：

// thread 1
while (!ok);
do(x);
 
// thread 2
x = 42;
ok = 1;

此段代碼中，ok 初始化為 0，線程 1 等待 ok 被設置為 1 后執行 do 函數。假如說，線程 2 對內存的寫操作亂序執行，也就是 x 賦值后于 ok 賦值完成，那么 do 函數接受的實參就很可能出乎程序員的意料，不為 42。

編譯時內存亂序訪問

在編譯時，編譯器對代碼做出優化時可能改變實際執行指令的順序（例如 gcc 下 O2 或 O3 都會改變實際執行指令的順序）：

// test.cpp
int x, y, r;
void f()
{
    x = r;
    y = 1;
}

編譯器優化的結果可能導致 y = 1 在 x = r 之前執行完成。首先直接編譯此源文件：

g++ -S test.cpp

得到相關的匯編代碼如下：

movl    r(%rip), %eax
movl    %eax, x(%rip)
movl    $1, y(%rip)

這里我們看到，x = r 和 y = 1 并沒有亂序。現使用優化選項 O2（或 O3）編譯上面的代碼（g++ -O2 -S test.cpp），生成匯編代碼如下：

movl    r(%rip), %eax
movl    $1, y(%rip)
movl    %eax, x(%rip)

我們可以清楚的看到經過編譯器優化之后 movl $1, y(%rip) 先于 movl %eax, x(%rip) 執行。避免編譯時內存亂序訪問的辦法就是使用編譯器 barrier（又叫優化 barrier）。Linux 內核提供函數 barrier() 用于讓編譯器保證其之前的內存訪問先于其之后的完成。內核實現 barrier() 如下（X86-64 架構）：

#define barrier() __asm__ __volatile__("" ::: "memory")

現在把此編譯器 barrier 加入代碼中：

int x, y, r;
void f()
{
    x = r;
    __asm__ __volatile__("" ::: "memory");
    y = 1;
}

這樣就避免了編譯器優化帶來的內存亂序訪問的問題了（如果有興趣可以再看看編譯之后的匯編代碼）。本例中，我們還可以使用 volatile 這個關鍵字來避免編譯時內存亂序訪問（而無法避免后面要說的運行時內存亂序訪問）。volatile 關鍵字能夠讓相關的變量之間在內存訪問上避免亂序，這里可以修改 x 和 y 的定義來解決問題：

volatile int x, y;
int r;
void f()
{
    x = r;
    y = 1;
}

現加上了 volatile 關鍵字，這使得 x 相對于 y、y 相對于 x 在內存訪問上有序。在 Linux 內核中，提供了一個宏 ACCESS_ONCE 來避免編譯器對于連續的 ACCESS_ONCE 實例進行指令重排。其實 ACCESS_ONCE 實現源碼如下：

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

此代碼只是將變量 x 轉換為 volatile 的而已。現在我們就有了第三個修改方案：

int x, y, r;
void f()
{
    ACCESS_ONCE(x) = r;
    ACCESS_ONCE(y) = 1;
}

到此基本上就闡述完了我們的編譯時內存亂序訪問的問題。下面開始介紹運行時內存亂序訪問。

運行時內存亂序訪問

在運行時，CPU 雖然會亂序執行指令，但是在單個 CPU 的上，硬件能夠保證程序執行時所有的內存訪問操作看起來像是按程序代碼編寫的順序執行的，這時候 Memory barrier 沒有必要使用（不考慮編譯器優化的情況下）。這里我們了解一下 CPU 亂序執行的行為。在亂序執行時，一個處理器真正執行指令的順序由可用的輸入數據決定，而非程序員編寫的順序。
早期的處理器為有序處理器（In-order processors），有序處理器處理指令通常有以下幾步：

指令獲取
如果指令的輸入操作對象（input operands）可用（例如已經在寄存器中了），則將此指令分發到適當的功能單元中。如果一個或者多個操作對象不可用（通常是由于需要從內存中獲取），則處理器會等待直到它們可用
指令被適當的功能單元執行
功能單元將結果寫回寄存器堆（Register file，一個 CPU 中的一組寄存器）

相比之下，亂序處理器（Out-of-order processors）處理指令通常有以下幾步：

指令獲取
指令被分發到指令隊列
指令在指令隊列中等待，直到輸入操作對象可用（一旦輸入操作對象可用，指令就可以離開隊列，即便更早的指令未被執行）
指令被分配到適當的功能單元并執行
執行結果被放入隊列（而不立即寫入寄存器堆）
只有所有更早請求執行的指令的執行結果被寫入寄存器堆后，指令執行的結果才被寫入寄存器堆（執行結果重排序，讓執行看起來是有序的）

從上面的執行過程可以看出，亂序執行相比有序執行能夠避免等待不可用的操作對象（有序執行的第二步）從而提高了效率。現代的機器上，處理器運行的速度比內存快很多，有序處理器花在等待可用數據的時間里已經可以處理大量指令了。
現在思考一下亂序處理器處理指令的過程，我們能得到幾個結論：

對于單個 CPU 指令獲取是有序的（通過隊列實現）
對于單個 CPU 指令執行結果也是有序返回寄存器堆的（通過隊列實現）

由此可知，在單 CPU 上，不考慮編譯器優化導致亂序的前提下，多線程執行不存在內存亂序訪問的問題。我們從內核源碼也可以得到類似的結論（代碼不完全的摘錄）：

#ifdef CONFIG_SMP
#define smp_mb() mb()
#else
#define smp_mb() barrier()
#endif

這里可以看到，如果是 SMP 則使用 mb，mb 被定義為 CPU Memory barrier（后面會講到），而非 SMP 時，直接使用編譯器 barrier。

在多 CPU 的機器上，問題又不一樣了。每個 CPU 都存在 cache（cache 主要是為了彌補 CPU 和內存之間較慢的訪問速度），當一個特定數據第一次被特定一個 CPU 獲取時，此數據顯然不在 CPU 的 cache 中（這就是 cache miss）。此 cache miss 意味著 CPU 需要從內存中獲取數據（這個過程需要 CPU 等待數百個周期），此數據將被加載到 CPU 的 cache 中，這樣后續就能直接從 cache 上快速訪問。當某個 CPU 進行寫操作時，它必須確保其他的 CPU 已經將此數據從它們的 cache 中移除（以便保證一致性），只有在移除操作完成后此 CPU 才能安全的修改數據。顯然，存在多個 cache 時，我們必須通過一個 cache 一致性協議來避免數據不一致的問題，而這個通訊的過程就可能導致亂序訪問的出現，也就是這里說的運行時內存亂序訪問。這里不再深入討論整個細節，這是一個比較復雜的問題，有興趣可以研究 http://www./users/paulmck/scalability/paper/whymb.2010.06.07c.pdf 一文，其詳細的分析了整個過程。

現在通過一個例子來說明多 CPU 下內存亂序訪問：

// test2.cpp
#include <pthread.h>
#include <assert.h>
 
// -------------------
int cpu_thread1 = 0;
int cpu_thread2 = 1;
 
volatile int x, y, r1, r2;
 
void start()
{
    x = y = r1 = r2 = 0;
}
 
void end()
{
    assert(!(r1 == 0 && r2 == 0));
}
 
void run1()
{
    x = 1;
    r1 = y;
}
 
void run2()
{
    y = 1;
    r2 = x;
}
 
// -------------------
static pthread_barrier_t barrier_start;
static pthread_barrier_t barrier_end;
 
static void* thread1(void*)
{
    while (1) {
        pthread_barrier_wait(&barrier_start);
        run1();
        pthread_barrier_wait(&barrier_end);
    }
 
    return NULL;
}
 
static void* thread2(void*)
{
    while (1) {
        pthread_barrier_wait(&barrier_start);
        run2();
        pthread_barrier_wait(&barrier_end);
    }
 
    return NULL;
}
 
int main()
{
    assert(pthread_barrier_init(&barrier_start, NULL, 3) == 0);
    assert(pthread_barrier_init(&barrier_end, NULL, 3) == 0);
 
    pthread_t t1;
    pthread_t t2;
    assert(pthread_create(&t1, NULL, thread1, NULL) == 0);
    assert(pthread_create(&t2, NULL, thread2, NULL) == 0);
 
    cpu_set_t cs;
    CPU_ZERO(&cs);
    CPU_SET(cpu_thread1, &cs);
    assert(pthread_setaffinity_np(t1, sizeof(cs), &cs) == 0);
    CPU_ZERO(&cs);
    CPU_SET(cpu_thread2, &cs);
    assert(pthread_setaffinity_np(t2, sizeof(cs), &cs) == 0);
 
    while (1) {
        start();
        pthread_barrier_wait(&barrier_start);
        pthread_barrier_wait(&barrier_end);
        end();
    }
 
    return 0;
}

這里創建了兩個線程來運行測試代碼（需要測試的代碼將放置在 run 函數中）。我使用了 pthread barrier（區別于本文討論的 Memory barrier）主要為了讓兩個子線程能夠同時運行它們的 run 函數。此段代碼不停的嘗試同時運行兩個線程的 run 函數，以便得出我們期望的結果。在每次運行 run 函數前會調用一次 start 函數（進行數據初始化），run 運行后會調用一次 end 函數（進行結果檢查）。run1 和 run2 兩個函數運行在哪個 CPU 上則通過 cpu_thread1 和 cpu_thread2 兩個變量控制。
先編譯此程序：g++ -lpthread -o test2 test2.cpp（這里未優化，目的是為了避免編譯器優化的干擾）。需要注意的是，兩個線程運行在兩個不同的 CPU 上（CPU 0 和 CPU 1）。只要內存不出現亂序訪問，那么 r1 和 r2 不可能同時為 0，因此斷言失敗表示存在內存亂序訪問。編譯之后運行此程序，會發現存在一定概率導致斷言失敗。為了進一步說明問題，我們把 cpu_thread2 的值改為 0，換而言之就是讓兩個線程跑在同一個 CPU 下，再運行程序發現斷言不再失敗。

最后，我們使用 CPU Memory barrier 來解決內存亂序訪問的問題（X86-64 架構下）：

int cpu_thread1 = 0;
int cpu_thread2 = 1;
 
void run1()
{
    x = 1;
    __asm__ __volatile__("mfence" ::: "memory");
    r1 = y;
}
 
void run2()
{
    y = 1;
    __asm__ __volatile__("mfence" ::: "memory");
    r2 = x;
}

準備使用 Memory barrier

Memory barrier 常用場合包括：

實現同步原語（synchronization primitives）
實現無鎖數據結構（lock-free data structures）
驅動程序

實際的應用程序開發中，開發者可能完全不知道 Memory barrier 就可以開發正確的多線程程序，這主要是因為各種同步機制中已經隱含了 Memory barrier（但和實際的 Memory barrier 有細微差別），這就使得不直接使用 Memory barrier 不會存在任何問題。但是如果你希望編寫諸如無鎖數據結構，那么 Memory barrier 還是很有用的。

通常來說，在單個 CPU 上，存在依賴的內存訪問有序：

Q = P;
D = *Q;

這里內存操作有序。然而在 Alpha CPU 上，存在依賴的內存讀取操作不一定有序，需要使用數據依賴 barrier（由于 Alpha 不常見，這里就不詳細解釋了）。

在 Linux 內核中，除了前面說到的編譯器 barrier — barrier() 和 ACCESS_ONCE()，還有 CPU Memory barrier：

通用 barrier，保證讀寫操作有序的，mb() 和 smp_mb()
寫操作 barrier，僅保證寫操作有序的，wmb() 和 smp_wmb()
讀操作 barrier，僅保證讀操作有序的，rmb() 和 smp_rmb()

注意，所有的 CPU Memory barrier（除了數據依賴 barrier 之外）都隱含了編譯器 barrier。這里的 smp 開頭的 Memory barrier 會根據配置在單處理器上直接使用編譯器 barrier，而在 SMP 上才使用 CPU Memory barrier（也就是 mb()、wmb()、rmb()，回憶上面相關內核代碼）。

最后需要注意一點的是，CPU Memory barrier 中某些類型的 Memory barrier 需要成對使用，否則會出錯，詳細來說就是：一個寫操作 barrier 需要和讀操作（或數據依賴）barrier 一起使用（當然，通用 barrier 也是可以的），反之依然。

Memory barrier 的范例

讀內核代碼進一步學習 Memory barrier 的使用。
Linux 內核實現的無鎖（只有一個讀線程和一個寫線程時）環形緩沖區 kfifo 就使用到了 Memory barrier，實現源碼如下：

/*
 * A simple kernel FIFO implementation.
 *
 * Copyright (C) 2004 Stelian Pop <stelian@popies.net>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 *
 */
 
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/kfifo.h>
#include <linux/log2.h>
 
/**
 * kfifo_init - allocates a new FIFO using a preallocated buffer
 * @buffer: the preallocated buffer to be used.
 * @size: the size of the internal buffer, this have to be a power of 2.
 * @gfp_mask: get_free_pages mask, passed to kmalloc()
 * @lock: the lock to be used to protect the fifo buffer
 *
 * Do NOT pass the kfifo to kfifo_free() after use! Simply free the
 * &struct kfifo with kfree().
 */
struct kfifo *kfifo_init(unsigned char *buffer, unsigned int size,
                         gfp_t gfp_mask, spinlock_t *lock)
{
    struct kfifo *fifo;
 
    /* size must be a power of 2 */
    BUG_ON(!is_power_of_2(size));
 
    fifo = kmalloc(sizeof(struct kfifo), gfp_mask);
    if (!fifo)
        return ERR_PTR(-ENOMEM);
 
    fifo->buffer = buffer;
    fifo->size = size;
    fifo->in = fifo->out = 0;
    fifo->lock = lock;
 
    return fifo;
}
EXPORT_SYMBOL(kfifo_init);
 
/**
 * kfifo_alloc - allocates a new FIFO and its internal buffer
 * @size: the size of the internal buffer to be allocated.
 * @gfp_mask: get_free_pages mask, passed to kmalloc()
 * @lock: the lock to be used to protect the fifo buffer
 *
 * The size will be rounded-up to a power of 2.
 */
struct kfifo *kfifo_alloc(unsigned int size, gfp_t gfp_mask, spinlock_t *lock)
{
    unsigned char *buffer;
    struct kfifo *ret;
 
    /*
     * round up to the next power of 2, since our 'let the indices
     * wrap' technique works only in this case.
     */
    if (!is_power_of_2(size)) {
        BUG_ON(size > 0x80000000);
        size = roundup_pow_of_two(size);
    }
 
    buffer = kmalloc(size, gfp_mask);
    if (!buffer)
        return ERR_PTR(-ENOMEM);
 
    ret = kfifo_init(buffer, size, gfp_mask, lock);
 
    if (IS_ERR(ret))
        kfree(buffer);
 
    return ret;
}
EXPORT_SYMBOL(kfifo_alloc);
 
/**
 * kfifo_free - frees the FIFO
 * @fifo: the fifo to be freed.
 */
void kfifo_free(struct kfifo *fifo)
{
    kfree(fifo->buffer);
    kfree(fifo);
}
EXPORT_SYMBOL(kfifo_free);
 
/**
 * __kfifo_put - puts some data into the FIFO, no locking version
 * @fifo: the fifo to be used.
 * @buffer: the data to be added.
 * @len: the length of the data to be added.
 *
 * This function copies at most @len bytes from the @buffer into
 * the FIFO depending on the free space, and returns the number of
 * bytes copied.
 *
 * Note that with only one concurrent reader and one concurrent
 * writer, you don't need extra locking to use these functions.
 */
unsigned int __kfifo_put(struct kfifo *fifo,
                         const unsigned char *buffer, unsigned int len)
{
    unsigned int l;
 
    len = min(len, fifo->size - fifo->in + fifo->out);
 
    /*
     * Ensure that we sample the fifo->out index -before- we
     * start putting bytes into the kfifo.
     */
 
    smp_mb();
 
    /* first put the data starting from fifo->in to buffer end */
    l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
    memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);
 
    /* then put the rest (if any) at the beginning of the buffer */
    memcpy(fifo->buffer, buffer + l, len - l);
 
    /*
     * Ensure that we add the bytes to the kfifo -before-
     * we update the fifo->in index.
     */
 
    smp_wmb();
 
    fifo->in += len;
 
    return len;
}
EXPORT_SYMBOL(__kfifo_put);
 
/**
 * __kfifo_get - gets some data from the FIFO, no locking version
 * @fifo: the fifo to be used.
 * @buffer: where the data must be copied.
 * @len: the size of the destination buffer.
 *
 * This function copies at most @len bytes from the FIFO into the
 * @buffer and returns the number of copied bytes.
 *
 * Note that with only one concurrent reader and one concurrent
 * writer, you don't need extra locking to use these functions.
 */
unsigned int __kfifo_get(struct kfifo *fifo,
                         unsigned char *buffer, unsigned int len)
{
    unsigned int l;
 
    len = min(len, fifo->in - fifo->out);
 
    /*
     * Ensure that we sample the fifo->in index -before- we
     * start removing bytes from the kfifo.
     */
 
    smp_rmb();
 
    /* first get the data from fifo->out until the end of the buffer */
    l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
    memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);
 
    /* then get the rest (if any) from the beginning of the buffer */
    memcpy(buffer + l, fifo->buffer, len - l);
 
    /*
     * Ensure that we remove the bytes from the kfifo -before-
     * we update the fifo->out index.
     */
 
    smp_mb();
 
    fifo->out += len;
 
    return len;
}
EXPORT_SYMBOL(__kfifo_get);

為了更好的理解上面的源碼，這里順帶說一下此實現使用到的一些和本文主題無關的技巧：

使用與操作來求取環形緩沖區的下標，相比取余操作來求取下標的做法效率要高不少。使用與操作求取下標的前提是環形緩沖區的大小必須是 2 的 N 次方，換而言之就是說環形緩沖區的大小為一個僅有一個 1 的二進制數，那么 index & (size – 1) 則為求取的下標（這不難理解）
使用了 in 和 out 兩個索引且 in 和 out 是一直遞增的（此做法比較巧妙），這樣能夠避免一些復雜的條件判斷（某些實現下，in == out 時還無法區分緩沖區是空還是滿）

這里，索引 in 和 out 被兩個線程訪問。in 和 out 指明了緩沖區中實際數據的邊界，也就是 in 和 out 同緩沖區數據存在訪問上的順序關系，由于未使用同步機制，那么保證順序關系就需要使用到 Memory barrier 了。索引 in 和 out 都分別只被一個線程修改，而被兩個線程讀取。__kfifo_put 先通過 in 和 out 來確定可以向緩沖區中寫入數據量的多少，這時，out 索引應該先被讀取后才能真正的將用戶 buffer 中的數據寫入緩沖區，因此這里使用到了 smp_mb()，對應的，__kfifo_get 也使用 smp_mb() 來確保修改 out 索引之前緩沖區中數據已經被成功讀取并寫入用戶 buffer 中了。對于 in 索引，在 __kfifo_put 中，通過 smp_wmb() 保證先向緩沖區寫入數據后才修改 in 索引，由于這里只需要保證寫入操作有序，故選用寫操作 barrier，在 __kfifo_get 中，通過 smp_rmb() 保證先讀取了 in 索引（這時候 in 索引用于確定緩沖區中實際存在多少可讀數據）才開始讀取緩沖區中數據（并寫入用戶 buffer 中），由于這里只需要保證讀取操作有序，故選用讀操作 barrier。

到這里，Memory barrier 就介紹完畢了。