Windows 10 poor performance compared to Windows 7 (page fault handling is not scalable, severe lock contention when number of threads > 16)

Stack Overflow user

Asked 2017-07-11 01:39:10

2 answers, 5.3K views, 0 followers, 30 votes

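We set up two identical HP Z840 workstations with the following specs:

2x Xeon E5-2690 v4 @ 2.60GHz (Turbo Boost on, HT off, 28 logical CPUs in total)
32 GB DDR4 2400 memory, quad-channel

and installed Windows 7 SP1 (x64) on one and Windows 10 Creators Update (x64) on the other.

Then we ran a small memory benchmark (code below, built with VS2015 Update 3, 64-bit architecture) that performs memory allocation, fill, and free simultaneously from multiple threads.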

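#include <Windows.h>
#include <stdio.h>   // printf
#include <stdlib.h>  // malloc, free, atoi
#include <string.h>  // memset
#include <vector>
#include <ppl.h>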

unsigned __int64 ZQueryPerformanceCounter()
{
    unsigned __int64 c;
    ::QueryPerformanceCounter((LARGE_INTEGER *)&c);
    return c;
}

unsigned __int64 ZQueryPerformanceFrequency()
{
    unsigned __int64 c;
    ::QueryPerformanceFrequency((LARGE_INTEGER *)&c);
    return c;
}

class CZPerfCounter {
public:
    CZPerfCounter() : m_st(ZQueryPerformanceCounter()) {};
    void reset() { m_st = ZQueryPerformanceCounter(); };
    unsigned __int64 elapsedCount() { return ZQueryPerformanceCounter() - m_st; };
    unsigned long elapsedMS() { return (unsigned long)(elapsedCount() * 1000 / m_freq); };
    unsigned long elapsedMicroSec() { return (unsigned long)(elapsedCount() * 1000 * 1000 / m_freq); };
    static unsigned __int64 frequency() { return m_freq; };
private:
    unsigned __int64 m_st;
    static unsigned __int64 m_freq;
};

unsigned __int64 CZPerfCounter::m_freq = ZQueryPerformanceFrequency();



int main(int argc, char ** argv)
{
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    int ncpu = sysinfo.dwNumberOfProcessors;

    if (argc == 2) {
        ncpu = atoi(argv[1]);
    }

    {
        printf("No of threads %d\n", ncpu);

        try {
            concurrency::Scheduler::ResetDefaultSchedulerPolicy();
            int min_threads = 1;
            int max_threads = ncpu;
            concurrency::SchedulerPolicy policy
            (2 // two entries of policy settings
                , concurrency::MinConcurrency, min_threads
                , concurrency::MaxConcurrency, max_threads
            );
            concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
        }
        catch (concurrency::default_scheduler_exists &) {
            printf("Cannot set concurrency runtime scheduler policy (Default scheduler already exists).\n");
        }

        static int cnt = 100;
        static int num_fills = 1;
        CZPerfCounter pcTotal;

        // malloc/free
        printf("malloc/free\n");
        {
            CZPerfCounter pc;
            for (int i = 1 * 1024 * 1024; i <= 8 * 1024 * 1024; i *= 2) {
                concurrency::parallel_for(0, 50, [i](size_t x) {
                    std::vector<void *> ptrs;
                    ptrs.reserve(cnt);
                    for (int n = 0; n < cnt; n++) {
                        auto p = malloc(i);
                        ptrs.emplace_back(p);
                    }
                    // Loop variable renamed so it no longer shadows the lambda parameter x.
                    for (int fill = 0; fill < num_fills; fill++) {
                        for (auto p : ptrs) {
                            memset(p, num_fills, i);
                        }
                    }
                    for (auto p : ptrs) {
                        free(p);
                    }
                });
                printf("size %4d MB,  elapsed %8.2f s, \n", i / (1024 * 1024), pc.elapsedMS() / 1000.0);
                pc.reset();
            }
        }
        printf("\n");
        printf("Total %6.2f s\n", pcTotal.elapsedMS() / 1000.0);
    }

    return 0;
}

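Surprisingly, the results on Windows 10 CU are very bad compared to Windows 7. I plotted the results for 1 MB and 8 MB chunk sizes, varying the number of threads from 2, 4, ... up to 28. While Windows 7 performed only slightly worse as we increased the number of threads, Windows 10 scaled far worse.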
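For readers who want to reproduce the thread sweep without PPL, here is a minimal portable sketch of the same allocate-fill-free pattern using std::thread. The names alloc_fill_free and timed_run are hypothetical and not part of the original benchmark, and the work is split round-robin across threads rather than by the PPL scheduler, so absolute numbers will not match the original plots.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// Allocate `count` blocks of `block_size` bytes, fill each once, then free them.
// The fill touches every page, so freshly mapped memory is faulted in here.
inline void alloc_fill_free(std::size_t block_size, int count) {
    std::vector<void*> ptrs;
    ptrs.reserve(static_cast<std::size_t>(count));
    for (int n = 0; n < count; ++n) {
        if (void* p = std::malloc(block_size)) ptrs.push_back(p);
    }
    for (void* p : ptrs) std::memset(p, 1, block_size);
    for (void* p : ptrs) std::free(p);
}

// Run `total_tasks` tasks split round-robin across `threads` workers and
// return the wall-clock time in seconds.
inline double timed_run(int threads, int total_tasks,
                        std::size_t block_size, int count) {
    assert(threads > 0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(threads));
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([=] {
            for (int task = t; task < total_tasks; task += threads) {
                alloc_fill_free(block_size, count);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

To mirror the measurement described above, call timed_run(threads, 50, block_size, 100) for threads = 2, 4, ..., 28 and block sizes of 1 MB and 8 MB, then compare how the elapsed time changes with the thread count on each OS.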

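We made sure all Windows updates were applied, updated the drivers, and tweaked BIOS settings, without success. We also ran the same benchmark on several other hardware platforms, and all of them gave a similar curve for Windows 10, so this appears to be a Windows 10 problem.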

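Does anyone have similar experience, or know what is going on here (maybe we missed something)? This behavior has caused a significant performance hit in our multithreaded application.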

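*Edit

Using https://github.com/google/UIforETW (thanks to Bruce) to profile the benchmark, we found that most of the time is spent inside the kernel in KiPageFault. Digging further down the call tree, everything leads to ExpWaitForSpinLockExclusiveAndAcquire. Lock contention appears to be causing the problem.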
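Page faults like these are taken the first time each page of a fresh allocation is touched. The following is a minimal sketch, not part of the original benchmark, that times a first and a second pass over the same buffer so the first-touch cost can be seen in isolation. The names TouchTimes, touch_pages and time_two_passes are hypothetical, the 4096-byte page size is an assumption, and the first pass is only expected to be slower when the allocator hands back freshly mapped pages (typical for large blocks, but not guaranteed).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

struct TouchTimes {
    double first_pass_s;   // includes any page faults taken on first touch
    double second_pass_s;  // pages are already resident at this point
};

// Write one byte per page; volatile keeps the compiler from dropping the stores.
inline void touch_pages(volatile unsigned char* p, std::size_t bytes,
                        std::size_t page) {
    for (std::size_t off = 0; off < bytes; off += page) p[off] = 1;
}

// Allocate `bytes`, touch every page twice, and time each pass separately.
// `page` defaults to 4096 bytes, which is an assumption about the platform.
inline TouchTimes time_two_passes(std::size_t bytes, std::size_t page = 4096) {
    assert(bytes > 0 && page > 0);
    using steady = std::chrono::steady_clock;
    unsigned char* p = static_cast<unsigned char*>(std::malloc(bytes));
    if (p == nullptr) return TouchTimes{-1.0, -1.0};
    const auto t0 = steady::now();
    touch_pages(p, bytes, page);
    const auto t1 = steady::now();
    touch_pages(p, bytes, page);
    const auto t2 = steady::now();
    std::free(p);
    return TouchTimes{std::chrono::duration<double>(t1 - t0).count(),
                      std::chrono::duration<double>(t2 - t1).count()};
}
```

Running time_two_passes concurrently from many threads is one way to check whether the first-pass time, rather than the second, is what grows with the thread count.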

*Edit

Collected Server 2012 R2 data on the same hardware. Server 2012 R2 is also worse than Win7, but still much better than Win10 CU.

*Edit

This also happens on Server 2016. I added the windows-server-2016 tag.

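*Edit

Using the information from @Ext3h, I modified the benchmark to use VirtualAlloc and VirtualLock. I can confirm a significant improvement compared to not using VirtualLock. Overall, Win10 is still 30% to 40% slower than Win7 when both use VirtualAlloc and VirtualLock.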
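The modified benchmark itself is not shown, so the following is only a sketch of what a VirtualAlloc plus VirtualLock allocation could look like. It follows the lock-then-unlock pattern of the ThreadAlloc class in the answer below. The names alloc_prefaulted and free_prefaulted are hypothetical, and the non-Windows branch is a stand-in that exists only so the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <Windows.h>
#endif

// Allocate `size` bytes and try to make all pages resident up front.
inline void* alloc_prefaulted(std::size_t size) {
    assert(size > 0);
#ifdef _WIN32
    void* p = VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p == nullptr) return nullptr;
    // VirtualLock faults the whole range in with a single call. It fails when the
    // range exceeds what the process working-set minimum allows (see
    // SetProcessWorkingSetSize); the pages are then faulted in lazily as usual.
    if (VirtualLock(p, size)) {
        VirtualUnlock(p, size);  // keep the pages pageable afterwards
    }
    return p;
#else
    // Stand-in for other platforms: touch one byte per assumed 4096-byte page.
    unsigned char* p = static_cast<unsigned char*>(std::malloc(size));
    if (p == nullptr) return nullptr;
    volatile unsigned char* vp = p;
    for (std::size_t off = 0; off < size; off += 4096) vp[off] = 0;
    return p;
#endif
}

// Release memory obtained from alloc_prefaulted.
inline void free_prefaulted(void* p) {
#ifdef _WIN32
    if (p != nullptr) VirtualFree(p, 0, MEM_RELEASE);
#else
    std::free(p);
#endif
}
```

On Windows the working-set limits usually have to be raised first, as the answer's main() does with SetProcessWorkingSetSize, or VirtualLock will fail for buffers of this size.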

windows

memory-management

windows-7

windows-10

windows-server-2016

2 Answers

Stack Overflow user

Accepted answer

Posted 2017-11-14 02:36:43

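Microsoft seems to have fixed this issue in the Windows 10 Fall Creators Update (FCU) and Windows 10 Pro for Workstations (WKS).

Here is the updated graph.

Win 10 FCU and WKS have lower overhead than Win 7. In exchange, VirtualLock seems to have higher overhead.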

Votes: 9

Stack Overflow user

Posted 2017-07-30 20:56:13

Unfortunately not an answer, just some additional insight.

A small experiment with a different allocation strategy:

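#include <Windows.h>

#include <thread>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>      // std::vector
#include <functional>  // std::function
#include <cstring>     // std::memset, std::memcpy
#include <atomic>
#include <iostream>
#include <chrono>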

class AllocTest
{
public:
    virtual void* Alloc(size_t size) = 0;
    virtual void Free(void* allocation) = 0;
};

class BasicAlloc : public AllocTest
{
public:
    void* Alloc(size_t size) override {
        return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }
    void Free(void* allocation) override {
        VirtualFree(allocation, 0, MEM_RELEASE); // dwSize is a SIZE_T and must be 0 with MEM_RELEASE
    }
};

class ThreadAlloc : public AllocTest
{
public:
    ThreadAlloc() {
        t = std::thread([this]() {
            std::unique_lock<std::mutex> qlock(this->qm);
            do {
                this->qcv.wait(qlock, [this]() {
                    return shutdown || !q.empty();
                });
                {
                    std::unique_lock<std::mutex> rlock(this->rm);
                    while (!q.empty())
                    {
                        q.front()();
                        q.pop();
                    }
                }
                rcv.notify_all();
            } while (!shutdown);
        });
    }
    ~ThreadAlloc() {
        {
            std::unique_lock<std::mutex> lock1(this->rm);
            std::unique_lock<std::mutex> lock2(this->qm);
            shutdown = true;
        }
        qcv.notify_all();
        rcv.notify_all();
        t.join();
    }
    void* Alloc(size_t size) override {
        void* target = nullptr;
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([this, &target, size]() {
                target = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
                VirtualLock(target, size);
                VirtualUnlock(target, size);
            });
        }
        qcv.notify_one();
        {
            std::unique_lock<std::mutex> lock(this->rm);
            rcv.wait(lock, [&target]() {
                return target != nullptr;
            });
        }
        return target;
    }
    void Free(void* allocation) override {
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([allocation]() {
                VirtualFree(allocation, 0, MEM_RELEASE); // dwSize is a SIZE_T and must be 0 with MEM_RELEASE
            });
        }
        qcv.notify_one();
    }
private:
    std::queue<std::function<void()>> q;
    std::condition_variable qcv;
    std::condition_variable rcv;
    std::mutex qm;
    std::mutex rm;
    std::thread t;
    std::atomic_bool shutdown{false}; // brace-init: atomics are not copyable before C++17
};

int main()
{
    SetProcessWorkingSetSize(GetCurrentProcess(), size_t(4) * 1024 * 1024 * 1024, size_t(16) * 1024 * 1024 * 1024);

    BasicAlloc alloc1;
    ThreadAlloc alloc2;

    AllocTest *allocator = &alloc2;
    const size_t buffer_size =1*1024*1024;
    const size_t buffer_count = 10*1024;
    const unsigned int thread_count = 32;

    std::vector<void*> buffers;
    buffers.resize(buffer_count);
    std::vector<std::thread> threads;
    threads.resize(thread_count);
    void* reference = allocator->Alloc(buffer_size);

    std::memset(reference, 0xaa, buffer_size);

    auto func = [&buffers, allocator, buffer_size, buffer_count, reference, thread_count](int thread_id) {
        for (int i = thread_id; i < buffer_count; i+= thread_count) {
            buffers[i] = allocator->Alloc(buffer_size);
            std::memcpy(buffers[i], reference, buffer_size);
            allocator->Free(buffers[i]);
        }
    };

    for (int i = 0; i < 10; i++)
    {
        std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
        for (int t = 0; t < thread_count; t++) {
            threads[t] = std::thread(func, t);
        }
        for (int t = 0; t < thread_count; t++) {
            threads[t].join();
        }
        std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        std::cout << duration << std::endl;
    }


    DebugBreak();
    return 0;
}

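Under all sane conditions, BasicAlloc is faster, just as it should be. In fact, on a quad-core CPU (no HT), there is no configuration in which ThreadAlloc outperforms it. ThreadAlloc is consistently around 30% slower. (That is surprisingly little, and it holds even for tiny 1 kB allocations!)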

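However, once the CPU has around 8-12 virtual cores, it eventually reaches the point where BasicAlloc actually scales negatively, while ThreadAlloc just "stalls" at the baseline overhead of soft faults.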

If you profile the two allocation strategies, you can see that for low thread counts, the time spent in KiPageFault shifts from memcpy with BasicAlloc to VirtualLock with ThreadAlloc.

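For higher thread and core counts, ExpWaitForSpinLockExclusiveAndAcquire eventually starts to emerge, going from virtually zero load to as much as 50% with BasicAlloc, while ThreadAlloc only carries the constant overhead of KiPageFault itself.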

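That said, the stall with ThreadAlloc is also pretty bad. No matter how many cores or NUMA nodes the system has, you are currently hard-capped at around 5-8 GB/s in new allocations, across all processes in the system, limited solely by single-thread performance. All the dedicated memory management thread achieves is not wasting CPU cycles on a contended critical section.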
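To put the 5-8 GB/s cap into per-fault terms, here is a small back-of-the-envelope sketch. It assumes 4 KiB pages, one soft fault per newly touched page, and GB meaning 10^9 bytes. None of these assumptions is stated in the answer, and the function names are hypothetical.

```cpp
#include <cassert>

// Number of page faults per second implied by an allocation bandwidth,
// assuming every page of `page_bytes` costs exactly one fault.
inline double faults_per_second(double bytes_per_second,
                                double page_bytes = 4096.0) {
    assert(page_bytes > 0.0);
    return bytes_per_second / page_bytes;
}

// Average time budget per fault, in nanoseconds, for a single serving thread.
inline double nanoseconds_per_fault(double bytes_per_second,
                                    double page_bytes = 4096.0) {
    assert(bytes_per_second > 0.0);
    return 1e9 / faults_per_second(bytes_per_second, page_bytes);
}

// Under the assumptions above:
//   5e9 bytes/s -> about 1.22 million faults/s, about 819 ns per fault
//   8e9 bytes/s -> about 1.95 million faults/s, 512 ns per fault
```

Under these assumptions, the cap corresponds to roughly half a microsecond to just under a microsecond of single-threaded work per faulted page.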

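You would have expected Microsoft to have a lock-free strategy for assigning pages on different cores, but apparently that is not even remotely the case.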

The spin lock was already present in the Windows 7 and earlier implementations of KiPageFault. So what changed?

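Simple answer: KiPageFault itself became much slower. I have no clue what exactly caused the slowdown, but the spin lock simply never became an obvious limit before, because 100% contention was never possible.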

If someone wishes to disassemble KiPageFault to find the most expensive part, be my guest.

Votes: 4

Page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/45024029
