Q: C++ threaded application runs slower than the non-threaded version

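Stack Overflow user

Asked 2016-05-15 20:36:15

4 answers · 1K views · 0 followers · 4 votes

I am currently writing a prime number generator in C++. I wrote a single-threaded version first and then a multi-threaded one.

I found that when my program generates values below 100'000, the single-threaded version is faster than the multi-threaded one. Obviously I am doing something wrong.

My code is below: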

#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <mutex>
#include <shared_mutex>

using namespace std;

set<unsigned long long> primeContainer;
shared_mutex m;

void checkPrime(const unsigned long long p)
{
    if (p % 3 == 0)
        return;

    bool isPrime = true;
    for (set<unsigned long long>::const_iterator it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
    {
        if (p % *it == 0)
        {
            isPrime = false;
            break;
        }
        if (*it * *it > p) // check only up to square root
            break;
    }

    if (isPrime)
        primeContainer.insert(p);
}

void checkPrimeLock(const unsigned long long p)
{
    if (p % 3 == 0)
        return;

    bool isPrime = true;
    try
    {
        shared_lock<shared_mutex> l(m);
        for (set<unsigned long long>::const_iterator it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
        {
            if (p % *it == 0)
            {
                isPrime = false;
                break;
            }
            if (*it * *it > p)
                break;
        }
    }
    catch (exception& e)
    {
        cout << e.what() << endl;
        system("pause");
    }

    if (isPrime)
    {
        try
        {
            unique_lock<shared_mutex> l(m);
            primeContainer.insert(p);
        }
        catch (exception& e)
        {
            cout << e.what() << endl;
            system("pause");
        }
    }
}

void runLoopThread(const unsigned long long& l)
{
    for (unsigned long long i = 10; i < l; i += 10)
    {
        thread t1(checkPrimeLock, i + 1);
        thread t2(checkPrimeLock, i + 3);
        thread t3(checkPrimeLock, i + 7);
        thread t4(checkPrimeLock, i + 9);
        t1.join();
        t2.join();
        t3.join();
        t4.join();
    }
}

void runLoop(const unsigned long long& l)
{
    for (unsigned long long i = 10; i < l; i += 10)
    {
        checkPrime(i + 1);
        checkPrime(i + 3);
        checkPrime(i + 7);
        checkPrime(i + 9);
    }
}

void printPrimes(const unsigned long long& l)
{
    if (1U <= l)
        cout << "1 ";
    if (2U <= l)
        cout << "2 ";
    if (3U <= l)
        cout << "3 ";
    if (5U <= l)
        cout << "5 ";

    for (auto it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
    {
        if (*it <= l)
            cout << *it << " ";
    }
    cout << endl;
}

void writeToFile(const unsigned long long& l)
{
    string name = "primes_" + to_string(l) + ".txt";
    ofstream f(name);

    if (f.is_open())
    {
        if (1U <= l)
            f << "1 ";
        if (2U <= l)
            f << "2 ";
        if (3U <= l)
            f << "3 ";
        if (5U <= l)
            f << "5 ";

        for (auto it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
        {
            if (*it <= l)
                f << *it << " ";
        }
    }
    else
    {
        cout << "Error opening file." << endl;
        system("pause");
    }
}

int main()
{
    unsigned int n = thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported." << endl;

    unsigned long long limit;
    cout << "Please enter the limit of prime generation: ";
    cin >> limit;

    primeContainer.insert(7);

    if (10 < limit)
    {
        //runLoop(limit); //single-threaded
        runLoopThread(limit); //multi-threaded
    }

    printPrimes(limit);
    //writeToFile(limit);
    system("pause");
    return 0;
}

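In the main function you will see comments indicating which call is the single-threaded one and which is the multi-threaded one.

The main difference between them is the use of locks: a shared lock for iterating over the container and a unique lock for insertion. In case it matters, my CPU has 4 cores.

Why is the single-threaded version faster?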

c++

multithreading

c++11

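4 Answers

Stack Overflow user

Accepted answer

Posted 2016-05-15 21:00:03

It looks to me like you are starting a new thread for every single prime check. That is not good IMHO, because thread startup/shutdown plus the synchronization is added to the cost of checking each number. Starting a thread can be quite slow.

I would recommend starting those 4 threads outside the main for loop and processing 1/4 of the range in each thread. This may need some additional synchronization, though, because to check a number the code above evidently first needs all the primes up to its square root to be available.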
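One way to sidestep that dependency, sketched below under my own assumptions (this is not code from the question): first compute the primes up to sqrt(limit) on a single thread, which is cheap, and then let each long-lived thread test its own quarter of the range against that read-only list. The names `isPrimeAgainst` and `primesUpTo` are hypothetical, and the sketch assumes `threadCount > 0`.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Trial division against a read-only list of base primes.
// Correct for any n whose square root is covered by basePrimes.
bool isPrimeAgainst(std::uint64_t n, const std::vector<std::uint64_t>& basePrimes)
{
    if (n < 2) return false;
    for (std::uint64_t p : basePrimes) {
        if (p * p > n) break;          // no divisor found up to sqrt(n)
        if (n % p == 0) return false;
    }
    return true;
}

// Hypothetical helper: all primes <= limit, using threadCount long-lived threads.
std::vector<std::uint64_t> primesUpTo(std::uint64_t limit, unsigned threadCount = 4)
{
    // Step 1 (single-threaded): primes up to sqrt(limit).
    std::vector<std::uint64_t> basePrimes;
    for (std::uint64_t n = 2; n * n <= limit; ++n)
        if (isPrimeAgainst(n, basePrimes))
            basePrimes.push_back(n);

    // Step 2: each thread owns one contiguous slice of [2, limit]
    // and writes only to its own output vector, so no lock is needed.
    std::vector<std::vector<std::uint64_t>> found(threadCount);
    std::vector<std::thread> workers;
    const std::uint64_t span = limit >= 2 ? limit - 1 : 0;  // count of numbers in [2, limit]
    const std::uint64_t slice = span / threadCount + 1;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::uint64_t begin = 2 + t * slice;
        const std::uint64_t end = std::min(begin + slice, limit + 1);  // exclusive
        workers.emplace_back([&basePrimes, &found, t, begin, end] {
            for (std::uint64_t n = begin; n < end; ++n)
                if (isPrimeAgainst(n, basePrimes))
                    found[t].push_back(n);
        });
    }
    for (auto& w : workers) w.join();

    // Slices are contiguous and ascending, so concatenation keeps the order.
    std::vector<std::uint64_t> all;
    for (const auto& part : found)
        all.insert(all.end(), part.begin(), part.end());
    return all;
}
```

Each thread is created once and joined once, so the startup cost is paid 4 times in total rather than 4 times per group of ten numbers.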

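From my point of view, it might be easier to use the Sieve of Eratosthenes, which is probably much easier to parallelize without any locking (although it can still suffer from the problem known as "false sharing").

EDIT

Here is a version I quickly put together using the Sieve of Eratosthenes: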

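#include <cmath>   // sqrt
#include <thread>
#include <vector>

// Assumption: cpuCount is not defined in the original snippet; the detected
// hardware thread count is used here (hardware_concurrency() may return 0).
const unsigned long long cpuCount = thread::hardware_concurrency();

void processSieve(const unsigned long long& l,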
    const unsigned long long& start,
    const unsigned long long& end,
    const unsigned long long& step,
    vector<char> &is_prime)
{
    for (unsigned long long i = start; i <= end; i += step)
        if (is_prime[i])
            for (unsigned long long j = i + i; j <= l; j += i)
                is_prime[j] = 0;
}

void runSieve(const unsigned long long& l)
{
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);
    processSieve(l, 2, end, 1, is_prime);
    primeContainer.clear();
    for (unsigned long long i = 1; i <= l; ++i)
        if (is_prime[i])
            primeContainer.insert(i);
}

void runSieveThreads(const unsigned long long& l)
{
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);
    vector<thread> threads;
    threads.reserve(cpuCount);
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads.emplace_back(processSieve, l, 2 + i, end, cpuCount, ref(is_prime));
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads[i].join();
    primeContainer.clear();
    for (unsigned long long i = 1; i <= l; ++i)
        if (is_prime[i])
            primeContainer.insert(i);
}

Measured results for limits up to 1 000 000 (MSVC 2013, Release build):

runLoop: 204.02 ms
runLoopThread: 43947.4 ms
runSieve: 30.003 ms
runSieveThreads (8 cores): 24.0024 ms

Up to 10 000 000:

runLoop: 4387.44 ms
// runLoopThread disabled, taking too long
runSieve: 350.035 ms
runSieveThreads (8 cores): 285.029 ms

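The times include the final pass over the vector and pushing the results into the primes set.

As you can see, the sieve version is much faster than yours even when single-threaded (for your mutex version I had to change the lock to a regular mutex lock, because MSVC 2013 does not have shared_lock, so that result is probably much worse than yours).

But you can also see that the multi-threaded sieve still does not run as fast as expected (8 cores, i.e. 8 threads, so a linear speedup would be 8x over a single thread), even though there is no locking. A thread may needlessly process some numbers that another thread has not yet marked as "not prime", but the result should generally be stable, because every write only sets a value to 0, so it does not matter if several threads set it at the same time. The speedup is most likely not linear because of the "false sharing" problem mentioned earlier: the threads writing the zeros invalidate each other's cache lines.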
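A common way to reduce that kind of cache-line contention is to give each thread its own block of the range and a private flag array, so that no two threads ever write to the same memory. The following is only a sketch of that idea (a blocked, or segmented, sieve), not the code that was measured above; `smallSieve` and `countPrimesBlocked` are hypothetical names, and `threadCount > 0` is assumed.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Plain sieve over [0, n]; used only to obtain the base primes <= sqrt(limit).
std::vector<std::uint64_t> smallSieve(std::uint64_t n)
{
    std::vector<char> composite(n + 1, 0);
    std::vector<std::uint64_t> primes;
    for (std::uint64_t i = 2; i <= n; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        for (std::uint64_t j = i * i; j <= n; j += i)
            composite[j] = 1;
    }
    return primes;
}

// Hypothetical helper: counts primes <= limit. Each thread sieves a private block.
std::uint64_t countPrimesBlocked(std::uint64_t limit, unsigned threadCount)
{
    if (limit < 2) return 0;
    std::uint64_t root = 1;
    while ((root + 1) * (root + 1) <= limit) ++root;   // integer square root
    const std::vector<std::uint64_t> base = smallSieve(root);

    std::vector<std::uint64_t> counts(threadCount, 0);
    std::vector<std::thread> workers;
    const std::uint64_t blockSize = (limit - 1) / threadCount + 1;  // blocks cover [2, limit]
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            const std::uint64_t lo = 2 + t * blockSize;
            if (lo > limit) return;                      // nothing left for this thread
            const std::uint64_t hi = std::min(lo + blockSize - 1, limit);
            std::vector<char> composite(hi - lo + 1, 0); // private to this thread
            for (std::uint64_t p : base) {
                if (p * p > hi) break;
                // First multiple of p inside the block, but never below p*p.
                const std::uint64_t first = std::max(p * p, (lo + p - 1) / p * p);
                for (std::uint64_t j = first; j <= hi; j += p)
                    composite[j - lo] = 1;
            }
            std::uint64_t local = 0;
            for (char c : composite) local += (c == 0);
            counts[t] = local;                           // one shared write per thread
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (std::uint64_t c : counts) total += c;
    return total;
}
```

For example, `countPrimesBlocked(100, 4)` returns 25. Whether this actually scales better than the strided version depends on the machine and the block size, so it should be measured rather than assumed.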

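Votes: 4

Stack Overflow user

Posted 2016-05-15 20:59:32

You have a few issues.

First, you are constantly creating and destroying threads unnecessarily. Have each thread loop, doing work until there is no more work to do.

Second, your locks are way too fine-grained, and as a result you acquire them far too often. Have each thread grab a block of 100 numbers to test rather than one at a time, and have it insert the primes found in each block all at once.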
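A minimal sketch of both points, under my own assumptions: the workers claim blocks through an atomic counter, and each worker tests numbers by plain trial division instead of reading the shared set, so the only lock left is the one taken once per block for insertion. `isPrime` and `primesChunked` are hypothetical names, not code from the question.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Trial division by odd numbers; touches no shared state.
bool isPrime(std::uint64_t n)
{
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (std::uint64_t d = 3; d * d <= n; d += 2)
        if (n % d == 0) return false;
    return true;
}

// Hypothetical helper: primes <= limit, found by threadCount workers that
// each claim `chunk` consecutive numbers at a time.
std::set<std::uint64_t> primesChunked(std::uint64_t limit, unsigned threadCount,
                                      std::uint64_t chunk = 100)
{
    std::set<std::uint64_t> primes;
    std::mutex primesMutex;
    std::atomic<std::uint64_t> next{2};   // first number of the next unclaimed block

    auto worker = [&] {
        std::vector<std::uint64_t> local;
        for (;;) {
            const std::uint64_t begin = next.fetch_add(chunk);  // claim a block
            if (begin > limit) break;                            // no work left
            const std::uint64_t end = std::min(begin + chunk - 1, limit);
            local.clear();
            for (std::uint64_t n = begin; n <= end; ++n)
                if (isPrime(n)) local.push_back(n);
            // One lock acquisition per block instead of one per number.
            std::lock_guard<std::mutex> lock(primesMutex);
            primes.insert(local.begin(), local.end());
        }
    };

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
    return primes;
}
```

The block size of 100 is taken from the suggestion above; a larger block means fewer lock acquisitions but coarser load balancing.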

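Votes: 10

Stack Overflow user

Posted 2016-05-16 10:35:57

Since the comment section is getting a bit crowded and the OP expressed interest in a lock-free solution, here is an example of such an approach (semi-pseudocode):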

vector<uint64_t> primes_thread1;
vector<uint64_t> primes_thread2;
...

// check all numbers in [start, end)
void check_primes(uint64_t start, uint64_t end, vector<uint64_t> & out) {
    for (auto i = start; i < end; ++i) {
        if (is_prime(i)) { // simply loop through all odds from 3 to sqrt(i)
            out.push_back(i);
        }
    }
}

auto f1 = async(check_primes, 1, 1000'000, ref(primes_thread1));
auto f2 = async(check_primes, 1000'000, 2000'000, ref(primes_thread2));
...

f1.wait();
f2.wait();
...

// append each thread's results so the combined list stays in ascending order
primes_thread1.insert(
    primes_thread1.end(),
    primes_thread2.cbegin(), primes_thread2.cend()
);
primes_thread1.insert(
    primes_thread1.end(),
    primes_thread3.cbegin(), primes_thread3.cend()
);
...
// primes_thread1 contains all primes found in all threads

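Obviously this can be refactored nicely by parameterizing the number of threads and the size of each range. I wrote it out verbosely to (hopefully) make clearer the concept of avoiding locks by not sharing any state.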
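For illustration, a parameterized variant could look roughly like the sketch below. It is an assumption-laden rewrite rather than the exact code above: `find_primes` is a hypothetical wrapper, each task returns its own vector instead of filling an output parameter, `tasks > 0` is assumed, and `std::launch::async` is passed explicitly so that every range really runs on its own thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>

// Simply loops through all odd numbers from 3 up to sqrt(n).
bool is_prime(std::uint64_t n)
{
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (std::uint64_t d = 3; d * d <= n; d += 2)
        if (n % d == 0) return false;
    return true;
}

// Checks all numbers in [start, end) and returns the primes found.
std::vector<std::uint64_t> check_primes(std::uint64_t start, std::uint64_t end)
{
    std::vector<std::uint64_t> out;
    for (auto i = start; i < end; ++i)
        if (is_prime(i)) out.push_back(i);
    return out;
}

// Hypothetical wrapper: primes in [0, limit), split into `tasks` equal ranges.
std::vector<std::uint64_t> find_primes(std::uint64_t limit, unsigned tasks)
{
    const std::uint64_t step = limit / tasks + 1;
    std::vector<std::future<std::vector<std::uint64_t>>> futures;
    for (unsigned t = 0; t < tasks; ++t) {
        const std::uint64_t start = std::min<std::uint64_t>(t * step, limit);
        const std::uint64_t end = std::min<std::uint64_t>(start + step, limit);
        futures.push_back(std::async(std::launch::async, check_primes, start, end));
    }

    // Ranges are ascending, so appending in task order keeps the result sorted.
    std::vector<std::uint64_t> all;
    for (auto& f : futures) {
        const auto part = f.get();   // blocks until that task has finished
        all.insert(all.end(), part.begin(), part.end());
    }
    return all;
}
```

No state is shared while the tasks run; the results only meet after `get()` returns, so no mutex is required.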

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/37243424
