文章/答案/技术大牛

发布

社区首页 >问答首页 >在对数据集执行操作时，for循环的性能含义是什么？

问在对数据集执行操作时，for循环的性能含义是什么？
EN

Stack Overflow用户

提问于 2020-07-23 08:55:58

回答 1查看 111关注 0票数 0

考虑正在完成的工作与这两种代码变体之间的时间之间的差异：

任务： A[i] *= (C[i] + D[i]) AND B[i] *= (D[i] - C[i])

Test案例1:使用向量的向量使用2,000,000元素，其中向量对象在stack
Test上.2:使用向量的使用2,000,000元素，其中向量位于动态内存

中

Case1: -Single for循环对两个对象依次执行操作-

用于(i~ n)：Ai *= (Ci + Di) Bi *= (Di - Ci)

Case2: -Two为循环独立地执行对每个对象的操作-

for(i至n)：Ai *= (Ci + Di) for(i to n)：Ai *= (Di - Ci)

代码： -Test Case 1:堆栈上的向量-

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <exception>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <limits>
#include <random>
#include <sstream>
#include <string_view>
#include <utility>
#include <vector>

template<class Resolution = std::chrono::milliseconds>
class ExecutionTimer {
public:
    using Clock = std::conditional_t<std::chrono::high_resolution_clock::is_steady,
        std::chrono::high_resolution_clock,
        std::chrono::steady_clock>;
private:
    const Clock::time_point mStart = Clock::now();

public:
    ExecutionTimer() = default;
    ~ExecutionTimer() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Destructor Elapsed: "
            << std::chrono::duration_cast<Resolution>(end - mStart).count()
            << std::endl;
        std::cout << strStream.str() << std::endl;
    }

    inline void stop() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Stop Elapsed: "
            << std::chrono::duration_cast<Resolution>(end - mStart).count()
            << std::endl;
        std::cout << strStream.str() << std::endl;
    }
};

template<typename T>
void randomly_populate_vector(std::vector<T>& vec, std::mt19937& gen, T lower, T upper) {
    std::uniform_int_distribution<T> dist{ lower, upper };
    for (auto& e : vec) { e = dist(gen); }
}

// this function accounts for 0 base indexing and check max bounds... 
// if user enters 1 for lower it will index location 0 in the vector
template<typename T>
void view_vector(const std::vector<T> vec, std::string_view name, size_t lower = 0, size_t upper = std::numeric_limits<size_t>::max()) {
    if (lower == 0) lower = 1;
    if (upper > vec.size()) upper = vec.size();
 
    std::cout << name << " { ";
    for (size_t i = lower - 1; i < upper; i++) {
        std::cout << std::setw(3) << vec[i] << ", ";
    }
    std::cout << "... }\n";
}

int main() {
    try {
        std::random_device rd;
        std::mt19937 gen{ rd() };
        const std::size_t vec_size = 20000000;
        constexpr uint32_t lower = 1, upper = 100;

        std::vector<uint32_t> A( vec_size, 0 );
        randomly_populate_vector<uint32_t>(A, gen, lower, upper);

        std::vector<uint32_t> B( vec_size, 0 );
        randomly_populate_vector<uint32_t>(B, gen, lower, upper);

        std::vector<uint32_t> C( vec_size, 0 );
        randomly_populate_vector<uint32_t>(C, gen, lower, upper);

        std::vector<uint32_t> D( vec_size, 0 ); 
        randomly_populate_vector<uint32_t>(D, gen, lower, upper);

        // Just a sanity check to make sure we have populated arrays with different values...
        view_vector(A, "A", 1, 20);
        view_vector(B, "B", 1, 20);
        view_vector(C, "C", 1, 20);
        view_vector(D, "D", 1, 20);        

        std::cout << std::endl;

        // Test Case 1:      
        std::cout << "Test Case 1 Time Profile:\n";
        {
            // Time Profile:
            ExecutionTimer<std::chrono::milliseconds> timer;
            // Test Case 1: 
            for (uint32_t i = 0; i < vec_size; i++) {
                A[i] *= (C[i] + D[i]);
                B[i] *= (D[i] - C[i]);
            }
            timer.stop();
        }        
               
        // Test Case 2:
        std::cout << "Test Case 2 Time Profile:\n";
        {
            // Time Profile:
            ExecutionTimer<std::chrono::milliseconds> timer;
            for (uint32_t i = 0; i < vec_size; i++) {
                A[i] *= (C[i] + D[i]);
            }
            for (uint32_t i = 0; i < vec_size; i++) {
                B[i] *= (D[i] - C[i]);
            }
            timer.stop();
        }

        std::cout << std::endl;
        view_vector(A, "A", 1, 20);
        view_vector(B, "B", 2, 20);

    }
    catch (const std::exception& e) {
        std::cerr << e.what() << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

输出

A {  32,  21,  90,  61,   5,  60,  49,  93,   7,  53,  57,  99,  52,  35,  65,   1,  12,  87,  62,  14, ... }
B {   1,  38,  12,   7,   4,  77,  14,  36,  93,  32,  11,  76,  54,  79,  53,  53,   8,  87,  51,  16, ... }
C {   6,  37,  97,  87,   2,   5,  38,  65,  57,  87,  36,  48,  97,  74,  35,  97,  98,   7,  65,  79, ... }
D {  54,   4,  96,  56,  24,  21,  25,  84,  93,  43,  89,  42,  64,  84,  45,  77,  60,  67,  89,  13, ... }

Test Case 1 Time Profile:
Stop Elapsed: 43014

Destructor Elapsed: 43014

Test Case 2 Time Profile:
Stop Elapsed: 43251

Destructor Elapsed: 43251


A { 115200, 35301, 3352410, 1247389, 3380, 40560, 194481, 2064693, 157500, 895700, 890625, 801900, 1347892, 873740, 4160
00, 30276, 299568, 476412, 1470392, 118496, ... }
B { 41382,  12, 6727, 1936, 19712, 2366, 12996, 120528, 61952, 30899, 2736, 58806, 7900, 5300, 21200, 11552, 313200, 293
76, 69696, ... }

代码 -Test Case 2:动态内存中的向量

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <exception>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <limits>
#include <memory>
#include <random>
#include <sstream>
#include <string_view>
#include <utility>
#include <vector>

template<class Resolution = std::chrono::milliseconds>
class ExecutionTimer {
public:
    using Clock = std::conditional_t<std::chrono::high_resolution_clock::is_steady,
        std::chrono::high_resolution_clock,
        std::chrono::steady_clock>;
private:
    const Clock::time_point mStart = Clock::now();

public:
    ExecutionTimer() = default;
    ~ExecutionTimer() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Destructor Elapsed: "
            << std::chrono::duration_cast<Resolution>(end - mStart).count()
            << std::endl;
        std::cout << strStream.str() << std::endl;
    }

    inline void stop() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Stop Elapsed: "
            << std::chrono::duration_cast<Resolution>(end - mStart).count()
            << std::endl;
        std::cout << strStream.str() << std::endl;
    }
};

template<typename T>
void randomly_populate_vector(std::vector<T>& vec, std::mt19937& gen, T lower, T upper) {
    std::uniform_int_distribution<T> dist{ lower, upper };
    for (auto& e : vec) { e = dist(gen); }
}

// this function accounts for 0 base indexing and check max bounds... 
// if user enters 1 for lower it will index location 0 in the vector
template<typename T>
void view_vector(const std::vector<T> vec, std::string_view name, size_t lower = 0, size_t upper = std::numeric_limits<size_t>::max()) {
    if (lower == 0) lower = 1;
    if (upper > vec.size()) upper = vec.size();
 
    std::cout << name << " { ";
    for (size_t i = lower - 1; i < upper; i++) {
        std::cout << std::setw(3) << vec[i] << ", ";
    }
    std::cout << "... }\n";
}

int main() {
    try {
        std::random_device rd;
        std::mt19937 gen{ rd() };
        const std::size_t vec_size = 20000000;
        constexpr uint32_t lower = 1, upper = 100;

        std::unique_ptr<std::vector<uint32_t>> A( new std::vector<uint32_t>( vec_size, 0 ) );
        randomly_populate_vector<uint32_t>(*A.get(), gen, lower, upper);

        std::unique_ptr<std::vector<uint32_t>> B( new std::vector<uint32_t>(vec_size, 0 ) );
        randomly_populate_vector<uint32_t>(*B.get(), gen, lower, upper);

        std::unique_ptr<std::vector<uint32_t>> C(new std::vector<uint32_t>(vec_size, 0));
        randomly_populate_vector<uint32_t>(*C.get(), gen, lower, upper);

        std::unique_ptr<std::vector<uint32_t>> D(new std::vector<uint32_t>(vec_size, 0));
        randomly_populate_vector<uint32_t>(*D.get(), gen, lower, upper);

        // Just a sanity check to make sure we have populated arrays with different values...
        view_vector(*A.get(), "A", 1, 20);
        view_vector(*B.get(), "B", 1, 20);
        view_vector(*C.get(), "C", 1, 20);
        view_vector(*D.get(), "D", 1, 20);        

        std::cout << std::endl;

        // Test Case 1:      
        std::cout << "Test Case 1 Time Profile:\n";
        {
            // Time Profile:
            ExecutionTimer<std::chrono::milliseconds> timer;
            // Test Case 1: 
            for (uint32_t i = 0; i < vec_size; i++) {
                (*A.get())[i] *= ((*C.get())[i] + (*D.get())[i]);
                (*B.get())[i] *= ((*D.get())[i] - (*C.get())[i]);
            }
            timer.stop();
        }        
               
        // Test Case 2:
        std::cout << "Test Case 2 Time Profile:\n";
        {
            // Time Profile:
            ExecutionTimer<std::chrono::milliseconds> timer;
            for (uint32_t i = 0; i < vec_size; i++) {
                (*A.get())[i] *= ((*C.get())[i] + (*D.get())[i]);
            }
            for (uint32_t i = 0; i < vec_size; i++) {
                (*B.get())[i] *= ((*D.get())[i] - (*C.get())[i]);
            }
            timer.stop();
        }

        std::cout << std::endl;
        view_vector(*A.get(), "A", 1, 20);
        view_vector(*B.get(), "B", 2, 20);

    }
    catch (const std::exception& e) {
        std::cerr << e.what() << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

输出

A {  40,  84,  42,  66,  22,  70,  32,  51,  48,  13,  59,  28,  69,  59,  13,  86,  85,  33,  59,  11, ... }
B {  80,  76,  57,  21,  24,  21,  74,  14,  53,  75,   5,  96,   4,  47,  79,  73,  92,   4,  90,   3, ... }
C {  30,  49,  96,  46,  16,  46,  22,  46,  26,  83,  83,  23,  84,  70,  63,  69,  48,  32,  92,   8, ... }
D {  59,  90,  74,   2,  77,  84,  35,  81,  58,  78,  57,  40,  54,   8,  74,  15,  11,  70,  81,  27, ... }

Test Case 1 Time Profile:
Stop Elapsed: 54735

Destructor Elapsed: 54736

Test Case 2 Time Profile:
Stop Elapsed: 54098

Destructor Elapsed: 54099


A { 316840, 1622964, 1213800, 152064, 190278, 1183000, 103968, 822579, 338688, 336973, 1156400, 111132, 1314036, 358956,
 243997, 606816, 295885, 343332, 1765811, 13475, ... }
B { 127756, 27588, 40656, 89304, 30324, 12506, 17150, 54272, 1875, 3380, 27744, 3600, 180668, 9559, 212868, 125948, 5776, 10890, 1083, ... }

下面的输出是使用VisualStudio2017在没有优化的调试x64中使用x64生成的。

从两个测试用例的样本测试中可以看出，即使两百万个样本在每个测试用例上执行2-3个操作，时间变化也比较接近.乘法、加法、减法和赋值。

时间配置文件在我的英特尔核心2二重奏四核极限与3.0Ghz处理器与8GB的Ram在Windows 7-64位系统.我得到了这么多的价值..。

case1 case2 Stack: 43014 43251 Heap: 54735 54098

在第一个例子中，当向量在堆栈上时，具有多个循环的情况稍微慢一些，但是即使在2000种情况下，每个元素上只有少数几个操作，我们也可以认为这是可以忽略不计的。

即使在第二个例子中，向量本身也在堆上或动态内存中.两个测试用例之间的差异仍然相对相似，但是在这种情况下，使用多个for循环的情况2表现得更好。

我怀疑，如果对这些对象的计算、函数调用或内存操作在对应的for循环中更加复杂，这种趋势会继续下去。

我还没有在发布模式下运行这两种场景，并在.但是，即使在调试模式下，我们也可以通过将正在完成的对象上的工作分离为不同的for循环来提高性能，而不是试图在单个循环中完成所有这些工作。

现在，这些向量中的实际元素只是简单的积分值。想象一下，如果这些是复杂的对象或结构，它们可能正在调用某个函数，并且该函数可能有多个分支，或者这些对象中的一个正在从队列中查询，或者正在等待一个线程.我倾向于认为，随着时间的推移，情况1会变得更糟，随着时间的推移，对象越复杂，容器的大小，以及在同一个循环中处理的其他对象或容器.

有单独的循环似乎有违直觉，但我相信这可能是一种性能上的提高。现在，如果您的数据集很小，而且数据结构很简单，那么这并不一定适用，因为时差几乎可以忽略不计，但是如果您正在对具有复杂操作(如声音处理中所见的)的大型数据集进行数字处理，那么这可能是要考虑的问题！

我已经知道了原因，但我想听听社会各界对此的反应、立场和理解.你觉得根本原因是什么？

编辑

下面是在发布模式下运行的打印结果，并在.

案例1

A {  31,  57,  72,  19,  12,   3,  43,  11,  66,  46,  25,  74,  72,  87,  22,  31,  78,  80,  12,  94, ... }
B {  60,  26,  69,  58,   2,  15,  74, 100,  61,  44,  94,  94,  48,  35,  32,   6,  74,  52,   4,  40, ... }
C {  28,  81,  98,  31,  65,  39,  97,  31,  89,  64,  35,  47,  35,  37,  79,  20,  86,  70,  80,  43, ... }
D {  46,  36,  69,  35,  24,  62,  11,  41,  36,  17,  91,  34,  94,  70,   7,  66,  32,  98,  18,  21, ... }

Test Case 1 Time Profile:
Stop Elapsed: 144

Destructor Elapsed: 144

Test Case 2 Time Profile:
Stop Elapsed: 106

Destructor Elapsed: 106


A { 169756, 780273, 2008008, 82764, 95052, 30603, 501552, 57024, 1031250, 301806, 396900, 485514, 1198152, 996063, 16271
2, 229276, 1086072, 2257920, 115248, 385024, ... }
B { 52650, 58029, 928, 3362, 7935, 547304, 10000, 171349, 97196, 294784, 15886, 167088, 38115, 165888, 12696, 215784, 40
768, 15376, 19360, ... }

案例2

A {  31,  68,  94,  29,  10,  32,  37,  11,  50,   6,  18,  62,  88,  36,  10,  48,  19,  38,  49,   1, ... }
B {   8,  19,   8,   3,  26,  97,  80,  82,  54,  31,  80,  95,  68,  84,  82,   1,  59,  61,  60,  35, ... }
C {  80,  69,  14,  59,  42,  85,  14,  40,  52,  35,  10,  28,  82,  62,  84,  97,  70,   1,  54,  92, ... }
D {  74,  27,  46,  15,  13,   5,  17,  60,  33,  25,  59,  84,  53,  14,  98,  55,  62,  49,  42,  45, ... }

Test Case 1 Time Profile:
Stop Elapsed: 139

Destructor Elapsed: 139

Test Case 2 Time Profile:
Stop Elapsed: 104

Destructor Elapsed: 104


A { 735196, 626688, 338400, 158804, 30250, 259200, 35557, 110000, 361250, 21600, 85698, 777728, 1603800, 207936, 331240,
 1108992, 331056, 95000, 451584, 18769, ... }
B { 33516, 8192, 5808, 21866, 620800, 720, 32800, 19494, 3100, 192080, 297920, 57188, 193536, 16072, 1764, 3776, 140544,
 8640, 77315, ... }

在发布模式和优化的情况下，我们可以看到以下内容：

case1 case2 Stack: 144 106 Heap: 139 104

现在我们可以看到情况2，独立循环表现得更好。再说一遍，你认为是什么?为什么？

c++

performance

for-loop

performance-testing

nested-loops

回答 1

Stack Overflow用户

发布于 2020-07-23 10:00:40

人们开始在语义学上与我争论，而不是阅读这个问题。是的，我提供的样本集非常小，在这个简单的测试用例中差别很小。

当我说简单时，我指的是容器中的数据类型是简单整数，而不是复杂的结构，这些结构可能调用或不调用其他函数、管理或访问内存、位于优先级队列中、等待线程等。

这里有一个基本的现象，为什么在单个数据集上工作的独立for循环将比在多个数据集上工作的单个for循环执行得更好。

我知道这是为什么，我会给你一个提示或线索。它甚至不涉及或涉及到任何硬件或软件。这里的问题是算法问题。这是一个基于数学物理的问题.

我不会在这里详细说明为什么这是我之前已经说过的，作为对Stackoverflow上的另一个问题的回答。我只是提供我的答案的链接。你可以找到它here！

编辑

给出那篇文章的简短版本。

我们必须考虑它们之间的相对位置。我们必须考虑他们的范围。

在第一种情况下，对多个数据集执行单个for和执行操作：A位于x，而B位于内存中的y，无论它在缓存、内存中还是在其他地方。我们知道x和y之间有一定的距离。

范围内的代码执行必须先执行到A，然后等待C和D完成任务才能将值返回到A，然后执行必须从A到B，然后在C和D上再次等待，以完成任务，将值返回到B。然后在下一次迭代中，它必须从B返回到A。这必须在for循环的每一次迭代中执行。

现在让我们考虑第二种情况，即每个数据集都有一个独立的循环.

对于这个循环的范围，代码执行必须迁移到A，然后必须等待C和D完成它们的任务，才能将它们的值返回给A。代码执行不必从这里移动，因为它已经是@ A了。现在，我们只在循环的每一次迭代中重复这一点。一旦该循环超出范围，执行将继续到下一个循环。它只需从A运行一次到B，然后等待C & D完成任务，将其值返回给A，而且这里也没有移动，因为它已经在A了，我们只是继续循环的迭代。

所以当我们看这两种循环结构时：

for(i to n) {
    A[i] = C[i] op D[i]; // execution has to move from A's location 
    B[i] = C[i] op D[i]; // to B's location and back to A's on every iteration.
}

这将造成瓶颈。当处理本地堆栈上的数据对象和数据集时，以及当它们的值被频繁缓存时，它是边际的，几乎可以忽略不计。然而，当这些结构变得更加复杂，并且内存不再在缓存中(关闭和快速)，并且现在在主内存中，它可能是几百个地址到数十亿个地址之外。最后，这里有一个内插问题，从这个循环中的第一行执行到第二个循环。

这里的时差是线性的，因为它会随着容器的尺寸增长而线性增长，因为它与容器的大小成正比。为了防止这种情况，只需将每个独立数据集移动到它们自己的独立循环和作用域。

for ( i to n ) {
    A[i] = C[i] op D[i] // Travels to A just once, then continues iteration.
}      

for ( i to n ) {
    B[i] = C[i] op D[i] // Travels from A to B just once, then continues iteration.
}

这里的根本问题与硬件和软件毫无关系。他们会玩吗？当然了!但是从算法的数学和实际运动的物理上看，这种插值瓶颈是显而易见的。然而，在拥有现代操作系统和编译器的现代计算机系统中.它们变得如此先进，以至于它们可以在幕后为您生成适当的“组装”指令，以防止或最小化这样的代码结构，但是情况可能并不总是如此！我只是认为，在设计和实现算法时，特别是在库或API代码的情况下，这是值得有意识的事情。您不知道您的用户将如何使用它。

票数 -1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/63050515

复制

相似问题

问在对数据集执行操作时，for循环的性能含义是什么？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在对数据集执行操作时，for循环的性能含义是什么？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在对数据集执行操作时，for循环的性能含义是什么？
EN