Have you examined your memory access pattern itself? It might be inefficient, that is, cache-unfriendly.
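
To make "cache-unfriendly" concrete, here is a minimal sketch, assuming the data is a matrix stored row-major in a single contiguous vector. The function names and the matrix layout are illustrative assumptions, not taken from the question. Both functions return the same sum; only the traversal order differs.

```cpp
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row-major in one contiguous vector.
// The inner loop walks adjacent elements, so each fetched cache line is fully used.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but the inner loop jumps `cols` elements at a time,
// which tends to miss the cache once the matrix is large.
long long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

If the second form resembles your hot loop, reordering the loops or the data layout may matter more than how the elements are indexed.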

Have you tried using a raw pointer for the array accesses?

<pre><code>// regular place

for (std::size_t i = 0; i &lt; arr.size(); ++i)
    wcout &lt;&lt; arr[i];

// In bottleneck

int *pArr = &amp;arr.front();   // assumes arr is a non-empty std::vector&lt;int&gt;

for (std::size_t i = 0; i &lt; arr.size(); ++i)
    wcout &lt;&lt; pArr[i];
</code></pre>

I suspect that gprof prevents functions from being inlined. Try another profiling method. <code>std::vector operator []</code> cannot be the bottleneck because it doesn't differ much from raw array access. The SGI implementation is shown below:

<pre><code>reference operator[](size_type __n) { return *(begin() + __n); }
iterator begin() { return _M_start; }
</code></pre>
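
One way to check this claim on your own compiler is to write the same loop both ways and compare timings or generated assembly. The two functions below are a hypothetical sketch for that comparison, assuming C++11 for `data()` and an optimized build (for example `-O2` with GCC or Clang). With inlining enabled, both loops would normally be expected to compile to very similar code.

```cpp
#include <cstddef>
#include <vector>

// Indexes through std::vector::operator[] on every iteration.
double sum_with_subscript(const std::vector<double>& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

// Hoists the element pointer and the size out of the loop and indexes the pointer directly.
double sum_with_pointer(const std::vector<double>& v) {
    const double* p = v.data();      // safe to call on an empty vector
    const std::size_t n = v.size();
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        total += p[i];
    return total;
}
```

If the two versions differ noticeably in an unoptimized or instrumented build but not in an optimized one, the profile is probably measuring call overhead that disappears once the operator is inlined.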

You cannot trust <code>gprof</code> for profiling high-speed code; instead, use a passive profiling method like <code>oprofile</code> to get the real picture.

As an alternative, you could profile by altering the code manually (e.g. calling a computation 10 times instead of once and checking how much the execution time increases). Note, however, that this will be influenced by cache effects, so YMMV.
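
A minimal sketch of that manual approach, using `std::chrono::steady_clock`. Here `compute` is a hypothetical stand-in for the real computation, and `time_repeats` is an illustrative helper name; neither comes from the question.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the computation being measured.
double compute(const std::vector<double>& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i] * v[i];
    return total;
}

// Runs compute() `repeats` times and returns the elapsed wall-clock time in seconds.
// Results are accumulated into `sink` so the calls are less likely to be optimized away.
double time_repeats(const std::vector<double>& v, int repeats, double& sink) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        sink += compute(v);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing `time_repeats(v, 1, sink)` with `time_repeats(v, 10, sink)` gives a rough per-call cost. The repeated calls run on data that is already in cache, which is the caveat mentioned above.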

The vector class is widely liked and provides a certain amount of convenience at the expense of performance, which is fine when you don't particularly need performance.

If you really need performance, it won't hurt you too much to bypass the vector class and go directly to a simple old hand-made array, whether statically or dynamically allocated. Then 1) the time you currently spend indexing should essentially disappear, speeding up your app by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your app.

EDIT:
Most programs have a lot more room for speedup than you might suppose. I made a <a href="https://sourceforge.net/projects/randompausedemo/files/" rel="nofollow">walk-through project</a> to illustrate this. If I can summarize it really quickly, it goes like this:

<ul>
<li>Original time is 2.7 msec per "job" (the number of "jobs" can be varied to get enough run-time to analyze it).</li>
<li>First cut showed roughly 60% of time was spent in vector operations, including indexing, appending, and removing. I replaced the vector with a similar vector class from MFC, and time decreased to 1.8 msec/job. (That's a 1.5x or 50% speedup.)</li>
<li>Even with that array class, roughly 40% of time was spent in the [] indexing operator. I wanted it to index directly, so I forced it to index directly, not through the operator function. That reduced time to 1.5 msec/job, a 1.2x speedup.</li>
<li>Now roughly 60% of the time is adding/removing items in arrays. An additional fraction was spent in "new" and "delete". I decided to chuck the arrays and do two things. One was to use do-it-yourself linked lists, and to pool used objects. The first reduced time to 1.3 msec (1.15x). The second reduced it to 0.44 msec (2.95x).</li>
<li>Of that time, I found that about 60% of the time was in code I had written to do indexing into the list (as if it were an array). I decided that could be done instead just by having a pointer directly into the list. Result: 0.14 msec (3.14x).</li>
<li>Now I found that nearly all the time was being spent in a line of diagnostic I/O I was printing to the console. I decided to get rid of that: 0.0037 msec (38x).</li>
</ul>
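
The linked-list and pooling step can be sketched roughly as follows. This is not the code from the walk-through project; `Node` and `NodePool` are hypothetical names, and the sketch assumes single-threaded use and that every node is released back to the pool before the pool is destroyed.

```cpp
// Hypothetical list node; `value` stands in for the real payload.
struct Node {
    int value;
    Node* next;
};

// Keeps released nodes on a free list and hands them out again,
// so steady-state acquire/release calls avoid new and delete.
class NodePool {
public:
    NodePool() : free_(nullptr) {}
    NodePool(const NodePool&) = delete;
    NodePool& operator=(const NodePool&) = delete;

    // Frees only the nodes currently sitting in the pool.
    ~NodePool() {
        while (free_ != nullptr) {
            Node* n = free_;
            free_ = n->next;
            delete n;
        }
    }

    // Reuses a pooled node when one is available, otherwise allocates a new one.
    Node* acquire(int value) {
        Node* n = free_;
        if (n != nullptr)
            free_ = n->next;
        else
            n = new Node;
        n->value = value;
        n->next = nullptr;
        return n;
    }

    // Returns a node to the pool instead of deleting it.
    void release(Node* n) {
        n->next = free_;
        free_ = n;
    }

private:
    Node* free_;  // head of the singly linked free list
};
```

Holding a `Node*` to the current position in the list, rather than re-walking it by index, corresponds to the "pointer directly into the list" step.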

I could have kept going, but I stopped.
The overall time per job was reduced by a compounded factor of about 700x.
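
The compounded factor is just the product of the per-step speedups, which equals the first time divided by the last. A small sketch that reproduces the arithmetic from the times quoted in the list (the function names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Per-job times in msec after each step, as quoted in the list above.
std::vector<double> step_times_msec() {
    return {2.7, 1.8, 1.5, 1.3, 0.44, 0.14, 0.0037};
}

// Speedup of each step relative to the step before it.
std::vector<double> step_speedups(const std::vector<double>& t) {
    std::vector<double> s;
    for (std::size_t i = 1; i < t.size(); ++i)
        s.push_back(t[i - 1] / t[i]);
    return s;
}

// Product of the individual speedups; equals first time / last time.
double compounded_speedup(const std::vector<double>& t) {
    double product = 1.0;
    for (double s : step_speedups(t))
        product *= s;
    return product;
}
```

With these numbers the product comes out at roughly 730, consistent with the "about 700x" figure.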

What I want you to take away is that if you need performance badly enough to deviate from what might be considered the accepted ways of doing things, you don't have to stop after one "bottleneck".
Just because you got a big speedup doesn't mean there are no more.
In fact the next "bottleneck" might be bigger than the first, in terms of speedup factor.
So raise your expectations of speedup you can get, and go for broke.

gprof says that my high computing app spends 53% of its time inside <code>std::vector &lt;...&gt; operator [] (unsigned long)</code>, 32% of which goes to one heavily used vector. Worse, I suspect that my parallel code failing to scale beyond 3-6 cores is due to a related memory bottleneck. While my app does spend a lot of time accessing and writing memory, it seems like I should be able (or at least try) to do better than 52%. Should I try using dynamic arrays instead (size remains constant in most cases)? Would that be likely to help with possible bottlenecks? 

Actually, my preferred solution would be to solve the bottleneck and leave the vectors as is for convenience. Based on the above, are there any likely culprits or solutions (tcmalloc is out)?

optimizing `std::vector operator []` (vector access) when it becomes a bottleneck
