Efficiently reading a large 2D array (matrix) with string indices from a text file

Stack Overflow user

Asked 2019-08-19 18:12:27

Answers: 1, Views: 478, Followers: 0, Votes: 2

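First of all, I know there are many closely related questions, but my first implementation (based on suggestions from some of those Q&As) is not efficient enough.

I am looking for a way to (significantly) improve my first implementation of reading a huge (>10000x10000) non-symmetric, non-sparse 2D array (matrix) with string indices from an input text file. Also assume that the size of the matrix is not known in advance.

The external input file (think of a distance matrix between any two locations) is structured as follows: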

  A   B   C   D   E   F   G
A 0   10  20  30  40  50  60
B 15  0   25  35  45  55  65
C 20  30  0   40  50  60  70
D 25  35  45  0   65  75  85
E 15  20  25  35  0   55  65
F 20  30  40  50  60  0   70
G 35  45  55  65  75  85  0

So far, I have come up with the following solution:

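#include <cstddef>   // std::size_t
#include <cstdio>    // std::perror
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, std::map<std::string, int>>
ReadDistancesFromFile(const char *name) {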
  std::string filename(name);

  std::clog << "Trying to open and read: " << filename << std::endl;
  std::ifstream file(name);

  /// If .is_open() returns False, perror prints the error code stored in errno
  if (!file.is_open()) {
    std::perror(("Error while opening file " + filename).c_str());
    return {};  // nothing to read, return an empty map
  }

  /// Map of maps to save all read distances
  std::map<std::string, std::map<std::string, int>> distances;

  /* 1. Is such an efficient structure (container) for my purpose:
        a) to store data efficiently
        b) to access data using indices quickly?
        c) to update values time after time
        d) insertion/deletion of new elements doesn't happen often */

  /// Vector to store all `String` type indices
  std::vector<std::string> indices;

  /// String to store index (location name)
  std::string index;

  /// Store line from the external file
  std::string line;

  /// Read the first line containing all String indices (location names)
  std::getline(file, line);

  std::istringstream iss(line);

  /// Process the first line: save all location names into `indices` vector
  while (iss >> index) {
    indices.push_back(index);
  }

  /* 2. Probably I could use .reserve() before the while loop?
        The problem that I don't know the size in advance. */

  /// Read the file via std::getline(). Rules obeyed:
  ///   - first the I/O operation, then error check, then data processing
  ///   - failbit and badbit prevent data processing, eofbit does not
  while (std::getline(file, line)) {
    std::istringstream is(line);

    /* 3. Is it efficient to define a stringstream variable inside a loop? */

    /// For each new line (matrix row), read the first String element (location name)
    is >> index;

    int distance;            // To store distance value
    std::size_t column = 0;  // Column number to access location names from `indices` vector

    /// Process the line further: store Int distances from the input stream
    /// (stop at the header width so `indices` is never read out of bounds)
    while (column < indices.size() && is >> distance) {
      distances[index][indices[column++]] = distance;
    }
  }

  /// Only in case of set badbit we are sure that errno has been set
  /// Use perror() to print error details
  if (file.bad())
    std::perror(("Error while reading file " + filename).c_str());

  /// close file
  file.close();

  /// With C++11, std::map has move-semantics, which means the local map will be moved
  /// on return and in some cases even the move can be elided by the compiler (RVO)
  return distances;
}

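First, I left three questions as comments in the source code. Your answers are very welcome.
Second, I did some minimal benchmarking with a much smaller input file of about 2000 x 2000, and it took about 30 seconds on my mid-range MacBook Pro (late 2015). I think that is far too long (performance matters a lot in my case), and I would appreciate any ideas on how to improve this code.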

Quick update on performance

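After reading "Is there any advantage of using map over unordered_map in case of trivial keys?", I decided to replace std::map with std::unordered_map and keep everything else unchanged. To my considerable surprise, this cut the execution time (reading the whole file) by a factor of 4 to 5, from about 30 seconds to 5-6 seconds. Not bad!
I then revised my implementation based on G. Sliepen's answer (https://stackoverflow.com/a/57562007/3737891), that is, I replaced std::map<std::string, std::map<std::string, int>> with a std::vector<int> and stored all string indices in a separate container of type std::unordered_map<std::string, size_t>. With this approach the execution time dropped to about 1-2 seconds, which is at least 15 times faster than the initial approach!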
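
For readers who want to reproduce such timings, here is a minimal sketch of a wall-clock timer built on std::chrono::steady_clock. The helper name MeasureSeconds is hypothetical and is not part of the original post; how the asker actually measured the numbers above is not stated.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in seconds, measured with a monotonic clock.
template <typename F>
double MeasureSeconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as MeasureSeconds([&] { ReadDistancesFromFile("distances.txt"); }) would time one full read (the file name is only a placeholder). Build with optimizations enabled and repeat the run a few times, since a single measurement of file I/O is noisy.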

Tags: c++, arrays, performance, c++11, user-input

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-08-19 18:16:52

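Efficient parsing of the matrix

The most efficient approach is to read the values into a one-dimensional std::vector<int>. After the first line, you know the number of columns in the input file. At the end, you know how many rows there are by dividing the size of the vector by the number of columns. You then reinterpret the vector as a two-dimensional array.

The first line can be read with std::getline() and parsed with std::istringstream. All other lines, however, should be parsed by doing: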

int value;
file >> value;
distances.push_back(value);

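Of course, you need to skip the leftmost column of each row, which holds the row name.

By not reading the file line by line, you avoid converting each line into a std::istringstream, which is slower than parsing the values directly from file.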

std::vector<> resizes itself automatically when necessary, so appending to the end of the vector is an amortized O(1) operation.
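
The amortized cost can be made visible with a small sketch that counts how often push_back actually reallocates. It also touches on the asker's question about .reserve(): once the header line is parsed, the column count is known, so if the matrix is assumed to be square, a single reserve(columns * columns) removes the reallocations entirely. The function name CountReallocations is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Counts how often push_back had to reallocate while appending n values.
// A non-zero reserve_hint is passed to reserve() before the loop starts.
std::size_t CountReallocations(std::size_t n, std::size_t reserve_hint = 0) {
    std::vector<int> v;
    if (reserve_hint > 0) v.reserve(reserve_hint);
    std::size_t reallocations = 0;
    std::size_t capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != capacity) {  // capacity changed: storage was reallocated
            ++reallocations;
            capacity = v.capacity();
        }
    }
    return reallocations;
}
```

Because capacity grows geometrically, the count without a hint stays small even for millions of elements. The exact number depends on the standard library's growth factor.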

In the end, the vector holds columns times rows values, and to access column x of row y you have to write something like:

int desired_value = distances[x + y * columns];
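
Putting the steps of this section together, a minimal sketch might look as follows. It reads from any std::istream so it can be tried without a file. FlatMatrix and ReadFlatMatrix are hypothetical names, the reserve() call assumes a roughly square matrix, and a truncated or malformed row simply ends the read. None of these choices come from the answer itself.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for this sketch: column labels plus row-major values.
struct FlatMatrix {
    std::vector<std::string> names;  // column labels from the header line
    std::vector<int> values;         // row-major: values[x + y * columns]
    std::size_t columns = 0;
    std::size_t rows = 0;
};

// Reads the header with getline(), then extracts everything else directly
// from the stream, without building a string stream per row.
FlatMatrix ReadFlatMatrix(std::istream& in) {
    FlatMatrix m;
    std::string line;
    if (!std::getline(in, line)) return m;  // empty input

    std::istringstream header(line);
    std::string name;
    while (header >> name) m.names.push_back(name);
    m.columns = m.names.size();
    if (m.columns == 0) return m;

    // Assumption: the matrix is roughly square, so reserve columns * columns.
    m.values.reserve(m.columns * m.columns);

    std::string row_label;
    while (in >> row_label) {  // leftmost column: the row name, discarded here
        for (std::size_t x = 0; x < m.columns; ++x) {
            int value;
            if (!(in >> value)) {                     // truncated or malformed row
                m.values.resize(m.rows * m.columns);  // drop the partial row
                return m;
            }
            m.values.push_back(value);
        }
        ++m.rows;
    }
    return m;
}
```

To read from disk, pass a std::ifstream opened on the input file. The row labels are thrown away in this version because only positional access is needed here.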

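Accessing matrix elements by row and column name

If you need to access the data by the names of the rows and columns, you need to store those names together with the index each one represents. The most efficient way is to store them in a std::unordered_map<>, like so: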

std::unordered_map<std::string, size_t> columns;
std::unordered_map<std::string, size_t> rows;

/// Read the first line containing all String indices (location names)
std::getline(file, line);
std::istringstream iss(line);

/// Process the first line: save all location names into `columns` map
std::string name;
size_t i = 0;

while (iss >> name)
    columns[name] = i++;

/// Process other lines
...

You can then get the distance for a given row and column name like so:

size_t x = columns[column];
size_t y = rows[row];
int desired_value = distances[x + y * columns.size()];
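
One caveat with the lookup above: operator[] on std::unordered_map inserts a default entry (index 0) for an unknown name, so a typo silently returns a value from the first row or column. A sketch of a small wrapper that fills both maps while reading and uses at() so unknown names throw instead could look like this. The class NamedMatrix and its members are hypothetical and are not part of the answer.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical wrapper: flat row-major storage plus name-to-index maps.
class NamedMatrix {
public:
    explicit NamedMatrix(std::istream& in) {
        std::string line, name;
        if (!std::getline(in, line)) return;  // empty input
        std::istringstream header(line);
        while (header >> name) columns_.emplace(name, width_++);
        if (width_ == 0) return;

        std::size_t y = 0;
        while (in >> name) {  // row label, then width_ values
            std::size_t x = 0;
            int value;
            for (; x < width_ && in >> value; ++x) values_.push_back(value);
            if (x != width_) {  // incomplete row: discard it and stop
                values_.resize(y * width_);
                break;
            }
            rows_.emplace(name, y++);
        }
    }

    // Returns a reference so values can be read or updated in place.
    // Throws std::out_of_range for an unknown row or column name.
    int& at(const std::string& row, const std::string& column) {
        return values_[columns_.at(column) + rows_.at(row) * width_];
    }

private:
    std::unordered_map<std::string, std::size_t> columns_;
    std::unordered_map<std::string, std::size_t> rows_;
    std::vector<int> values_;
    std::size_t width_ = 0;  // number of columns in the header line
};
```

Each lookup costs two hash lookups plus one array index. In a hot loop it is cheaper to resolve the names to indices once and then work with the indices directly.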

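Why a map of maps is inefficient

A map is implemented as a balanced tree. Whenever you insert something, it has to traverse the tree to find the place for the new value, which in general takes O(log(N)) time. But if you insert new values such that they always land at the end, the tree needs to be rebalanced frequently, which makes it even slower.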

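In addition, your map stores a copy of the column name for every value, and a copy of the row name for every row. So with 10000 x 10000 elements you are storing 100 million strings, many of which are identical, and you are not interested in the strings at all, only in the row or column index they represent.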
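
To put rough numbers on this, the sketch below compares the bytes needed for the values in a flat vector with a lower bound for the key strings alone in the inner maps. Both function names are hypothetical. sizeof(std::string) is implementation-defined (commonly 24 or 32 bytes on 64-bit platforms), and tree-node pointers and any heap storage for long names come on top.

```cpp
#include <string>

// Bytes needed for n * n int values in one flat std::vector<int>.
unsigned long long FlatValueBytes(unsigned long long n) {
    return n * n * sizeof(int);
}

// Lower bound for the inner-map keys alone: one std::string object per value.
// Tree-node overhead and the values themselves are not counted here.
unsigned long long InnerKeyBytes(unsigned long long n) {
    return n * n * sizeof(std::string);
}
```

With a 4-byte int, FlatValueBytes(10000) comes to about 400 MB, while the key strings alone take several times that before any per-node overhead is counted.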

Votes: 3

Original page content provided by Stack Overflow.

Source: https://stackoverflow.com/questions/57561969
