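First of all, I know there are many highly related questions, but my first implementation (based on some suggestions from those Q&As) is not efficient enough.
I am looking for a way to (significantly) improve my first implementation of reading a huge (>10000x10000) non-symmetric, non-sparse two-dimensional array (matrix) with string indices from an input text file. Also assume that the size of the matrix is not known in advance.
The structure of the external input file (think of a distance matrix between any two locations) looks like this: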
A B C D E F G
A 0 10 20 30 40 50 60
B 15 0 25 35 45 55 65
C 20 30 0 40 50 60 70
D 25 35 45 0 65 75 85
E 15 20 25 35 0 55 65
F 20 30 40 50 60 0 70
G 35 45 55 65 75 85 0
So far, I have come up with the following solution:
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, std::map<std::string, int>>
ReadDistancesFromFile(const char *name) {
    std::string filename(name);
    std::clog << "Trying to open and read: " << filename << std::endl;
    std::ifstream file(name);
    /// If .is_open() returns false, perror prints the error code stored in errno
    /// and we return an empty map instead of reading from a closed stream
    if (!file.is_open()) {
        std::perror(("Error while opening file " + filename).c_str());
        return {};
    }
    /// Map of maps to save all read distances
    std::map<std::string, std::map<std::string, int>> distances;
    /* 1. Is such a structure (container) efficient for my purpose:
          a) to store data efficiently
          b) to access data using indices quickly?
          c) to update values time after time
          d) insertion/deletion of new elements doesn't happen often */
    /// Vector to store all `String` type indices
    std::vector<std::string> indices;
    /// String to store index (location name)
    std::string index;
    /// Store line from the external file
    std::string line;
    /// Read the first line containing all String indices (location names)
    std::getline(file, line);
    std::istringstream iss(line);
    /// Process the first line: save all location names into `indices` vector
    while (iss >> index) {
        indices.push_back(index);
    }
    /* 2. Probably I could use .reserve() before the while loop?
          The problem is that I don't know the size in advance. */
    /// Read the file via std::getline(). Rules obeyed:
    ///   - first the I/O operation, then error check, then data processing
    ///   - failbit and badbit prevent data processing, eofbit does not
    while (std::getline(file, line)) {
        std::istringstream is(line);
        /* 3. Is it efficient to define a stringstream variable inside a loop? */
        /// For each new line (matrix row), read the first String element (location name)
        is >> index;
        int distance;           // To store distance value
        std::size_t column = 0; // Column number to access location names from `indices` vector
        /// Process the line further: store Int distances from the input stream
        /// (stop at the last known column so `indices` is never read out of bounds)
        while (column < indices.size() && is >> distance) {
            distances[index][indices[column++]] = distance;
        }
    }
    /// Only in case of set badbit we are sure that errno has been set
    /// Use perror() to print error details
    if (file.bad())
        std::perror(("Error while reading file " + filename).c_str());
    /// close file
    file.close();
    /// With C++11, std::map has move-semantics, which means the local map will be moved
    /// on return and in some cases even the move can be elided by the compiler (RVO)
    return distances;
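}
Quick update on performance
I replaced std::map with std::unordered_map and kept everything else unchanged. To my considerable surprise, this reduced the execution time (reading the whole file) by a factor of 4 to 5, i.e. from ~30 sec. to ~5-6 sec. Not bad!
I then replaced std::map<std::string, std::map<std::string, int>> with std::vector<int> and stored all string indices in a separate container of type std::unordered_map<std::string, size_t>. With this approach, the execution time dropped to ~1-2 sec., i.e. at least 15 times faster than the initial approach!
Posted on 2019-08-19 18:16:52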
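Efficient parsing of the matrix
The most efficient way is to read the values into a one-dimensional std::vector<int>. After the first line, you know the number of columns in the input file. At the end, you know how many rows there are by dividing the size of the vector by the number of columns. You then reinterpret the vector as a two-dimensional array.
The first line can be read with std::getline() and parsed using std::istringstream. However, all the other lines should be parsed by just doing the following for each value:
int value;
file >> value;
distances.push_back(value);
Of course, you need to skip the left-most column on each line.
By not reading the file line by line, you avoid converting each line to a std::istringstream, which is slower than parsing the values directly from file.
std::vector<> resizes itself automatically when necessary, so appending to the end of the vector is an amortized O(1) operation.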
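As a minimal sketch of the parsing steps described above: the names `FlatMatrix` and `ReadFlatMatrix` are hypothetical and not part of the original answer, and the function takes a `std::istream&` instead of opening a file so that it can be tried on an in-memory string. It assumes that every data row consists of one whitespace-free name followed by exactly as many integers as there are names in the header line.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for this sketch: the names plus a flat matrix.
struct FlatMatrix {
    std::vector<std::string> column_names;  // tokens of the header line
    std::vector<std::string> row_names;     // left-most token of each data row
    std::vector<int> values;                // row-major: values[x + y * columns]
};

// Reads the header with std::getline(), then extracts all remaining tokens
// directly from the stream, without a std::istringstream per row.
FlatMatrix ReadFlatMatrix(std::istream& in) {
    FlatMatrix m;

    std::string line;
    if (!std::getline(in, line))
        return m;  // empty input: return an empty matrix

    std::istringstream header(line);
    std::string name;
    while (header >> name)
        m.column_names.push_back(name);

    const std::size_t columns = m.column_names.size();
    if (columns == 0)
        return m;

    // Assumption: each row is one name followed by `columns` integers.
    while (in >> name) {
        std::size_t read = 0;
        int value;
        while (read < columns && in >> value) {
            m.values.push_back(value);
            ++read;
        }
        if (read != columns) {
            // Truncated or malformed row: drop its partial values and stop.
            m.values.resize(m.values.size() - read);
            break;
        }
        m.row_names.push_back(name);
    }
    return m;
}
```

With a real file, one would pass a `std::ifstream` that has been checked with `is_open()`. The number of rows is then `m.values.size() / m.column_names.size()`, which matches `m.row_names.size()` when the input is well formed.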
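In the end, the vector holds columns times rows values, and if you want to access column x of row y, you have to write something like this:
int desired_value = distances[x + y * columns];
Accessing matrix elements by row and column name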
If you need to access the data using the names of the rows and columns, you need to store those names along with the index they represent. The most efficient way is to store them in a std::unordered_map<>, like so:
std::unordered_map<std::string, size_t> columns;
std::unordered_map<std::string, size_t> rows;
/// Read the first line containing all String indices (location names)
std::getline(file, line);
std::istringstream iss(line);
/// Process the first line: save all location names into `columns` map
std::string name;
size_t i = 0;
while (iss >> name)
    columns[name] = i++;
/// Process other lines
...
Then you can get the distance for a given row and column name like so:
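// .at() throws std::out_of_range for an unknown name; operator[] would silently insert it
size_t x = columns.at(column);
size_t y = rows.at(row);
int desired_value = distances[x + y * columns.size()];
Why a map of maps is inefficient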
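A std::map is implemented as a balanced tree. Whenever you want to insert something, it has to traverse the tree to find the place where the new value goes. In general, that takes O(log(N)) time. But if you insert new values such that they always end up at the end, the tree has to be rebalanced frequently, which makes it even slower.
In addition, your map stores a copy of the column name for every value, and a copy of the row name for every row. So with 10000 x 10000 elements, you are storing a hundred million strings, many of which are identical, and you are not interested in those strings at all, only in the row or column index they represent.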
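To tie the pieces of this answer together, here is a minimal sketch of a wrapper that stores each name only once and keeps the values in a flat vector. The class name `NamedMatrix` and its interface are hypothetical, chosen for illustration only. It assumes the row names, column names and row-major values have already been read from the file.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical wrapper: name-to-index maps plus one flat, row-major vector.
class NamedMatrix {
public:
    NamedMatrix(const std::vector<std::string>& row_names,
                const std::vector<std::string>& column_names,
                std::vector<int> values)
        : columns_(column_names.size()), values_(std::move(values)) {
        if (values_.size() != row_names.size() * column_names.size())
            throw std::invalid_argument("value count does not match the names");
        // The sizes are known at this point, so reserve() avoids rehashing.
        row_index_.reserve(row_names.size());
        for (std::size_t y = 0; y < row_names.size(); ++y)
            row_index_[row_names[y]] = y;
        column_index_.reserve(column_names.size());
        for (std::size_t x = 0; x < column_names.size(); ++x)
            column_index_[column_names[x]] = x;
    }

    // Returns a reference so that values can be updated in place.
    // Throws std::out_of_range if either name is unknown.
    int& at(const std::string& row, const std::string& column) {
        const std::size_t y = row_index_.at(row);
        const std::size_t x = column_index_.at(column);
        return values_[x + y * columns_];
    }

private:
    std::size_t columns_;
    std::vector<int> values_;
    std::unordered_map<std::string, std::size_t> row_index_;
    std::unordered_map<std::string, std::size_t> column_index_;
};
```

Each lookup costs two hash lookups and one index computation, and the matrix stores one `int` per element instead of one string per element.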
https://stackoverflow.com/questions/57561969