I want to compare 7 different files on the same row and display the entries that appear in more than one file. For example:
file1:
ID123 columns with info
ID456 columns with info
ID789 columns with info
file2:
ID123 columns with info
ID999 columns with info
ID888 columns with info
file3:
ID999 columns with info
ID123 columns with info
ID555 columns with info

Then I would like to print/display something like this:
file1 and file2 and file3: ID123
file2 and file3: ID999, ID123

I already have something like this:
    with open('some_file_1.txt', 'r') as file1:
        with open('some_file_2.txt', 'r') as file2:
            same = set(file1).intersection(file2)

    same.discard('\n')

    with open('some_output_file.txt', 'w') as file_out:
        for line in same:
            file_out.write(line)

But in this case I want to compare 7 files. Also, the files are tab-delimited, so I would like to compare only the first column of each file and write out the duplicates. I think I need something like
    for i in excelList[1:]:
        newlist.append(i.split("\t")[0])

or something similar. Even if I build 7 lists, it is hard to compare them with the ".intersection" code.
Is there an easier way to achieve this?
Posted on 2017-11-10 10:48:17
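You can use a dict to map each ID to the list of filenames it appears in:

    import csv
    from collections import defaultdict

    # filenames: the list of paths to the 7 input files
    id_to_files = defaultdict(list)
    for filename in filenames:
        with open(filename, newline="") as f:
            # tab-delimited input; ... (add any other csv options your files need)
            reader = csv.reader(f, delimiter="\t")
            for row in reader:
                if not row:  # skip blank lines
                    continue
                id_to_files[row[0]].append(filename)

So you end up with something like this: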
    print(id_to_files)
    {
        "ID123": ["file1", "file2", "file3"],
        "ID999": ["file2", "file3"],
        "ID888": ["file2"],
        "ID555": ["file3"],
        "ID456": ["file1"],
        "ID789": ["file1"],
    }

Then you can filter out the entries that list only a single file (since they are not duplicates):
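    # keep only the IDs that appear in more than one file
    duplicates = {k: v for k, v in id_to_files.items() if len(v) > 1}
    print(duplicates)
    {
        "ID123": ["file1", "file2", "file3"],
        "ID999": ["file2", "file3"],
    }

Then, depending on the exact output you want, you may end up having to build a second mapping that holds whatever best suits the output format. For example, a reverse mapping: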
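    # group the IDs by the exact set of files they appear in
    revduplicates = defaultdict(list)
    for k, v in duplicates.items():
        revduplicates[tuple(v)].append(k)
    print(revduplicates)
    {
        ('file1', 'file2', 'file3'): ['ID123'],
        ('file2', 'file3'): ['ID999'],
    }

For the exact output you describe you will need a few more steps, but this should at least get you started.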
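As a rough sketch of what those remaining steps could look like (not the only way to do it): the helper names `ids_by_file`, `shared_ids` and `format_report` below are made up for illustration. The sketch assumes that an ID shared by a larger group of files should also be listed under every smaller group, as the `file2 and file3: ID999, ID123` line in the question suggests. It stores filenames in a set rather than a list, so an ID repeated inside a single file is not mistaken for a duplicate across files.

```python
import csv
from collections import defaultdict
from itertools import combinations


def ids_by_file(filenames):
    """Map each first-column ID to the set of files it appears in."""
    id_to_files = defaultdict(set)
    for filename in filenames:
        with open(filename, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row and row[0]:  # skip blank lines and empty IDs
                    id_to_files[row[0]].add(filename)
    return id_to_files


def shared_ids(id_to_files, filenames):
    """Yield (group_of_files, sorted_ids) for each group of 2+ files with common IDs."""
    # Largest groups first, so "all files" is reported before the pairs.
    for size in range(len(filenames), 1, -1):
        for group in combinations(filenames, size):
            ids = sorted(
                i for i, files in id_to_files.items() if files.issuperset(group)
            )
            if ids:
                yield group, ids


def format_report(id_to_files, filenames):
    """Return lines such as 'file2 and file3: ID123, ID999'."""
    return [
        " and ".join(group) + ": " + ", ".join(ids)
        for group, ids in shared_ids(id_to_files, filenames)
    ]
```

You would call it as `print("\n".join(format_report(ids_by_file(filenames), filenames)))`. With 7 files there are only 120 groups of two or more files, so checking every combination is cheap. If you only want each ID listed once, under the exact set of files it appears in, the `revduplicates` mapping above is already enough.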
https://stackoverflow.com/questions/47220815