Q: Remove duplicates between two files and merge the unique lines

Ask Ubuntu user

Asked 2015-05-18 21:11:46

Answers: 2, Views: 4.5K, Followers: 0, Votes: 1

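I have two big text files, checksums_1.txt and checksums_2.txt. I want to parse these files, remove the duplicates between them, and merge the unique lines into a single file.

Every line in each file has the following structure.

size, md5, path

Example: Checksums_1.txt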

9565, a4fs614as6fas4fa4s46fsaf1, /mnt/app/1tier/2tier/filename.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/app/1tier/2tier/filename2.exe

Example: Checksums_2.txt

9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/filename.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/filename2.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/newfile.exe

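Only the part of each line after the mount point (/mnt/app/ or /mnt/temp/) should be compared between checksums_1.txt and checksums_2.txt. In other words, everything from the beginning of the line to the end of the mount point is ignored.

The data in checksums_1.txt is more important, so whenever a duplicate is found, the line from checksums_1.txt is the one that must go into the merged file.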
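To make the comparison rule concrete, here is a minimal Python sketch. The helper name `compare_key` and the `MOUNT_POINTS` tuple are assumptions made for illustration and are not part of the question. It only shows that two lines that differ in mount point, and even in checksum, map to the same key.

```python
# Hypothetical helper: the names below are illustrative, not from the question.
MOUNT_POINTS = ("/mnt/app/", "/mnt/temp/")

def compare_key(line, mount_points=MOUNT_POINTS):
    """Return the part of a 'size,md5,path' line that follows the mount point."""
    for mount_point in mount_points:
        head, sep, tail = line.partition(mount_point)
        if sep:  # mount point found: everything up to and including it is ignored
            return tail.strip()
    # No known mount point in the line: fall back to the whole (stripped) line
    return line.strip()

line_1 = "1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt"
line_2 = "1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt"
print(compare_key(line_1) == compare_key(line_2))  # prints True
```

Under this rule the two ca.crt lines count as duplicates, so the line from checksums_1.txt would be kept.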

Part of Checksums_1.txt

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index

Part of Checksums_2.txt

1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial 
2694,8a815adefde4fa0c263e74832b15de64,/mnt/temp/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/temp/Certificados/ca.db.index

Example of the merged file

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt 
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial 
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index

command-line

text-processing

2 Answers

Ask Ubuntu user

Accepted answer

Posted 2015-05-19 18:29:05

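Assuming neither file is very large, the Python script below will also do the job.

How it works

The script reads both files. Each line of file_1 (the file with priority) is split at the directory you set in the head section of the script (/mnt/app in the example); the part after it is the identifying string for that line.

Next, the lines of file_1 are written to the output file (the merged file). At the same time, every line of file_2 that contains the identifying string (the part after the mount point) is removed from the list of file_2 lines. Finally, the remaining lines of file_2 (those with no dupe in file_1) are written to the output file as well. The result: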

file_1:

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index

file_2:

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index

merged:

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial

The script

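#!/usr/bin/env python3
#---set the path to file1, file2 and the mountpoint used in file1 below
f1 = "/path/to/file_1"; m_point = "/mnt/app"; f2 = "/path/to/file_2"
merged = "/path/to/merged_file"
#---
# pair every line of file_1 with the part that follows the mount point
with open(f1) as src:
    lines1 = [(l, l.split(m_point)[-1]) for l in src.read().splitlines()]
with open(f2) as src:
    lines2 = src.read().splitlines()

# the merged file is opened once, in append mode
with open(merged, "a+") as out:
    for l in lines1:
        out.write(l[0]+"\n")
        # drop every file_2 line that contains the identifying string;
        # an empty string is skipped, since it would match all lines
        for line in [line for line in lines2 if l[1] and l[1] in line]:
            lines2.remove(line)
    # whatever is left in lines2 has no dupe in file_1
    for l in lines2:
        out.write(l+"\n")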
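One caveat: the script looks for the identifying string anywhere in a file_2 line, which is a substring match. An entry such as /Certificados/ca.crt in file_1 would therefore also remove a file_2 line ending in /Certificados/ca.crt.bak. A possible variant, sketched below under the assumption that both mount points are known, compares the keys exactly through a set. The function name `merge_lines` and its parameters are illustrative only, and the .bak line is made up for the demonstration.

```python
def merge_lines(lines1, lines2, m_point1="/mnt/app", m_point2="/mnt/temp"):
    """Return lines1 followed by the lines of lines2 whose key is not in lines1.

    The key is whatever follows the mount point, compared exactly
    (not as a substring). All names here are illustrative assumptions.
    """
    def key(line, m_point):
        return line.split(m_point, 1)[-1].strip()

    seen = {key(l, m_point1) for l in lines1}
    return lines1 + [l for l in lines2 if key(l, m_point2) not in seen]

file_1 = ["1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt"]
file_2 = [
    "1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt",
    # made-up line, only here to show the substring pitfall
    "7,0cc175b9c0f1b6a831c399e269772661,/mnt/temp/Certificados/ca.crt.bak",
]
for line in merge_lines(file_1, file_2):
    print(line)
```

With exact comparison the ca.crt.bak line survives, while the ca.crt line from file_2 is dropped in favour of the one from file_1.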

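How to use

Copy the script into an empty file and save it as merge.py
In the head section of the script, set the paths to f1 (file_1), f2 and the merged file, and the mount point as it appears in file_1.
Run it with the command: python3 /path/to/merge.py

Edit

Or a bit shorter: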

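#!/usr/bin/env python3
#---set the path to file1, file2 and the mountpoint used in file1 below
f1 = "/path/to/file_1"; m_point = "/mnt/app"; f2 = "/path/to/file_2"
merged = "/path/to/merged_file"
#---
def lines(f):
    # read a file into a list of lines and close it again
    with open(f) as src:
        return src.read().splitlines()

lines1 = lines(f1); lines2 = lines(f2)
# identifying strings: the part of each file_1 line after the mount point
checks = [c for c in (l.split(m_point)[-1] for l in lines1) if c]
# keep only the file_2 lines that contain none of the identifying strings
# (filtering avoids removing the same line twice when two checks match it)
lines2 = [l for l in lines2 if not any(c in l for c in checks)]
with open(merged, "a+") as out:
    for item in lines1+lines2:
        out.write(item+"\n")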
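Both versions read the two files completely into memory, in line with the stated assumption that the files are not very large. For larger inputs, one possible approach is to keep only the keys of file_1 in a set and stream file_2 line by line. The sketch below assumes both mount points are known and compares keys exactly; `merge_streaming` and its parameter names are hypothetical. Unlike the scripts above, it opens the merged file in "w" mode, so an existing file is overwritten rather than appended to.

```python
def merge_streaming(f1, f2, merged, m_point1="/mnt/app", m_point2="/mnt/temp"):
    """Write f1, then the lines of f2 that have no counterpart in f1, to merged."""
    seen = set()  # keys (path after the mount point) of all file_1 lines
    with open(merged, "w") as out:
        with open(f1) as src:
            for line in src:
                line = line.rstrip("\n")
                if not line.strip():
                    continue  # skip blank lines
                seen.add(line.split(m_point1, 1)[-1].strip())
                out.write(line + "\n")
        with open(f2) as src:
            for line in src:
                line = line.rstrip("\n")
                # keep the file_2 line only if file_1 had no line with this key
                if line.strip() and line.split(m_point2, 1)[-1].strip() not in seen:
                    out.write(line + "\n")
```

It would be called as, for example, merge_streaming("/path/to/file_1", "/path/to/file_2", "/path/to/merged_file"). Only the set of file_1 keys is held in memory.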

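Votes: 1

Ask Ubuntu user

Posted 2015-05-19 15:00:20

If you are willing to use Python (that is, if performance is not a concern), you can use the following script to achieve what you want: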

#!/usr/bin/env python3

import sys
import csv
import re

mountpoint1 = "/mnt/app/"
mountpoint2 = "/mnt/temp/"

if len(sys.argv) != 4:
    print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
    sys.exit(1)

inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]

# We place entries from both input files in the same dictionary
# The key will be the filename stripped of the mountpoint
# The value will be the whole line
fileDictionary = dict()

# First we read entries from file2, so that those
# from file1 will later overwrite them when needed
with open(inputFileName2) as inputFile2:
    csvReader = csv.reader(inputFile2)
    for row in csvReader:
        if len(row) == 3:
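            # The key will be the filename stripped of the mountpoint;
            # re.escape makes the mountpoint match literally, strip() drops stray spaces
            key = re.sub(re.escape(mountpoint2), '', row[2], count=1).strip()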
            # The value will be the whole line
            fileDictionary[key] = ','.join(row)

# Entries from file1 will overwrite those from file2
with open(inputFileName1) as inputFile1:
    csvReader = csv.reader(inputFile1)
    for row in csvReader:
        if len(row) == 3:
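            # The key will be the filename stripped of the mountpoint;
            # re.escape makes the mountpoint match literally, strip() drops stray spaces
            key = re.sub(re.escape(mountpoint1), '', row[2], count=1).strip()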
            # The value will be the whole line
            fileDictionary[key] = ','.join(row)

# Write all the entries to the output file
with open(outputFileName, 'w') as outputFile:
    for key in fileDictionary:
        outputFile.write(fileDictionary[key])
        outputFile.write('\n')

Simply save the script as merge-checksums.py, give it execute permission

chmod u+x merge-checksums.py

and run it as:

./merge-checksums.py Checksums_1.txt Checksums_2.txt out.txt
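As a quick check of the approach, the dictionary logic can be exercised in memory on the sample lines from the question. This is a sketch rather than part of the script: `merge_by_key` is a hypothetical name, plain string splitting stands in for the csv module, and it relies on Python 3.7+ dictionaries keeping insertion order, so an overwritten entry stays at the position where the file 2 entry was first inserted.

```python
def merge_by_key(lines1, lines2, mountpoint1="/mnt/app/", mountpoint2="/mnt/temp/"):
    """Merge two lists of 'size,md5,path' lines; lines1 wins on equal keys."""
    merged = {}
    # Insert file 2 first, then let file 1 overwrite entries with the same key
    for lines, mountpoint in ((lines2, mountpoint2), (lines1, mountpoint1)):
        for line in lines:
            parts = line.strip().split(",", 2)
            if len(parts) != 3:
                continue  # not a size,md5,path line
            key = parts[2].strip().replace(mountpoint, "", 1)
            merged[key] = line.strip()
    return list(merged.values())

checksums_1 = [
    "1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt",
    "2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem",
    "136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index",
]
checksums_2 = [
    "1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt",
    "3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial",
    "2694,8a815adefde4fa0c263e74832b15de64,/mnt/temp/Certificados/ca.db.certs/01.pem",
    "136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/temp/Certificados/ca.db.index",
]
for line in merge_by_key(checksums_1, checksums_2):
    print(line)
```

On this sample the output follows the order of Checksums_2.txt, with the three duplicated entries replaced by their /mnt/app/ lines, which matches the merged example given in the question.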

Votes: 1

Original page content provided by Ask Ubuntu.

Original link:

https://askubuntu.com/questions/625407
