Q: Separating a file by lines in Python

Stack Overflow user

Asked 2020-05-14 03:26:20

1 answer, 37 views, 0 followers, 0 votes

I have a .fastq file (I cannot use Biopython) that contains multiple samples on different lines. The file looks like this:

@sample1
ACGTC.....
+
IIIIDDDDDFF
@sample2
AGCGC....
+
IIIIIDFDFD
.
.
.
@sampleX
ACATAG
+
IIIIIDDDFFF

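I want to take the file and separate out each individual sample (i.e. lines 1-4, lines 5-8, and so on until the end of the file) and write each one to its own file (i.e. sample1.fastq contains lines 1-4 for sample 1, and so on). Can this be done with a loop in Python?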

python

1 Answer

Stack Overflow user

Answered 2020-05-14 03:46:50

You can use a defaultdict and a regex for this:

import re
from collections import defaultdict

# Get file contents
with open("test.fastq", "r") as f:
    content = f.read()

samples = defaultdict(list) # Make defaultdict of empty lists
identifier = ""

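# Iterate through every line in the file
# (splitlines() avoids a trailing empty string when the file ends with a newline)
for line in content.splitlines():
    # Find lines which start with @
    # (caveat: a FASTQ quality line may also begin with @)
    if re.match(r"^@.*", line):
        # Set identifier to match following lines to this section
        # (strip only the leading @, not every @ in the line)
        identifier = line[1:]
    else:
        # Add the line to its identifier
        samples[identifier].append(line)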

Now all you have to do is save the contents of this defaultdict to multiple files:

# Loop through all samples (and their contents)
for sample_name, sample_items in samples.items():
    # Create new file with the name of its sample_name.fastq
    # (You might want to change the naming)
    with open(f"{sample_name}.fastq", "w") as f:
        # Write each element of the sample_items to new line
        f.write("\n".join(sample_items))

Including @sample_name at the start of each file (as the first line) might be useful to you, but I wasn't sure you wanted that, so I didn't add it.
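
If you do want the header kept, one possible variant reads the file four lines at a time, which is what the question describes (lines 1-4, 5-8, and so on). This is only a sketch: split_fastq and its out_dir parameter are hypothetical names, and it assumes every record is exactly four lines and that the sample name is safe to use as a file name.

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, out_dir="."):
    """Write each 4-line record of a FASTQ file to its own file.

    Hypothetical helper: assumes every record is exactly four lines
    (header, sequence, '+', quality) with no wrapped sequences.
    """
    out_dir = Path(out_dir)
    written = []
    with open(path, "r") as f:
        while True:
            # Take the next four lines; an empty list means end of file
            record = list(islice(f, 4))
            if not record:
                break
            # Skip chunks that contain only blank lines (e.g. at end of file)
            if not any(line.strip() for line in record):
                continue
            header = record[0].strip()
            if not header.startswith("@") or len(header) == 1:
                raise ValueError(f"expected a header line, got {record[0]!r}")
            # Use the first word after '@' as the file name
            name = header[1:].split()[0]
            out_path = out_dir / f"{name}.fastq"
            with open(out_path, "w") as out:
                # Lines still carry their newlines, so write them unchanged
                out.writelines(record)
            written.append(out_path)
    return written
```

Calling split_fastq("test.fastq") would write sample1.fastq, sample2.fastq, and so on into the current directory. Because it counts lines instead of looking for @, a quality line that happens to start with @ is not mistaken for a header.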

Note that you can adjust the regex to match only @sample[number] rather than every line starting with @, for example with re.match(r"^@sample\d+", line).
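
As a rough illustration of that stricter pattern, the snippet below filters a small list of made-up lines. The sample data and the trailing $ anchor are my own assumptions; drop the $ if your headers carry extra text after the number.

```python
import re

# Assumed pattern: '@sample' followed by one or more digits, then end of line
header_pattern = re.compile(r"@sample\d+$")

# Made-up lines; '@IIIDDFF' stands in for a quality line that starts with '@'
lines = ["@sample1", "ACGTC", "+", "@IIIDDFF", "@sample22"]

# re.match anchors at the start of the string, so no leading ^ is needed
headers = [line for line in lines if header_pattern.match(line)]
print(headers)  # ['@sample1', '@sample22']
```

The quality line is left out here because it does not start with @sample, which is the main benefit over matching every @ line.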

0 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/61783234
