我有一组BAM文件,这些文件是用BWA-MEM生成的,并使用GATK、IndelRealigner等进行了进一步的处理。为了加快处理速度,我正在用较小的块对BAM文件进行预处理。但是,我必须将这些单独的文件合并到一个BAM文件中,然后再进行变体调用,这是我的Snakemake管道的一个主要问题。
我的输入文件遵循这种命名约定。
# Sample 1 BAM files
OVCA-1-FRESH-1_S16_L001_realigned.bam
OVCA-1-FRESH-1_S16_L002_realigned.bam
OVCA-1-FRESH-1_S16_L003_realigned.bam
OVCA-1-FRESH-1_S16_L004_realigned.bam
# Sample 2 BAM files
OVCA-2-FRESH-1_S16_L001_realigned.bam
OVCA-2-FRESH-1_S16_L002_realigned.bam
OVCA-2-FRESH-1_S16_L003_realigned.bam
OVCA-2-FRESH-1_S16_L004_realigned.bam有问题的管道是这样的:
# Map start input files
RUN_ID, LINE = glob_wildcards('{run_id}_L{line}_realigned.bam')
rule all:
input:
expand('{run_id}_realigned.bam', run_id=RUN_ID)
# Map input files for merging. This function should collect all
# BAM files that match the {run_id} wildcard.
def samtools_merge_inputs(wildcards):
files = expand('{run_id}_L{line}_realigned.bam', run_id=RUN_ID, line=LINES)
return files
# Perform BAM merging.
rule samtools_merge:
input:
samtools_merge_inputs
output:
'{run_id}_realigned.bam
shell:
'samtools merge -h {input} {output}'我尝试构建一个输入函数,它收集与当前处理的通配符匹配的所有可用输入文件。当我对管道执行试运行时,我可以看到函数samtools_merge_inputs没有正常工作,因为它收集所有可用的BAM文件并多次重复它们:
rule samtools_merge:
input:
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam,
OVCA-2-FRESH-1_S4_L001_realigned.bam,
OVCA-2-FRESH-1_S4_L002_realigned.bam,
OVCA-2-FRESH-1_S4_L003_realigned.bam,
OVCA-2-FRESH-1_S4_L004_realigned.bam
output:
OVCA-1-FRESH-1_S16_realigned.bam
jobid:
18
wildcards:
run_id=OVCA-1-FRESH-1_S16它应该是这样的:
rule samtools_merge:
input:
OVCA-1-FRESH-1_S16_L001_realigned.bam,
OVCA-1-FRESH-1_S16_L002_realigned.bam,
OVCA-1-FRESH-1_S16_L003_realigned.bam,
OVCA-1-FRESH-1_S16_L004_realigned.bam
output:
OVCA-1-FRESH-1_S16_realigned.bam
jobid:
18
wildcards:
run_id=OVCA-1-FRESH-1_S16我应该如何编辑samtools_merge_inputs函数来收集所需的输入文件?samtools_merge_inputs--我确实认识到,我可以简单地忘记输入函数,只需用通配符向samtools_merge输入四个输入文件,但是我真的很想学习如何在这种情况下使用输入函数,因为我在其他管道中也面临类似的问题。我试图从其他帖子中寻求帮助,但到目前为止,我还没有找到符合我目的的答案。
谢谢你的帮助!
发布于 2018-08-16 12:49:50
您的函数在这里不使用通配符。应该是这样的:
def samtools_merge_inputs(wildcards):
files = expand(wildcards.run_id+'_L{line}_realigned.bam', line=LINES)
return files当然,如果你在所有车道上都有样品。调用函数时,所有通配符都作为对象传递到函数的wildcards参数中。
你也可以:
files = expand('{run_id}_L{line}_realigned.bam', run_id=wildcards.run_id, line=LINES) 你有许多在你的蛇形文件中行不通的东西。
首先,在samtools合并规则中缺少一个“‘”:
rule samtools_merge:
input:
samtools_merge_inputs
output:
'{run_id}_realigned.bam'<-----
shell:
'samtools merge -h {input} {output}'注意变量名(行与行)
其次,函数glob_wildcards()将返回找到的所有值的列表,这意味着您的两个变量如下所示:
RUN_ID, LINES = glob_wildcards('{run_id}_L{line}_realigned.bam')
print(RUN_ID)
['OVCA-2-FRESH-1_S16', 'OVCA-2-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-1-FRESH-1_S16', 'OVCA-2-FRESH-1_S16', 'OVCA-2-FRESH-1_S16']
print(LINES)
['002', '001', '001', '002', '004', '003', '003', '004']我肯定不是你想要的。解决办法是有正确的结构来描述你的样本。例如(同样,如果所有样本都在所有车道上):
RUN_ID = ["OVCA-1-FRESH-1_S16","OVCA-2-FRESH-1_S16"]
LINES = ["1","2","3","4"]最后一件事:您的输入和输出不能用通配符来区分,这意味着您将得到一个错误Cyclic dependency on rule samtools_merge或RecursionError: maximum recursion depth exceeded in comparison。我建议您为输出选择不同的名称。所有这些加在一起:
# Map start input files
RUN_ID = ["OVCA-1-FRESH-1_S16","OVCA-2-FRESH-1_S16"]
LINES = ["001","002","003","004"]
rule all:
input:
expand('{run_id}_realignedFinal.bam', run_id=RUN_ID)
# Map input files for merging. This function should collect all
# BAM files that match the {run_id} wildcard.
def samtools_merge_inputs(wildcards):
files = expand('{run_id}_L{line}_realigned.bam', run_id=wildcards.run_id, line=LINES)
return files
# Perform BAM merging.
rule samtools_merge:
input:
samtools_merge_inputs
output:
'{run_id}_realignedFinal.bam'
shell:
'samtools merge -h {input} {output}'尚未检查shell命令,但我的医生说:
Usage: samtools merge [-nurlf] [-h inh.sam] [-b <bamlist.fofn>] <out.bam> <in1.bam> [<in2.bam> ... <inN.bam>]
https://stackoverflow.com/questions/51855608
复制相似问题