文章/答案/技术大牛

发布

社区首页 >问答首页 >当不需要通配符的某些组合(缺少输入文件)和“合并”规则时，如何在snakemake中使用展开？

问当不需要通配符的某些组合(缺少输入文件)和“合并”规则时，如何在snakemake中使用展开？
EN

Stack Overflow用户

提问于 2018-01-15 13:00:22

回答 1查看 2.7K关注 0票数 0

这是一个比报告的here稍微复杂一些的病例。我的输入文件如下：

ont_Hv1_2.5+.fastq                                                                                                                                                                              
ont_Hv2_2.5+.fastq                                                                                                                                                                              
pacBio_Hv1_1-1.5.fastq                                                                                                                                                                          
pacBio_Hv1_1.5-2.5.fastq                                                                                                                                                                        
pacBio_Hv1_2.5+.fastq                                                                                                                                                                           
pacBio_Hv2_1-1.5.fastq                                                                                                                                                                          
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq

我只想处理现有的输入文件，即自动跳过那些与不存在的输入文件对应的通配符组合。

我的Snakefile看起来如下：

import glob
import os.path
from itertools import product

#make wildcards regexps non-greedy:
wildcard_constraints:
    capDesign = "[^_/]+",
    sizeFrac = "[^_/]+",
    techname = "[^_/]+",

# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) variables from input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")
# make lists non-redundant:
CAPDESIGNS=set(CAPDESIGNS)
SIZEFRACS=set(SIZEFRACS)
TECHNAMES=set(TECHNAMES)

# make list of authorized wildcard combinations (based on presence of input files)
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES,CAPDESIGNS,SIZEFRACS):
    if(os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq")):
        tup=(("techname", comb[0]),("capDesign", comb[1]),("sizeFrac", comb[2]))
        AUTHORIZEDCOMBINATIONS.append(tup)

# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in product(*args, **kwargs):
            for ac in AUTHORIZEDCOMBINATIONS:
                if(wc_comb[0:3] == ac):
                    print ("SUCCESS")
                    yield(wc_comb)
                    break
    return filtered_combinator

filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)

rule all:
    input:
        expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)

#get read lengths for all FASTQ files:
rule getReadLength:
    input: "{techname}_{capDesign}_{sizeFrac}.fastq"
    output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
    shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s\"\\t\"length($2)}}' > {output}" #fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in second field

#combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
    input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
    output: "{techname}_{capDesign}_all.readlength.tsv"
    shell: "cat {input} > {output}"

规则getReadLength收集每个输入FASTQ的读取长度(即每个techname, capDesign, sizeFrac组合)。

规则aggReadLength合并getReadLength为每个techname, capDesign组合生成的读取长度统计信息。

以下消息导致工作流失败：

Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq

因此，应用于目标的通配符组合过滤步骤似乎没有传播到它所依赖的所有上游规则。有人知道怎么做吗？

(使用Snakemake版本4.4.0)

提前谢谢

python

wildcard

snakemake

回答 1

Stack Overflow用户

回答已采纳

发布于 2018-01-18 14:25:08

我想我解决了这个问题，希望这对别人有用。

在aggReadlength规则中，我替换了

input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)

使用

input: lambda wildcards: expand("{techname}_{capDesign}_{sizeFrac}.readlength.tsv", filtered_product, techname=wildcards.techname, capDesign=wildcards.capDesign, sizeFrac=SIZEFRACS)

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/48263541

复制

相似问题

问当不需要通配符的某些组合(缺少输入文件)和“合并”规则时，如何在snakemake中使用展开？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当不需要通配符的某些组合(缺少输入文件)和“合并”规则时，如何在snakemake中使用展开？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当不需要通配符的某些组合(缺少输入文件)和“合并”规则时，如何在snakemake中使用展开？
EN