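Let's say I have sample SAMPLE_A, split into two files SAMPLE_A_1 and SAMPLE_A_2 and associated with barcodes AATT and TTAA, and sample SAMPLE_B, split into four files SAMPLE_B_1...SAMPLE_B_4 and associated with barcodes CCGG, GGCC and GCGC.
I can write getSampleNames() to return [SAMPLE_A, SAMPLE_A, SAMPLE_B, SAMPLE_B, SAMPLE_B, SAMPLE_B] and [1, 2, 1, 2, 3, 4], then zip them to get the {sample}_{id} combinations. I can then do the same for the barcodes: [SAMPLE_A, SAMPLE_A, SAMPLE_B, SAMPLE_B, SAMPLE_B] and [AATT, TTAA, CCGG, GGCC, GCGC].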
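To make the pairing concrete, here is a minimal plain-Python sketch of what the zipped expansion is expected to produce. The helper `zip_paths` is hypothetical and only imitates the pairwise behaviour of `expand(..., zip, ...)`; the list contents are the example values from the question, not real data.

```python
# Parallel lists as described in the question (illustrative values).
SAMPLES_ID = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
IDs = [1, 2, 1, 2, 3, 4]
SAMPLES_BC = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
BCs = ["AATT", "TTAA", "CCGG", "GGCC", "GCGC"]


def zip_paths(pattern, **columns):
    """Hypothetical helper: fill `pattern` once per position of the
    parallel lists (pairwise, not a cross product)."""
    keys = list(columns)
    return [pattern.format(**dict(zip(keys, row)))
            for row in zip(*columns.values())]


# One path per (sample, id) pair and one per (sample, barcode) pair.
fltnc = zip_paths("{sample}/polyA_trimming/{sample}_{id}.fltnc.bam",
                  sample=SAMPLES_ID, id=IDs)
cells = zip_paths("{sample}/cells/{barcode}_{sample}/fltnc.bam",
                  sample=SAMPLES_BC, barcode=BCs)

print(fltnc[0])   # SAMPLE_A/polyA_trimming/SAMPLE_A_1.fltnc.bam
print(cells[-1])  # SAMPLE_B/cells/GCGC_SAMPLE_B/fltnc.bam
```

Because every path for every sample ends up in one flat list, a rule that takes the whole list as input depends on all samples at once.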
SAMPLES_ID, IDs = getSampleNames()
SAMPLES_BC, BCs = getBCs(set(SAMPLES_ID))

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
    shell:
        "isoseq3 refine {input} "

rule split:
    input:
        expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
    output:
        expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
    shell:
        "python {params.script_dir}/split_cells_bam.py"

rule dedup_split:
    input:
        "{sample}/cells/{barcode}_{sample}/fltnc.bam"
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
    shell:
        "isoseq3 dedup {input} {output.bam} "

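rule merge:
    input:
        expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
               zip, sample = SAMPLES_BC, barcode = BCs),

How can I prevent rule split from becoming a bottleneck in my pipeline? Right now it waits for rule refine to finish on all samples, which is not necessary: each sample should run independently, but I can't do that because the set of barcodes differs from sample to sample. Is there a way to have something like

expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC]),

where {sample} of SAMPLES_BC is a key of a BCs dictionary? And the same for IDs? I know I could use functions, but then I don't know how to propagate {barcode} through the rules.
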
Answered 2022-03-01 01:33:38
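I found out how to use dictionaries through input functions, and that solved my problem!
The main drawback of this solution is that I have to create a dummy file as the output of rule split instead of checking that every "{sample}/cells/{barcode}_{sample}/fltnc.bam" file has been created, so I am still looking for something more elegant.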
IDs = getSampleNames()  # {SAMPLE_A: [1, 2], SAMPLE_B: [1, 2, 3, 4]}
SAMPLES = list(IDs.keys())
BCs = getBCs(SAMPLES)   # {SAMPLE_A: [AATT, TTAA], SAMPLE_B: [CCGG, GGCC, GCGC]}

# function linking IDs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/polyA_trimming/{{sample}}_{id}.fltnc.bam',
                  id = IDs[wildcards.sample])

# function linking BCs and SAMPLE
# (named sample2bc so that it does not overwrite sample2ids and matches rule merge)
def sample2bc(wildcards):
    return expand('{{sample}}/cells/{barcode}_{{sample}}/dedup/dedup.bam',
                  barcode = BCs[wildcards.sample])

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',

rule split:
    input:
        sample2ids
    output:
        # cannot use a function here, so I create a dummy file to pipe
        'dummy_file.txt'

rule dedup_split:
    input:
        'dummy_file.txt'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",

rule merge:
    input:
        sample2bc

Answered 2022-02-28 13:54:09
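Per your comment, there are a few routes to go, including changing the data structure that holds the samples, barcodes and ids. For now, you can simply create one rule per sample: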
for sample in set(SAMPLES_ID):  # get unique samples
    # get ids and barcodes for this sample
    ids = [tup[1] for tup in zip(SAMPLES_ID, IDs) if tup[0] == sample]
    bcs = [tup[1] for tup in zip(SAMPLES_BC, BCs) if tup[0] == sample]
    rule:
        name: f'{sample}_split'
        input:
            expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
                   sample = sample, id = ids),
        output:
            expand("{sample}/cells/{barcode}_{sample}/fltnc.bam",
                   sample = sample, barcode = bcs),
        shell:
            "python {params.script_dir}/split_cells_bam.py"

You don't need zip in the expand, because ids and bcs belong to a single sample. Overall, I don't think this is the best approach, but it is probably the easiest one for your current workflow.
I just noticed the shell command: how are you passing the inputs and outputs to the script?
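One of the routes mentioned in this answer, changing the data structure, could look roughly like the sketch below. It regroups the parallel lists into per-sample dictionaries in plain Python, so that the ids and barcodes of one sample can be looked up by key. The helper names `group_by_sample` and `split_outputs` are hypothetical, and the list contents are the example values from the question.

```python
from collections import defaultdict

# Parallel lists as in the question (illustrative values).
SAMPLES_ID = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
IDs = [1, 2, 1, 2, 3, 4]
SAMPLES_BC = ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_B", "SAMPLE_B"]
BCs = ["AATT", "TTAA", "CCGG", "GGCC", "GCGC"]


def group_by_sample(samples, values):
    """Hypothetical helper: turn two parallel lists into {sample: [values]}."""
    grouped = defaultdict(list)
    for sample, value in zip(samples, values):
        grouped[sample].append(value)
    return dict(grouped)


ids_by_sample = group_by_sample(SAMPLES_ID, IDs)
bcs_by_sample = group_by_sample(SAMPLES_BC, BCs)


def split_outputs(sample):
    """Hypothetical helper: per-barcode files expected for one sample."""
    return [f"{sample}/cells/{bc}_{sample}/fltnc.bam"
            for bc in bcs_by_sample[sample]]


print(ids_by_sample)  # {'SAMPLE_A': [1, 2], 'SAMPLE_B': [1, 2, 3, 4]}
print(split_outputs("SAMPLE_A"))
```

With dictionaries like these, the list comprehensions over `zip(SAMPLES_ID, IDs)` in the loop above reduce to `ids_by_sample[sample]` and `bcs_by_sample[sample]`.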
https://stackoverflow.com/questions/71287637