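I'm trying a variation on sometimes skipping tasks in ducttape, based on the tutorial here: http://nschneid.github.io/ducttape-crash-course/tutorial5.html
([ducttape][1] is a Bash/Scala-based workflow management tool.)
I'm trying to do a cross-product so that task1 runs on both "clean" data and "dirty" data. The idea is to traverse the same path in both cases, but to skip the preprocessing step in some of them. To do this, I need a cross-product of tasks.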
task cleanup < in=(Dirty: a=data/a b=data/b) > out {
prefix=$(cat $in)
echo "$prefix-clean" > $out
}
global {
data=(Data: dirty=(Dirty: a=data/a b=data/b) clean=(Clean: a=$out@cleanup b=$out@cleanup))
}
task task1 < in=$data > out
{
cat $in > $out
}
plan FinalTasks {
reach task1 via (Dirty: *) * (Data: *) * (Clean: *)
}
Here's the execution plan. I would expect 6 tasks, but two duplicate tasks are being executed.
$ ducttape skip.tape
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 7 previous workflow versions
Finding hyperpaths contained in plan...
Found 8 vertices implied by realization plan FinalTasks
Union of all planned vertices has size 8
Checking for completed tasks from versions 1 through 7...
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Baseline.baseline (Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Dirty.b (Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Baseline.baseline (Data.dirty+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Dirty.b (Data.dirty+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean+Dirty.b (Clean.b+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean+Dirty.b (Clean.a+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean (Clean.a+Data.clean+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean (Clean.b+Data.clean+Dirty.a)
Are you sure you want to run these 8 tasks? [y/n]
Leaving the symlinks out of the output below, my duplicates are here:
$ head task1/*/out
==> Baseline.baseline/out <==
1
==> Clean.b+Data.clean/out <==
1-clean
==> Data.clean/out <==
1-clean
==> Clean.b+Data.clean+Dirty.b/out <==
2-clean
==> Data.clean+Dirty.b/out <==
2-clean
==> Dirty.b/out <==
2
Could somebody with experience in ducttape help me find my cross-product problem?
[1]: https://github.com/jhclark/ducttape
Posted on 2014-05-16 16:16:11
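So why do we get 4 realizations involving the branch point Clean at task1, rather than just two?
The answer is that in ducttape, branch points are always propagated through all transitive dependencies of a task. So the branch point "Dirty" from the task "cleanup" is propagated through clean=(Clean: a=$out@cleanup b=$out@cleanup). At that point the variable "clean" contains the cross-product of the original "Dirty" branch point and the newly introduced "Clean" branch point.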
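To make the counting concrete, here is a rough Python sketch of that propagation. It is only a toy model of the cross-product, not how ducttape is actually implemented; the `BRANCHES` table and the `realizations` helper are names I made up for illustration.

```python
from itertools import product

# Branch points and their branches, as declared in the question's workflow.
BRANCHES = {"Dirty": ["a", "b"], "Clean": ["a", "b"]}

def realizations(data_branch, visible_branch_points):
    """Cross-product of every branch point visible on one Data branch."""
    names = sorted(visible_branch_points)
    result = []
    for combo in product(*(BRANCHES[name] for name in names)):
        labels = [f"{name}.{branch}" for name, branch in zip(names, combo)]
        labels.append(f"Data.{data_branch}")
        result.append("+".join(sorted(labels)))
    return result

# Data.dirty reads the raw files directly, so only Dirty is visible there.
# Data.clean reads $out@cleanup through the Clean branch point, so it sees
# Clean and, transitively, the Dirty branch point of the cleanup task.
task1 = realizations("dirty", {"Dirty"}) + realizations("clean", {"Clean", "Dirty"})

for name in task1:
    print(name)
print(len(task1), "realizations of task1")
```

In this model the clean side alone contributes 2 x 2 = 4 realizations, which together with the 2 dirty ones gives the 6 task1 entries listed in the work plan.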
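The minimal change to make would be to change
clean=(Clean: a=$out@cleanup b=$out@cleanup)
to
clean=$out@cleanup
This would give you the desired number of realizations, but it's a bit confusing to use the branch point name "Dirty" just to control which input data set you're using: with only this minimal change, the two realizations of the task "cleanup" would be (Dirty: a) and (Dirty: b).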
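As a quick sanity check on the numbers, here is a small Python sketch that counts the realizations under the minimal change. Again this is just a counting model under my own assumptions, not ducttape output. I'd also assume the plan's via clause no longer needs to mention (Clean: *) once that branch point is gone, but I haven't run that.

```python
from itertools import product

# Branches of the only remaining input branch point.
dirty = ["a", "b"]

# cleanup still has one realization per Dirty branch.
cleanup = [f"Dirty.{d}" for d in dirty]

# With clean=$out@cleanup there is no Clean branch point any more,
# so task1 varies only over Data (dirty/clean) and Dirty (a/b).
task1 = [f"Data.{kind}+Dirty.{d}" for kind, d in product(["dirty", "clean"], dirty)]

total = len(cleanup) + len(task1)
print(total, "tasks in total")
```

That is 2 cleanup realizations plus 4 task1 realizations, i.e. the 6 tasks the question expected.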
It might make your workflow easier to understand if you refactor it like this:
global {
raw_data=(DataSet: a=data/a b=data/b)
}
task cleanup < in=$raw_data > out {
prefix=$(cat $in)
echo "$prefix-clean" > $out
}
global {
ready_data=(DoCleanup: no=$raw_data yes=$out@cleanup)
}
task task1 < in=$ready_data > out
{
cat $in > $out
}
plan FinalTasks {
reach task1 via (DataSet: *) * (DoCleanup: *)
}
https://stackoverflow.com/questions/23698707