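I am trying to find clumps of k-mers. Basically, I need to find all k-length substrings that occur at least t times within some window of length L in the genome. I have implemented what I think is a solution, but I believe it contains a bug, because the system I use for verification (beta.stepic.org) rejects it. Can you see where I went wrong? My solution works as follows: find all the top k-mers (k-length substrings) and their starting indices. Then I partition the indices into groups of t, where t is the required number of occurrences, and take the difference between the last and first index in each group, offset by k (the whole k-mer must fit inside the L-window, so adding k accounts for the length of the last occurrence). The indices are in ascending order. Where is the bug?
Clump Finding Problem: Find patterns forming clumps in a string.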
Input: A string Genome, and integers k, L, and t.
Output: All distinct k-mers forming (L, t)-clumps in Genome.
Sample Input
Genome:
k: 5
L: 50
t: 4
Sample Output
CGACA GAAGA
(defn get-indices [source target]
  "Returns the indices for the substring target
  found in source in ascending order. This includes overlaps."
  (let [search (java.util.regex.Pattern/compile (str "(?=(" target "))"))
        matcher (re-matcher search source)
        not-nil? (complement nil?)]
    (defn inner [matcher]
      (if (not-nil? (re-find matcher))
        (cons (.start matcher) (inner matcher))))
    (inner matcher)))

(defn get-frequent-kmer [source k]
  "Gets the most frequent k-mers of size k from source"
  (let [max-val (val (apply max-key val (frequencies (partition k 1 source))))]
    (map first (filter #(= (val %) max-val)
                       (frequencies (map (partial apply str) (partition k 1 source)))))))

(defn find-clumps [genome k L t]
  (for [k-mer (get-frequent-kmer genome k)]
    (let [indices (get-indices genome k-mer)]
      (if (some true? (map #(<= (+ k (- (last %) (first %))) L)
                           (partition t 1 indices)))
        k-mer))))

Posted on 2013-11-20 16:38:09
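Apart from some code style that could be improved, the main problem I see is that you filter on max-key val and do not take t into account at all in that initial filtering step.
When you look for the most frequent k-mers of size k, you keep only the ones tied for the highest count:
(apply max-key val (frequencies (partition k 1 source)))
Because you then filter with max-val
(filter #(= (val %) max-val)
those are the only k-mers you go on to analyze:
(for [k-mer (get-frequent-kmer genome k)]
The problem is that if t is 4 but some 5-mers repeat more than 4 times, then the 5-mers that repeat exactly 4 times are thrown out of the equation, even though they may still form a clump.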
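The filtering change described above could look roughly like this. It is a minimal sketch, not the asker's code: `kmers-at-least` is a hypothetical helper name, and it assumes `source` is a plain string. It keeps every k-mer whose total count reaches t instead of only the ones tied for the maximum.

```clojure
(defn kmers-at-least
  "Returns the k-mers that occur at least t times in source.
  Hypothetical helper (not from the question): it replaces the
  'keep only the maximum count' filter with a 'count >= t' filter."
  [source k t]
  (->> (partition k 1 source)       ; all overlapping windows of k chars
       (map #(apply str %))         ; turn each window back into a string
       frequencies                  ; k-mer -> number of occurrences
       (filter #(>= (val %) t))     ; keep anything that reaches t
       (map key)))

;; In "ABABABCDCD" the 2-mer "AB" occurs 3 times while "BA" and "CD"
;; occur twice. A max-only filter would keep just "AB"; with t = 2
;; all three survive.
(set (kmers-at-least "ABABABCDCD" 2 2))
```

Reaching t occurrences overall is only a necessary condition, so the window check on the indices is still needed afterwards.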
Posted on 2013-11-20 21:07:41
Here is some working code:
;; clojure.set must be loaded before clojure.set/union can be used.
(require 'clojure.set)

(defn k-mers
  "Returns a seq of all k-mers in text."
  [k text]
  (map #(apply str %) (partition k 1 text)))

(defn most-frequent-k-mers
  "Returns a seq of k-mers in text appearing at least t times."
  [k t text]
  (->> (k-mers k text)
       (frequencies)
       (filter #(<= t (second %)))
       (map first)))

(defn find-clump
  "Finds k-mers forming (L, t) clumps in text."
  [k L t text]
  ;; Every window of length L is scanned on its own, and the
  ;; per-window results are merged into a single set.
  (let [windows (partition L 1 text)]
    (->> windows
         (map #(most-frequent-k-mers k t %))
         (map set)
         (apply clojure.set/union))))

I think this should get you started.
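The index-based idea from the question can also be kept, as long as every k-mer is considered rather than only the most frequent ones. The following is a rough sketch under that assumption, not a definitive implementation: `kmer-positions` and `find-clumps-by-index` are hypothetical names that appear in neither the question nor the answers, and the genome is assumed to be a plain string.

```clojure
(defn kmer-positions
  "Maps each k-mer in genome to the vector of its start indices,
  in ascending order. Hypothetical helper name."
  [genome k]
  (reduce (fn [m i]
            ;; Append index i to the vector stored under the k-mer at i.
            (update m (subs genome i (+ i k)) (fnil conj []) i))
          {}
          (range (inc (- (count genome) k)))))

(defn find-clumps-by-index
  "Returns the k-mers for which some t consecutive occurrences fit
  inside a window of length L. Hypothetical name; it reuses the
  window test from the question: (last - first) + k <= L."
  [genome k L t]
  (for [[k-mer idxs] (kmer-positions genome k)
        :when (some (fn [group]
                      (<= (+ k (- (last group) (first group))) L))
                    (partition t 1 idxs))]
    k-mer))

;; "AB" occurs at indices 0 and 8, too far apart for L = 4,
;; while "XX" occurs at adjacent indices and qualifies.
(set (find-clumps-by-index "ABXXXXXXAB" 2 4 2))
```

This scans the genome once to collect positions instead of recounting k-mers in every window, which may matter for long inputs.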
https://stackoverflow.com/questions/20087842