我正在尝试解决DNA问题,这是更多的改进(?)LCS问题的版本。在这个问题中,有字符串,它是字符串和半子字符串,它允许字符串的一部分有一个或没有跳过一个字母。例如,对于字符串"desktop",它有半子串{"destop", "dek", "stop", "skop","desk","top"},所有的半子串都有一个或没有跳过一个字母。
现在,我得到了两个由{a,t,g,c}组成的DNA字符串。我正在尝试寻找最长的半子串,LSS。如果有多个LSS,则以最快的顺序打印出其中的一个。
例如,两个dnas {attgcgtagcaatg, tctcaggtcgatagtgac}打印出"tctagcaatg"
然后aaaattttcccc, cccgggggaatatca打印出"aattc"
我正在尝试使用常见的LCS算法,但无法使用表格解决它,尽管我确实解决了没有跳过字母的算法。有什么建议吗?
发布于 2019-04-14 10:39:43
这是用Python编写的LCS动态编程解决方案的变体。
首先,我为所有的子串构建一个Suffix Tree,这些子串可以通过跳过规则从每个字符串中生成。然后我将后缀树相交。然后我在寻找可以从交叉树中得到的最长的字符串。
请注意,从技术上讲,这是O(n^2)。最坏的情况是两个字符串是相同的字符,反复重复。因为你最终得到了很多逻辑上类似的东西,“一个字符串中位置42的'l‘可能与另一个字符串中位置54的位置l匹配”。但在实践中,它将是O(n)。
def find_subtree (text, max_skip=1):
tree = {}
tree_at_position = {}
def subtree_from_position (position):
if position not in tree_at_position:
this_tree = {}
if position < len(text):
char = text[position]
# Make sure that we've populated the further tree.
subtree_from_position(position + 1)
# If this char appeared later, include those possible matches.
if char in tree:
for char2, subtree in tree[char].iteritems():
this_tree[char2] = subtree
# And now update the new choices.
for skip in range(max_skip + 1, 0, -1):
if position + skip < len(text):
this_tree[text[position + skip]] = subtree_from_position(position + skip)
tree[char] = this_tree
tree_at_position[position] = this_tree
return tree_at_position[position]
subtree_from_position(0)
return tree
def find_longest_common_semistring (text1, text2):
tree1 = find_subtree(text1)
tree2 = find_subtree(text2)
answered = {}
def find_intersection (subtree1, subtree2):
unique = (id(subtree1), id(subtree2))
if unique not in answered:
answer = {}
for k, v in subtree1.iteritems():
if k in subtree2:
answer[k] = find_intersection(v, subtree2[k])
answered[unique] = answer
return answered[unique]
found_longest = {}
def find_longest (tree):
if id(tree) not in found_longest:
best_candidate = ''
for char, subtree in tree.iteritems():
candidate = char + find_longest(subtree)
if len(best_candidate) < len(candidate):
best_candidate = candidate
found_longest[id(tree)] = best_candidate
return found_longest[id(tree)]
intersection_tree = find_intersection(tree1, tree2)
return find_longest(intersection_tree)
print(find_longest_common_semistring("attgcgtagcaatg", "tctcaggtcgatagtgac"))发布于 2019-04-14 06:06:50
让g(c, rs, rt)表示字符串中最长的公共半子字符串S和T,以rs和rt结尾,其中rs和rt分别是字符c在S和T中的排名出现次数,K是允许的跳转次数。然后我们可以形成一个递归,我们必须对S和T中的所有c对执行该递归。
JavaScript代码:
function f(S, T, K){
// mapS maps a char to indexes of its occurrences in S
// rsS maps the index in S to that char's rank (index) in mapS
const [mapS, rsS] = mapString(S)
const [mapT, rsT] = mapString(T)
// h is used to memoize g
const h = {}
function g(c, rs, rt){
if (rs < 0 || rt < 0)
return 0
if (h.hasOwnProperty([c, rs, rt]))
return h[[c, rs, rt]]
// (We are guaranteed to be on
// a match in this state.)
let best = [1, c]
let idxS = mapS[c][rs]
let idxT = mapT[c][rt]
if (idxS == 0 || idxT == 0)
return best
for (let i=idxS-1; i>=Math.max(0, idxS - 1 - K); i--){
for (let j=idxT-1; j>=Math.max(0, idxT - 1 - K); j--){
if (S[i] == T[j]){
const [len, str] = g(S[i], rsS[i], rsT[j])
if (len + 1 >= best[0])
best = [len + 1, str + c]
}
}
}
return h[[c, rs, rt]] = best
}
let best = [0, '']
for (let c of Object.keys(mapS)){
for (let i=0; i<(mapS[c]||[]).length; i++){
for (let j=0; j<(mapT[c]||[]).length; j++){
let [len, str] = g(c, i, j)
if (len > best[0])
best = [len, str]
}
}
}
return best
}
function mapString(s){
let map = {}
let rs = []
for (let i=0; i<s.length; i++){
if (!map[s[i]]){
map[s[i]] = [i]
rs.push(0)
} else {
map[s[i]].push(i)
rs.push(map[s[i]].length - 1)
}
}
return [map, rs]
}
console.log(f('attgcgtagcaatg', 'tctcaggtcgatagtgac', 1))
console.log(f('aaaattttcccc', 'cccgggggaatatca', 1))
console.log(f('abcade', 'axe', 1))
https://stackoverflow.com/questions/55667123
复制相似问题