Given the following, I can find the longest common substring:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
    # m[x][y] is the length of the common substring ending at s1[x - 1] and s2[y - 1]
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
print longest_common_substring(s1, s2)
Output:
foo bar
But how do I make sure that the longest common substring respects English word boundaries and does not cut a word in half? For example, take the following sentences:
s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
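print longest_common_substring(s1, s2)
This outputs the following, which is not desired because it splits the word kappa from s2:
a foo bar
The desired output is still:
foo bar
I have also tried a way of getting the longest common substring that takes word boundaries into account, but is there another way that works on the strings without computing n-grams? (see answer)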
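For context, an n-gram-based approach of the kind the question alludes to might look like the sketch below. This is only a guess at the general shape, not the asker's actual code: the helper names `word_ngrams` and `longest_common_ngram` are hypothetical, and the sketch is written for Python 3, whereas the code in this thread is Python 2.

```python
def word_ngrams(words, n):
    """Return the set of all runs of n consecutive words, as tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_common_ngram(s1, s2):
    """Return the longest run of whole words shared by s1 and s2."""
    w1, w2 = s1.split(), s2.split()
    # try the largest n first and stop at the first shared n-gram
    for n in range(min(len(w1), len(w2)), 0, -1):
        grams2 = word_ngrams(w2, n)
        # walk s1 left to right so ties resolve to the earliest run in s1
        for i in range(len(w1) - n + 1):
            if tuple(w1[i:i + n]) in grams2:
                return " ".join(w1[i:i + n])
    return ""


s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_ngram(s1, s2))  # prints: foo bar
```

Because it compares whole words, a word such as kappa can never be cut, but it builds a set of n-grams for every candidate length, which is the cost the question hopes to avoid.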
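Posted on 2014-04-14 16:29:43
This is simpler than it looks. Your code already does about 75% of the job. I first split each sentence into words and then pass the word lists to your function, so that it finds the longest common run of consecutive words rather than characters. Your function then gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.
Here is a working copy online for you to test, verify, and tinker with.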
http://repl.it/RU0/1
def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
def longest_common_sentence(s1, s2):
    # compare lists of words instead of strings of characters
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')
    return ' '.join(longest_common_substring(s1_words, s2_words))
s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'
Edge cases: when splitting on spaces, punctuation such as '.' and '?' counts as a word of its own, or as part of the word it is attached to. To avoid that, strip it first:
import re
s1 = re.sub('[.?]','', s1)
s2 = re.sub('[.?]','', s2)
Then continue as usual.
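The word-level idea and the punctuation handling can be combined into one self-contained sketch. The version below is written for Python 3 (the code above is Python 2), and the names `tokenize` and `longest_common_words` are hypothetical. It assumes that a "word" is any run of `\w` characters, so punctuation is dropped whether or not a space precedes it; note that this also splits contractions such as "don't" into two tokens.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence)


def longest_common_words(s1, s2):
    """Return the longest run of consecutive words common to both sentences."""
    a, b = tokenize(s1), tokenize(s2)
    # m[x][y] = length of the common run of words ending at a[x - 1] and b[y - 1]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    longest, x_longest = 0, 0
    for x in range(1, len(a) + 1):
        for y in range(1, len(b) + 1):
            if a[x - 1] == b[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
    return " ".join(a[x_longest - longest:x_longest])


print(longest_common_words("this is a foo bar sentence.",
                           "what a kappa foo bar black sheep?"))  # prints: foo bar
```

Tokenizing with a regular expression rather than `split(' ')` means "sheep?" and "sheep ?" both yield the token "sheep", which covers the edge case described above without a separate `re.sub` pass.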
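Posted on 2014-04-14 16:12:34
My answer does not draw on any official source; it is just a simple observation. At least on my installation, the output of your LCS function for the pair (s1, s2) differs from its output for the pair (s1, s3):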
In [1]: s1 = "this is a foo bar sentence ."
In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"
In [4]: s3 = "what a kappa foo bar black sheep ?"
In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '
In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '
As you may notice, when complete words are matched, the surrounding spaces are matched as well.
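You can then modify the output just before the function returns it, like this:
answer = s1[x_longest - longest: x_longest]
# the check for an empty answer stops the recursion when nothing is left to trim
if answer and not (answer.startswith(" ") and answer.endswith(" ")):
    return longest_common_substring(s1, answer[1:])
else:
    return answer
I am sure there are other edge cases: a substring that occurs at the very end of a string, whether to call the function recursively with s1 or with s2, whether to trim the front or the back of answer, and so on. But at least in the cases shown, this simple modification does what you need: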
In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '
Do you think this direction is worth exploring?
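The observation that whole-word matches carry their surrounding spaces can also be built into the table itself instead of trimming afterwards. The sketch below is one possible way to do that, not the method from this answer: it pads both strings with a space, lets a common run begin only on a space, and records a run only when it also ends on a space. The function name `longest_common_words_by_spaces` is hypothetical, the code is Python 3, and it assumes words are separated by single spaces.

```python
def longest_common_words_by_spaces(s1, s2):
    """Longest common substring that starts and ends on a space in both strings."""
    # pad so that words at the very start or end are also surrounded by spaces
    a, b = " " + s1 + " ", " " + s2 + " "
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    longest, x_longest = 0, 0
    for x in range(1, len(a) + 1):
        for y in range(1, len(b) + 1):
            if a[x - 1] != b[y - 1]:
                continue  # cell stays 0
            is_space = a[x - 1] == " "
            # a run may only begin on a space; once begun, it extends as usual
            if m[x - 1][y - 1] > 0 or is_space:
                m[x][y] = m[x - 1][y - 1] + 1
            # only accept runs that also end on a space
            if is_space and m[x][y] > longest:
                longest, x_longest = m[x][y], x
    return a[x_longest - longest:x_longest].strip()


s1 = "this is a foo bar sentence ."
s3 = "what a kappa foo bar black sheep ?"
print(longest_common_words_by_spaces(s1, s3))  # prints: foo bar
```

Since every accepted run is delimited by a space on both sides in both strings, everything between those spaces is made of whole words, so the trailing "a" of kappa can no longer join the match.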
Posted on 2014-04-14 19:44:01
Just add an acceptance condition to your code:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # NOTE: in longest_common_substring, x and y are where the match ends
    # (it spans s1[x - length:x]), so a complete version would need to use
    # x - length and y - length for the start of the match
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 >= 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    # ... same as above ...
    # check end of match in s1
    # ... your code is a bit hard for me to follow, what is end of match? ...
    # check end of match in s2
    # ... same as above ...
    return True
print longest_common_substring(s1, s2)
https://stackoverflow.com/questions/22726177