Q: Longest common substring without cutting a word - python

Stack Overflow user

Asked 2014-03-29 02:15:49

9 answers · 3.7K views · 0 followers · 3 votes

Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in range(1, 1 + len(s1)):
    for y in range(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

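print(longest_common_substring(s1, s2))

Output:

foo bar

But how can I make sure that the longest common substring respects English word boundaries and does not cut a word in half? For example, take the following sentences: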

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
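print(longest_common_substring(s1, s2))

This outputs the following, which is NOT what I want, because it cuts into the word kappa from s2:

a foo bar

The desired output is still:

foo bar

I have also tried an n-gram approach to get the longest common substring that respects word boundaries, but is there another way that works on the strings without computing n-grams? (see answer)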
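The n-gram code the asker refers to is not shown in this thread. As a rough sketch of what such an approach could look like (the function names `word_ngrams` and `longest_common_ngram` are made up for illustration), one can build the sets of word n-grams for both sentences and look for the largest n with a non-empty intersection:

```python
def word_ngrams(words, n):
    """Return the set of all runs of n consecutive words, as tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_common_ngram(s1, s2):
    """Longest run of whole words shared by both sentences ('' if none)."""
    w1, w2 = s1.split(), s2.split()
    # Try the largest possible n first and stop at the first hit.
    for n in range(min(len(w1), len(w2)), 0, -1):
        common = word_ngrams(w1, n) & word_ngrams(w2, n)
        if common:
            # Several n-grams may tie; min() picks one deterministically.
            return " ".join(min(common))
    return ""


s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_ngram(s1, s2))  # prints: foo bar
```

This is simple but rebuilds the n-gram sets for every n, which is the overhead the question is trying to avoid.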

python

string

nlp

substring

longest-substring

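9 Answers

Stack Overflow user

Accepted answer

Posted 2014-04-14 16:29:43

This is simpler than it looks. Your code already does 75% of the job. I first split each sentence into words and then pass the word lists to your function to get the longest common substring (in this case, the longest run of consecutive words). Your function then gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.

Here is a working copy online for you to test, verify, and fiddle with: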

http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in range(1, 1 + len(s1)):
    for y in range(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')  
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print(common_sentence)
# prints: foo bar
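The same word-level idea can also be sketched with the standard library instead of a hand-written table. This is an alternative technique, not the answerer's code: it uses `difflib.SequenceMatcher.find_longest_match` on the two word lists, and the wrapper name `longest_common_words` is hypothetical.

```python
from difflib import SequenceMatcher


def longest_common_words(s1, s2):
    """Longest run of consecutive whole words shared by s1 and s2."""
    w1, w2 = s1.split(), s2.split()
    # autojunk=False disables the heuristic that treats very frequent
    # elements as junk in long sequences.
    matcher = SequenceMatcher(None, w1, w2, autojunk=False)
    match = matcher.find_longest_match(0, len(w1), 0, len(w2))
    # match.a is the start index in w1, match.size the number of words.
    return " ".join(w1[match.a:match.a + match.size])


s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_words(s1, s2))  # prints: foo bar
```

Because the comparison is between whole tokens, a word such as kappa can never be matched partially.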

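Edge cases

'.' and '?' are also treated as valid words if there is a space between the last word and the punctuation mark, as in your example. If you leave out the space, they are counted as part of the last word, in which case 'sheep' and 'sheep?' are no longer the same word. It is up to you to decide how to handle such characters before calling the function. For example, you can strip them first:

import re
s1 = re.sub('[.?]', '', s1)
s2 = re.sub('[.?]', '', s2)

and then continue as usual.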
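Another way to handle the punctuation, sketched here under the assumption that punctuation should simply be dropped, is to extract the word tokens with a regular expression instead of splitting on spaces. The helper name `tokenize` is hypothetical.

```python
import re


def tokenize(sentence):
    """Return the word tokens of a sentence, ignoring punctuation."""
    # \w+ matches runs of letters, digits, and underscores, so a trailing
    # '?' or '.' is dropped whether or not a space precedes it.
    return re.findall(r"\w+", sentence)


print(tokenize("black sheep?"))   # prints: ['black', 'sheep']
print(tokenize("black sheep ?"))  # prints: ['black', 'sheep']
```

Note that this pattern also splits contractions such as "don't" into two tokens, so it may need adjusting for real text.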

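Votes: 10

Stack Overflow user

Posted 2014-04-14 16:12:34

My answer does not draw on any official source; it is just a simple observation. At least in my installation, your LCS function gives different output for the pairs (s1, s2) and (s1, s3):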

In [1]: s1 = "this is a foo bar sentence ."

In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"

In [4]: s3 = "what a kappa foo bar black sheep ?"

In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '

In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '

As you may notice, when complete words are matched, the surrounding whitespace is matched as well.
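That observation can be turned into a small check. The following is only an illustrative sketch with a hypothetical helper name, `is_whole_word_match`: it pads both the sentence and the candidate with single spaces, so the candidate is found only if it is delimited by whitespace (or the string ends) on both sides.

```python
def is_whole_word_match(sentence, fragment):
    """True if fragment occurs in sentence as a run of complete words."""
    words = fragment.split()
    if not words:
        return False
    # Normalise whitespace and pad with one space on each side so that
    # the start and end of the sentence also count as word boundaries.
    padded_sentence = " " + " ".join(sentence.split()) + " "
    padded_fragment = " " + " ".join(words) + " "
    return padded_fragment in padded_sentence


s3 = "what a kappa foo bar black sheep ?"
print(is_whole_word_match(s3, "a foo bar"))  # prints: False
print(is_whole_word_match(s3, "foo bar"))    # prints: True
```

Here 'a foo bar' is rejected for s3 because its leading 'a' is the tail of kappa rather than a word of its own.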

You can then modify the function just before it returns its output, like this:

answer = s1[x_longest - longest: x_longest]
if not (answer.startswith(" ") and answer.endswith(" ")):
    return longest_common_substring(s1, answer[1:])
else:
    return answer

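I am sure there are other edge cases, such as the substring appearing at the end of the string, whether to call the function recursively with s1 or s2, whether to trim the front or the back of answer, and so on. But at least in the cases you show, this simple modification does what you want: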

In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '

Do you think this direction is worth exploring?

Votes: 1

Stack Overflow user

Posted 2014-04-14 19:44:01

Just add an acceptance condition to your code:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in range(1, 1 + len(s1)):
    for y in range(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    ... same as above ...
    # check end of match in s1
    ... your code is a bit hard for me follow, what is end of match? ...
    # check end of match in s2
    ... same as above ...
    return True

print(longest_common_substring(s1, s2))
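In the table used above, x and y index the end of a match, and its start is x - m[x][y]. A check applied only to the longest run ending at a cell can therefore reject 'a foo bar' without ever considering its word-aligned suffix 'foo bar'. The sketch below is one possible variant, not the answerer's code: it changes the table so that a run may only begin at a word start in both strings, and it accepts a run only where both strings are at a word end. The names `longest_common_word_run`, `starts_word`, and `ends_word` are hypothetical, and whitespace is assumed to be the only word separator.

```python
def longest_common_word_run(s1, s2):
    """Longest common substring aligned on word boundaries in both strings."""

    def starts_word(s, i):
        # s[i] is a non-space character at the beginning of a word.
        return not s[i].isspace() and (i == 0 or s[i - 1].isspace())

    def ends_word(s, i):
        # s[i] is a non-space character at the end of a word.
        return not s[i].isspace() and (i == len(s) - 1 or s[i + 1].isspace())

    # m[x][y]: length of the common run ending at s1[x-1] and s2[y-1] that
    # began at a word start in both strings; 0 if there is no such run.
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] != s2[y - 1]:
                continue
            if m[x - 1][y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1  # extend an aligned run
            elif starts_word(s1, x - 1) and starts_word(s2, y - 1):
                m[x][y] = 1                    # open a new aligned run
            # Acceptance condition: the run must also end on a word end.
            if (m[x][y] > longest
                    and ends_word(s1, x - 1) and ends_word(s2, y - 1)):
                longest, x_longest = m[x][y], x
    return s1[x_longest - longest:x_longest]


s1 = "this is a foo bar sentence ."
s3 = "what a kappa foo bar black sheep ?"
print(longest_common_word_run(s1, s3))  # prints: foo bar
```

Unlike the word-splitting approach, this keeps working at the character level, so the returned substring preserves the original spacing of s1.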

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/22726177
