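I have a text and a list of concepts, as follows.

    concepts = ["data mining", "data", "data source"]
    text = "levels and data mining of dna data source methylation"

I want to check whether the concepts in the list occur in text, and replace every occurrence of concepts[1:] with concepts[0]. So the result for the text above should be:

    "levels and data mining of dna data mining methylation"

My code is as follows: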

    concepts = ["data mining", "data", "data source"]
    text = "levels and data mining of dna data source methylation"

    if any(word in text for word in concepts):
        for terms in concepts[1:]:
            if terms in text:
                text = text.replace(terms, concepts[0])
            text = ' '.join(text.split())
        print(text)

However, the output I get is:

    levels and data mining mining of dna data mining source methylation

It looks like the concept data gets replaced by data mining, which is incorrect. More specifically, I want the longest option to be considered first when replacing.

Even if I change the order of concepts, it does not work.

    concepts = ["data mining", "data source", "data"]
    text = "levels and data mining of dna data source methylation"

    if any(word in text for word in concepts):
        for terms in concepts[1:]:
            if terms in text:
                text = text.replace(terms, concepts[0])
            text = ' '.join(text.split())
        print(text)

I get the following output for the code above.

    levels and data mining mining of dna data mining mining methylation

I am happy to provide more details if needed.
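
Answered 2019-01-22 00:15:09

The problem here is your iteration strategy, which performs one replacement at a time. Because your replacement text contains one of the terms being replaced, a later iteration ends up making replacements inside text that an earlier iteration already changed into the replacement.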
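
To make that cascade concrete, here is a minimal sketch that replays the loop from the question and records the text after each replacement step. The helper name `replace_sequentially` is hypothetical, introduced only for this illustration.

```python
def replace_sequentially(text, targets, replacement):
    """Apply str.replace once per target, in order, and record each intermediate text."""
    steps = []
    for target in targets:
        text = text.replace(target, replacement)
        steps.append((target, text))
    return steps


concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

for target, intermediate in replace_sequentially(text, concepts[1:], concepts[0]):
    print(f"after replacing {target!r}: {intermediate}")
```

The first step already produces the desired text; the second step then rewrites every `data`, including the ones inside the `data mining` that the first step had just inserted, which is where the doubled `mining` comes from.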
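
One way to solve this is to perform all of the replacements atomically, so that they happen simultaneously and the output of one replacement cannot affect the result of another. There are a few strategies for this:

An example of #2 is the sub() function of Python's re module. Here is an example that uses it: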

    import re

    concepts = ["data mining", "data source", "data"]
    text = "levels and data mining of dna data source methylation"

    # Sort targets by descending length, so longer targets that
    # might contain shorter ones are found first
    targets = sorted(concepts[1:], key=len, reverse=True)

    # Use re.escape to generate versions of the targets with special characters escaped
    target_re = "|".join(re.escape(item) for item in targets)

    result = re.sub(target_re, concepts[0], text)
    print(result)  # levels and data mining mining of dna data mining methylation

Note that with your original set of replacements this still yields data mining mining, because the pattern has no notion of the mining that already follows data. If you want to avoid this, you can simply include the replacement itself among the replacement targets, so that it is matched before the shorter term:

    import re

    concepts = ["data mining", "data source", "data"]
    text = "levels and data mining of dna data source methylation"

    # Sort targets by descending length, so longer targets that
    # might contain shorter ones are found first
    #
    # !!!No [1:] !!!
    #
    targets = sorted(concepts, key=len, reverse=True)

    # Use re.escape to generate versions of the targets with special characters escaped
    target_re = "|".join(re.escape(item) for item in targets)

    result = re.sub(target_re, concepts[0], text)
    print(result)  # levels and data mining of dna data mining methylation

Answered 2019-01-22 01:01:39
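
Amber's solution is clean. I wrote a long-form version, with some comments, that walks through the words and looks ahead to check for matches. It should help with the ideas your original code was missing (checking for multi-word matches and avoiding double replacement). Note that it only handles replacements with the same number of words as concepts[0], or single-word matches, so it will not work for every "concepts" list.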

    concepts = ["data mining", "data source", "data"]
    text = "levels and data mining of dna data source methylation"
    textSplit = text.split()
    maxX = len(textSplit)
    tempSplit = concepts[0].split()
    tempMax = len(tempSplit)

    # A while loop is used so that x can really skip past a match;
    # reassigning the variable of a for loop does not affect the next iteration.
    x = 0
    while x < maxX:
        # look ahead for a multi-word match of concepts[0] itself
        foundFullMatch = True
        for y in range(0, tempMax):
            if x + tempMax <= maxX:
                if textSplit[x + y] != tempSplit[y]:
                    foundFullMatch = False
            else:
                foundFullMatch = False
        if foundFullMatch:
            # already the replacement: skip past it in the loop
            x = x + tempMax
            continue
        # now look at the rest of the list - make sure it is sorted with most words first
        step = 1
        for terms in concepts[1:]:
            tempSplit2 = terms.split()
            tempMax2 = len(tempSplit2)
            foundFullMatch = True
            for y in range(0, tempMax2):
                if x + tempMax2 <= maxX:
                    if textSplit[x + y] != tempSplit2[y]:
                        foundFullMatch = False
                else:
                    foundFullMatch = False
            if foundFullMatch:
                if tempMax == tempMax2:
                    # found a match with the same number of words - replace
                    for y in range(0, tempMax2):
                        textSplit[x + y] = tempSplit[y]
                    step = tempMax
                    break
                elif tempMax2 == 1:
                    # match with a different word count: only the one-word case is covered
                    textSplit[x] = concepts[0]
                    break
        x = x + step
    print(" ".join(textSplit))

https://stackoverflow.com/questions/54299475
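
The word-by-word look-ahead above can also be written with list slices, which removes the restriction on word counts. The following is only a sketch under the assumption that words are separated by whitespace and must match exactly; the function name `replace_concepts` is hypothetical and not part of either answer.

```python
def replace_concepts(text, concepts):
    """Replace every occurrence of concepts[1:] with concepts[0], word by word.

    Candidates with more words are tried first, and words are consumed left
    to right, so text that was already matched is never examined again.
    """
    replacement = concepts[0].split()
    # concepts[0] is kept among the candidates so that an existing occurrence
    # is consumed whole instead of being partially rewritten.
    candidates = sorted((c.split() for c in concepts), key=len, reverse=True)
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        for cand in candidates:
            if cand and words[i:i + len(cand)] == cand:
                out.extend(replacement)
                i += len(cand)
                break
        else:
            # No concept starts at this word: keep it unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"
print(replace_concepts(text, concepts))
```

Because matching works on whole words, a term such as `data` is not replaced inside a longer word, which differs from the substring behavior of `str.replace`.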