Question: Iterating over passages for string extraction

Stack Overflow user

Asked 2016-10-14 20:33:40

Answers: 1 · Views: 672 · Followers: 0 · Votes: 0

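I found this problem on a forum. The task: you are given a sequence of passages and must filter out any passage whose text (a sequence of whitespace-delimited words) is wholly contained as a sub-passage of one or more of the other passages.

When comparing for containment, certain rules must be followed: the case of alphabetic characters should be ignored; leading and trailing whitespace should be ignored; any other block of contiguous whitespace should be treated as a single space; non-alphanumeric characters should be ignored, while whitespace is retained. Duplicates must also be filtered: if two passages are considered equal under the comparison rules listed above, only the shortest should be retained. If they are also the same length, the earlier one in the input sequence should be kept. The retained passages should be output in their original form (identical to the input passage) and in the same order.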

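Input1: IBM cognitive computing|IBM "cognitive" computing is a revolution| ibm cognitive computing|'IBM Cognitive Computing' is a revolution?

Output1: IBM "cognitive" computing is a revolution

Input2: IBM cognitive computing|IBM "cognitive" computing is a revolution|the cognitive computing is a revolution

Output2: IBM "cognitive" computing is a revolution|the cognitive computing is a revolution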
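The comparison and duplicate rules above can be sketched as follows. This is only an illustration under stated assumptions: `normalize` and `pick_duplicate_winner` are hypothetical names that do not appear in the original problem, and "length" is taken to mean the character count of the original passage.

```python
import re

def normalize(passage):
    """Reduce a passage to the form used for comparison only."""
    text = re.sub(r'[^a-zA-Z0-9\s]', '', passage)  # drop non-alphanumerics, keep whitespace
    text = re.sub(r'\s+', ' ', text)               # collapse whitespace runs to one space
    return text.strip().lower()                    # ignore leading/trailing space and case

def pick_duplicate_winner(passages):
    """Among passages that normalize to the same text, keep the shortest (earliest on ties)."""
    # min() returns the first minimal element, so an earlier passage wins a tie.
    return min(passages, key=len)

duplicates = ['IBM "cognitive" computing is a revolution',
              "'IBM Cognitive Computing' is a revolution?"]
print([normalize(p) for p in duplicates])
print(pick_duplicate_winner(duplicates))
```

Both passages normalize to the same text, so they count as duplicates, and the first is kept because its original form is shorter.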

I wrote the following code in Python, but it gives me a different output from the expected one for the first test case:

f = open("input.txt",'r')
s = (f.read()).split('|')
str = ''
for a in s:
    for b in s:
        if(''.join(e for e in a.lower() if e.isalnum()))not in (''.join(e for e in b.lower() if e.isalnum())):
            str = a.translate(None, "'?")

print str

input.txt contains the first test case input. The output I get is: IBM Cognitive Computing is a revolution. Can someone help me? Thanks.
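For reference, here is a Python 3 rendering of the loop above, as a sketch of where that output appears to come from. The names `squash` and `last_assignment` are hypothetical, and `str.maketrans` stands in for the Python 2 call `translate(None, "'?")`. The variable is overwritten every time the condition holds, so only the last passage that is not contained in some other passage survives.

```python
def squash(text):
    # Same idea as the question's code: keep only alphanumerics, lower-cased.
    return ''.join(ch for ch in text.lower() if ch.isalnum())

def last_assignment(passages):
    result = ''
    for a in passages:
        for b in passages:
            if squash(a) not in squash(b):
                # Each hit overwrites the previous one; nothing is collected or filtered.
                result = a.translate(str.maketrans('', '', "'?"))
    return result

data = ('IBM cognitive computing|IBM "cognitive" computing is a revolution|'
        ' ibm cognitive computing|\'IBM Cognitive Computing\' is a revolution?')
print(last_assignment(data.split('|')))  # IBM Cognitive Computing is a revolution
```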

python

string

1 Answer

Stack Overflow user

Accepted answer

Posted 2016-10-15 00:42:33

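I actually wrote this code for you, and I hope it is easy to follow (I deliberately left it less optimized for that reason). Let me know if you need it to run faster or if you have any questions!

By the way, this is Python 3; just remove the parentheses from the print statements and it will copy, paste, and run under Python 2.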

import re

def clean_input(text):
    # Non-alphanumeric characters should be ignored; whitespace is retained
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Any other block of contiguous whitespace should be treated as a single space
    text = re.sub(r'\s+', ' ', text)
    #Leading and trailing whitespace should be ignored
    text = text.strip(' \t\n\r')
    # The case of alphabetic characters should be ignored
    text = text.lower()
    return text

def process(text):
    #If they are also the same length, the earlier one in the input sequence should be kept.
    # Using lists (an OrderedDict would also work and is probably neater, but the version below is easier to follow).
    original_parts = text.split('|')
    clean_parts = [clean_input(x) for x in original_parts]
    original_parts_to_check = []
    ignore_idx = []
    for idx, ele in enumerate(original_parts):
        if idx in ignore_idx:
            continue
        # Skip passages shorter than two characters
        if len(ele) < 2:
            continue
        #Duplicates must also be filtered -if two passages are considered equal with respect to the comparison rules listed above, only the shortest should be retained.
        if clean_parts[idx] in clean_parts[:idx]+clean_parts[idx+1:]:
            indices = [i for i, x in enumerate(clean_parts) if x == clean_parts[idx]]
            use = indices[0]
            for i in indices[1:]:
                if len(original_parts[i]) < len(original_parts[use]):
                    use = i
            if idx == use:
                ignore_idx += indices
            else:
                ignore_idx += [x for x in indices if x != use]
                continue
        original_parts_to_check.append(idx)
    # Doing the text in text matching here. Depending on size and type of dataset,
    # Which you should test as it would affect this, you may want this as part
    # of the function above, or before, etc. If you're only doing 100 bits of
    # data, then it doesn't matter. Optimize accordingly.
    text_to_return = []
    clean_to_check = [clean_parts[x] for x in original_parts_to_check]
    for idx in original_parts_to_check:
        # This bit can be done better, but I have no more time to work on this.
        if any([(clean_parts[idx] in clean_text) for clean_text in [x for x in clean_to_check if x != clean_parts[idx]]]):
            continue
        text_to_return.append(original_parts[idx])
    #The retained passages should be output in their original form (identical to the input passage), and in the same order.
    return '|'.join(text_to_return)

assert(process('IBM cognitive computing|IBM "cognitive" computing is a revolution| ibm cognitive computing|\'IBM Cognitive Computing\' is a revolution?') ==
       'IBM "cognitive" computing is a revolution')
print(process('IBM cognitive computing|IBM "cognitive" computing is a revolution| ibm cognitive computing|\'IBM Cognitive Computing\' is a revolution?'))
assert(process('IBM cognitive computing|IBM "cognitive" computing is a revolution|the cognitive computing is a revolution') ==
       'IBM "cognitive" computing is a revolution|the cognitive computing is a revolution')
print(process('IBM cognitive computing|IBM "cognitive" computing is a revolution|the cognitive computing is a revolution'))
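One caveat about the `in` test inside `process`: it is a plain substring check, so a passage that ends in the middle of a word would also count as contained. A minimal sketch of one way to restrict matches to whole words, assuming both strings have already been through `clean_input` (single spaces, no leading or trailing whitespace); `contains_words` is a hypothetical helper and not part of the answer above:

```python
def contains_words(inner, outer):
    """Return True if the words of `inner` appear as a contiguous run of words in `outer`."""
    # Padding both strings with a space forces a match to start and end on word boundaries.
    return (' ' + inner + ' ') in (' ' + outer + ' ')

print(contains_words('cognitive computing', 'ibm cognitive computing is a revolution'))  # True
print(contains_words('cognitive comp', 'ibm cognitive computing is a revolution'))       # False
```

Swapping this in for the `clean_parts[idx] in clean_text` expression would leave both test cases above unchanged, since every contained passage there already ends on a word boundary.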

Also, if this helped you, some points would be nice, so an accept would be appreciated :) (I see you are new here).

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/40051565
