Question: Create a new dictionary from a given text

Stack Overflow user

Asked on 2021-06-06 16:49:37

Answers: 1 | Views: 239 | Followers: 0 | Votes: 1

I have the following variable:

data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
        {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})

data[1]['entities'][0] = (48, 54, 'Category 1') stands for (start_offset, end_offset, entity).
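As a quick check of the offset convention, slicing the text with each (start_offset, end_offset) pair should recover the entity's surface text. The sketch below assumes the offsets are character indices with an exclusive end, which is what the sample data suggests; the names `text`, `entities`, and `spans` are only illustrative.

```python
text = ("Thousands of demonstrators have marched through London to protest "
        "the war in Iraq and demand the withdrawal of British troops from "
        "that country. Many people have been killed that day.")
entities = [(48, 54, 'Category 1'), (77, 81, 'Category 1'),
            (111, 118, 'Category 2'), (150, 173, 'Category 3')]

# Slice the text with each (start, end) pair; the end offset is exclusive.
spans = [(text[start:end], label) for start, end, label in entities]
for surface, label in spans:
    print(f"{surface!r} -> {label}")
```

Note that the last entity covers four words, so a single offset pair can map to several tokens.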

I want to read every word of data[0] and tag it according to the entities in data[1]. The final output I expect is:

{
'Thousands': 'O', 
'of': 'O',
'demonstrators': 'O',
'have': 'O',
'marched': 'O',
'through': 'O',
'London': 'S-1',
'to': 'O', 
'protest': 'O', 
'the': 'O', 
'war': 'O', 
'in': 'O', 
'Iraq': 'S-1',
'and': 'O',
'demand': 'O', 
'the': 'O', 
'withdrawal': 'O', 
'of': 'O', 
'British': 'S-2', 
'troops': 'O', 
'from': 'O',
'that': 'O', 
'country': 'O',
'.': 'O',
'Many': 'O', 
'people': 'S-3', 
'have': 'B-3', 
'been': 'B-3', 
'killed': 'E-3', 
'that': 'O', 
'day': 'O',
'.': 'O'
}

Here, 'O' stands for 'OutOfEntity', 'S' stands for 'Start', 'B' stands for 'Between', and 'E' stands for 'End'. The tags are unique for each given text.
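To make the tag scheme concrete, here is a minimal sketch of how the tags for a single entity could be generated from the number of tokens it covers. The helper name `span_tags` is hypothetical, and the rule that a one-token entity gets only 'S-' is inferred from the expected output above.

```python
def span_tags(num_tokens, category):
    """Return the tags for one entity that spans num_tokens tokens."""
    if num_tokens < 1:
        raise ValueError("an entity must cover at least one token")
    if num_tokens == 1:
        # A single-token entity only gets the 'S' (Start) tag
        return [f"S-{category}"]
    # First token 'S', last token 'E', everything in between 'B'
    middle = [f"B-{category}"] * (num_tokens - 2)
    return [f"S-{category}", *middle, f"E-{category}"]


print(span_tags(1, "1"))  # ['S-1']
print(span_tags(4, "3"))  # ['S-3', 'B-3', 'B-3', 'E-3']
```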

I tried the following:

import re

# Map each entity's surface text to the number in its category label
entities = {}
offsets = data[1]['entities']
for entity in offsets:
    entities[data[0][entity[0]:entity[1]]] = re.findall('[0-9]+', entity[2])[0]

# Tag the words of each entity: 'S-' first, 'B-' middle, 'E-' last
tags = {}
for key, value in entities.items():
    entity = key.split()
    if len(entity) > 1:
        bEntity = entity[1:-1]
        tags[entity[0]] = 'S-'+value
        tags[entity[-1]] = 'E-'+value
        for item in bEntity:
            tags[item] = 'B-'+value
    else:
        tags[entity[0]] = 'S-'+value
The output is:

{'London': 'S-1',
 'Iraq': 'S-1',
 'British': 'S-2',
 'people': 'S-3',
 'killed': 'E-3',
 'have': 'B-3',
 'been': 'B-3'}

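From this point on, I am stuck on how to handle the 'O' entities. I would also like to make the code more efficient and more readable. I suspect a dictionary is not the right data structure here, because the same word can occur more than once and would then be used as the same key.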
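The concern about repeated words can be illustrated with a small sketch. The word/tag pairs below are taken from the expected output; the variable names are only illustrative. A dict keeps one value per key, so a later occurrence of a word overwrites the earlier tag, while a list of (word, tag) pairs keeps every occurrence in order.

```python
pairs = [("have", "O"), ("people", "S-3"), ("have", "B-3")]

# A dict keeps a single value per key: the second 'have' replaces the first.
as_dict = {}
for word, tag in pairs:
    as_dict[word] = tag
print(as_dict)        # {'have': 'B-3', 'people': 'S-3'}

# A list of pairs preserves both occurrences and their order.
as_pairs = list(pairs)
print(len(as_dict), len(as_pairs))  # 2 3
```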

python

dictionary

nlp

named-entity-recognition

1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-06-06 19:37:49

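import re

import nltk  # word_tokenize needs NLTK's tokenizer data to be installed


def ner(data):
    # Map each entity's surface text to the number in its category label
    entities = {}
    offsets = data[1]['entities']
    for entity in offsets:
        entities[data[0][int(entity[0]):int(entity[1])]] = re.findall('[0-9]+', entity[2])[0]

    # Tag entity words: 'S-' first word, 'B-' middle words, 'E-' last word
    tags = []
    for key, value in entities.items():
        entity = key.split()
        if len(entity) > 1:
            bEntity = entity[1:-1]
            tags.append((entity[0], 'S-'+value))
            for item in bEntity:
                tags.append((item, 'B-'+value))
            tags.append((entity[-1], 'E-'+value))
        else:
            tags.append((entity[0], 'S-'+value))

    # Every token whose text does not appear in an entity is tagged 'O'.
    # Matching is by token text, and 'O' tokens are appended after the
    # entity tokens rather than in their original text order.
    tokens = nltk.word_tokenize(data[0])
    tagged_words = {word for word, _ in tags}
    OTokens = [(token, 'O') for token in tokens if token not in tagged_words]
    tags.extend(OTokens)

    return tags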
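
As an alternative sketch (not part of the accepted answer), the tags can be assigned by character offset instead of by word text, which keeps the tokens in their original order and lets a repeated word such as 'have' carry different tags at different positions. The function name `tag_tokens` and the regex tokenizer are assumptions made for illustration; the tokenizer treats runs of word characters as one token and every other non-space character as its own token, which may differ from what `nltk.word_tokenize` produces.

```python
import re

data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
        {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})


def tag_tokens(data):
    """Return (token, tag) pairs in text order (hypothetical helper)."""
    text, annotations = data
    # Tokenize with a regex so each token keeps its character offsets.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]

    tags = ["O"] * len(tokens)
    for start, end, label in annotations["entities"]:
        # Category number from the label, e.g. 'Category 3' -> '3'
        category = re.search(r"\d+", label).group()
        # Indices of the tokens that lie entirely inside the entity span
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= start and e <= end]
        for pos, i in enumerate(inside):
            if pos == 0:
                tags[i] = f"S-{category}"
            elif pos == len(inside) - 1:
                tags[i] = f"E-{category}"
            else:
                tags[i] = f"B-{category}"
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


result = tag_tokens(data)
for token, tag in result:
    print(token, tag)
```

Returning a list of pairs rather than a dict is a deliberate choice here: it sidesteps the duplicate-key problem raised in the question. If a dict is still required, the position of each token could be used as the key.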

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/67861522
