
How to automatically label sentences

Stack Overflow user
Asked 2019-12-25 18:38:10

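I have a CSV file (sentences.csv) containing a list of sentences. I want to automatically label each sentence that contains both a verb phrase, specifically but not only one of (will cause, get, can lead to, increase the risk of, what causes it, most likely to), and a disease from the list in a CSV file named diseases2.csv. If both are present the sentence should be marked 1, otherwise 0, in a new column of the same CSV file.

Here is the code I have so far: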

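import numpy as np  # needed for np.where below
import pandas as pd

file = pd.read_csv("sentences.csv")
diseases = pd.read_csv("diseases2.csv")

# Join the disease names into one regex alternation, e.g. "a|b|c".
pattern = '|'.join(diseases['Lists'])

# 1 if the sentence matches any disease name, otherwise 0.
file["useful/unuseful"] = np.where(file["STORY"].str.contains(pattern, na=False), 1, 0)

# index=False avoids writing an extra index column back into the file.
file.to_csv("sentences.csv", index=False)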

These are sample sentences from sentences.csv (verbs are in italics, diseases are in bold):

STORY
Overeating is the leading cause of obesity
-It is also found that men are more likely to experience heart attacks than women
-Heart attacks are also the second cause of death among cardiovascular diseases.
-According to a statistics released by the Ministry of Health, cardiovascular diseases are the second leading cause of death in UK after cancer.
-A talk about heart attack is presented by Doctor Lee

The code should produce these results in sentences.csv:

STORY                                                                                   useful/unuseful
Overeating is the leading cause of obesity                                                1
It is also found that men are more likely to experience heart attacks than women          1
Heart attacks are also the second cause of death among cardiovascular diseases.           1
According to a statistics released by the Ministry of Health, cardiovascular diseases     1
are the second leading cause of death in UK after cancer.
A talk about heart attack is presented by Doctor Lee                                      0
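
The requirement stated in the question (a verb phrase and a disease must both appear) could be sketched roughly as follows. This is only an illustration under assumptions: the `phrases` and `diseases` lists are hypothetical stand-ins chosen so that the sample rows match (the real lists would come from the question's verb list and from diseases2.csv), and `build_pattern` and `label_sentences` are made-up helper names.

```python
import re

import pandas as pd


def build_pattern(terms):
    """Join terms into one regex alternation, escaping regex metacharacters."""
    return "|".join(re.escape(term) for term in terms)


def label_sentences(df, phrases, diseases, text_col="STORY", label_col="useful/unuseful"):
    """Return a copy of df with label_col set to 1 only when a phrase AND a disease occur."""
    has_phrase = df[text_col].str.contains(build_pattern(phrases), case=False, na=False)
    has_disease = df[text_col].str.contains(build_pattern(diseases), case=False, na=False)
    out = df.copy()
    out[label_col] = (has_phrase & has_disease).astype(int)
    return out


# Sample sentences from the question; in practice they would be read from sentences.csv.
sentences = pd.DataFrame({"STORY": [
    "Overeating is the leading cause of obesity",
    "It is also found that men are more likely to experience heart attacks than women",
    "Heart attacks are also the second cause of death among cardiovascular diseases.",
    "According to a statistics released by the Ministry of Health, cardiovascular diseases "
    "are the second leading cause of death in UK after cancer.",
    "A talk about heart attack is presented by Doctor Lee",
]})

# Hypothetical lists, not taken from the original files.
phrases = ["cause of", "likely to"]
diseases = ["obesity", "heart attack", "cardiovascular disease", "cancer"]

labeled = label_sentences(sentences, phrases, diseases)
print(labeled["useful/unuseful"].tolist())
```

Matching is case-insensitive here so that "Heart attacks" at the start of a sentence still counts. The last sentence mentions a disease but none of the assumed phrases, so it is the only row labeled 0.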

1 Answer

Stack Overflow user

Answered 2019-12-25 20:17:30

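I wrote something like this and it works. The code takes the words to look for from check_list, so in your case you can build check_list from diseases['Lists'].values.tolist() plus any other strings you need.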

# `data` is assumed to be the DataFrame loaded from sentences.csv,
# e.g. data = pd.read_csv("sentences.csv").

# '*' is a regex metacharacter, so temporarily swap it for '#' (literal replacement).
data['STORY'] = data['STORY'].str.replace('*', '#', regex=False)

check_list = ['obesity', 'heart attacks', 'death', 'cause of']  # strings that set the label to 1

def is_contains(df, check_list):
    df['useful/unuseful'] = 0  # default label
    for check in check_list:
        first_check = f'#{check}#'      # term marked as italic, e.g. *cause of*
        second_check = f'##{check}##'   # term marked as bold, e.g. **obesity**
        df.loc[df['STORY'].str.contains(first_check, regex=False, na=False), 'useful/unuseful'] = 1
        df.loc[df['STORY'].str.contains(second_check, regex=False, na=False), 'useful/unuseful'] = 1
    # Restore the original '*' markers on the frame that was passed in.
    df['STORY'] = df['STORY'].str.replace('#', '*', regex=False)

is_contains(data, check_list)

The output is:

                                               STORY  useful/unuseful
0   Overeating is the *leading cause of* **obesity**                1
1  -It is also found that men are more likely *to...                1
2  -Heart attacks are also the second *cause of* ...                1
3  -According to a statistics released by the Min...                0
4  -A talk about heart attack is presented by Doc...                0
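
One way to build check_list from the disease column and match only the marked terms, without swapping '*' for another character, might look like the sketch below. It rests on assumptions: the in-memory `diseases` frame stands in for diseases2.csv (only the `Lists` column name comes from the question), `extra_phrases` is a hypothetical list, and the second sample row is written without markers purely for illustration.

```python
import re

import pandas as pd

# Hypothetical stand-in for pd.read_csv("diseases2.csv").
diseases = pd.DataFrame({"Lists": ["obesity", "heart attacks", None, " death "]})
extra_phrases = ["cause of"]  # assumed additional strings to look for

# Drop missing values, trim whitespace, then append the extra phrases.
check_list = diseases["Lists"].dropna().str.strip().tolist() + extra_phrases

# A term counts only when an asterisk sits directly on each side of it.
# '*term*' also occurs inside '**term**', so one pattern covers both markers.
marked = "|".join(r"\*" + re.escape(term) + r"\*" for term in check_list)

data = pd.DataFrame({"STORY": [
    "Overeating is the *leading cause of* **obesity**",
    "-A talk about heart attack is presented by Doctor Lee",
]})
data["useful/unuseful"] = data["STORY"].str.contains(marked, na=False).astype(int)
print(data)
```

Escaping each term with `re.escape` keeps the asterisks and any other metacharacters literal, so the text in the STORY column never has to be modified and restored.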
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/59477532
