Q: How to automatically annotate labels

Stack Overflow user

Asked on 2019-12-25 18:38:10

Answers: 1 · Views: 44 · Followers: 0 · Votes: 1

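I have a csv file (sentences.csv) containing a list of sentences. I want to automatically label the sentences that contain both a verb phrase, specifically but not only (will cause, get, can lead to, increase the risk of, the cause of, most likely), and a disease from the list in a csv file called diseases2.csv. If both are present, the sentence should be labeled 1, otherwise 0, in a new column of the same csv file.

Here is the code I have so far: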

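import numpy as np  # needed for np.where below
import pandas as pd


file = pd.read_csv("sentences.csv")
diseases = pd.read_csv("diseases2.csv")


# Build one regex alternation out of all disease names
pattern = '|'.join(diseases['Lists'])


# 1 if the sentence mentions any disease, else 0 (verb phrases are not checked yet)
file["useful/unuseful"] = np.where(file["STORY"].str.contains(pattern, na=False), 1, 0)


file.to_csv("sentences.csv")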

These are sample sentences from sentences.csv (verbs are in italics, diseases in bold):

STORY
Overeating is the leading cause of obesity
-It is also found that men are more likely to experience heart attacks than women
-Heart attacks are also the second cause of death among cardiovascular diseases.
-According to a statistics released by the Ministry of Health, cardiovascular diseases are the second leading cause of death in UK after cancer.
-A talk about heart attack is presented by Doctor Lee

The code should produce these results in sentences.csv:

STORY                                                                                   useful/unuseful
Overeating is the leading cause of obesity                                                1
It is also found that men are more likely to experience heart attacks than women          1
Heart attacks are also the second cause of death among cardiovascular diseases.           1
According to a statistics released by the Ministry of Health, cardiovascular diseases     1
are the second leading cause of death in UK after cancer.
A talk about heart attack is presented by Doctor Lee                                      0
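
The rule described above (label 1 only when a verb phrase and a disease both occur) can be sketched roughly as follows. This is a minimal illustration, not the asker's or the answerer's code: the `verb_phrases` and `disease_names` lists and the `any_of` helper are assumptions made for the example, since the actual contents of diseases2.csv are not shown, and the sentences are the plain-text samples from the question.

```python
import re
import pandas as pd

# Sample sentences from the question (plain text, no italic/bold markers)
sentences = pd.DataFrame({"STORY": [
    "Overeating is the leading cause of obesity",
    "It is also found that men are more likely to experience heart attacks than women",
    "Heart attacks are also the second cause of death among cardiovascular diseases.",
    "According to a statistics released by the Ministry of Health, cardiovascular "
    "diseases are the second leading cause of death in UK after cancer.",
    "A talk about heart attack is presented by Doctor Lee",
]})

# Assumed example lists; in practice the diseases would come from diseases2.csv
verb_phrases = ["cause of", "more likely", "lead to", "increase the risk"]
disease_names = ["obesity", "heart attack", "cardiovascular disease", "cancer"]

def any_of(phrases):
    """Build a regex that matches any of the phrases literally."""
    return "|".join(re.escape(p) for p in phrases)

# One boolean mask per condition, case-insensitive, missing values count as no match
has_verb = sentences["STORY"].str.contains(any_of(verb_phrases), case=False, na=False)
has_disease = sentences["STORY"].str.contains(any_of(disease_names), case=False, na=False)

# A sentence is labeled 1 only if both masks are True
sentences["useful/unuseful"] = (has_verb & has_disease).astype(int)
print(sentences["useful/unuseful"].tolist())
```

Escaping each phrase with `re.escape` keeps characters such as `*` or `(` in a disease name from being interpreted as regex syntax.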

Tags: csv, python, pandas

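1 Answer

Stack Overflow user

Answered on 2019-12-25 20:17:30

I wrote something like this and it works. In this code I take the words from check_list, so in your case you can build check_list from diseases['Lists'].values.tolist() plus the other strings.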

import pandas as pd

data = pd.read_csv("sentences.csv")  # assumed: loaded as in the question

# '*' is a regex metacharacter, so temporarily swap the italic/bold markers for '#'.
data['STORY'] = data['STORY'].str.replace('*', '#', regex=False)

check_list = ['obesity', 'heart attacks', 'death', 'cause of']  # phrases that set the label to 1

def is_contains(df, check_list):
    df['useful/unuseful'] = 0  # default label; also guarantees the column exists
    for check in check_list:
        first_check = f'#{check}#'     # phrase marked as italic (*...*)
        second_check = f'##{check}##'  # phrase marked as bold (**...**)
        df.loc[df['STORY'].str.contains(first_check, na=False), 'useful/unuseful'] = 1
        df.loc[df['STORY'].str.contains(second_check, na=False), 'useful/unuseful'] = 1
    # Restore the original markers (on df, not on the global data)
    df['STORY'] = df['STORY'].str.replace('#', '*', regex=False)

is_contains(data, check_list)

The output is:

                                               STORY  useful/unuseful
0   Overeating is the *leading cause of* **obesity**                1
1  -It is also found that men are more likely *to...                1
2  -Heart attacks are also the second *cause of* ...                1
3  -According to a statistics released by the Min...                0
4  -A talk about heart attack is presented by Doc...                0
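
The answer mentions building check_list from diseases['Lists'].values.tolist() plus other strings. A minimal sketch of that step might look like the following; the in-memory `diseases` frame stands in for `pd.read_csv("diseases2.csv")`, and both its contents and the `extra_phrases` list are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("diseases2.csv")
diseases = pd.DataFrame({"Lists": ["obesity", "heart attacks", None, " obesity "]})

# Assumed extra (verb) phrases to look for in addition to the diseases
extra_phrases = ["cause of", "more likely"]

# Drop missing entries and surrounding whitespace before converting to a list
disease_terms = diseases["Lists"].dropna().str.strip().tolist()

# Concatenate and remove duplicates while keeping the original order
check_list = list(dict.fromkeys(disease_terms + extra_phrases))
print(check_list)
```

Dropping missing values first matters because a NaN entry in the list would otherwise break the string formatting or pattern building later on.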

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/59477532
