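I have a csv file (sentences.csv) containing a list of sentences, and I want to automatically flag the sentences that contain two things: a verb phrase, especially but not only one of (cause of, get, can lead to, increase the risk, the reason for it, most likely), and a disease from a list stored in a csv file named diseases2.csv. If both are present, the sentence should be flagged 1, otherwise 0, in a new column of the csv file itself.
Here is the code I have so far: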
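import re
import numpy as np
import pandas as pd

file = pd.read_csv("sentences.csv")
diseases = pd.read_csv("diseases2.csv")
# Escape each disease name so regex metacharacters are matched literally
pattern = '|'.join(re.escape(str(d)) for d in diseases['Lists'].dropna())
file["useful/unuseful"] = np.where(file["STORY"].str.contains(pattern, na=False), 1, 0)
file.to_csv("sentences.csv", index=False)

These are sample sentences from sentences.csv (italics are the verbs, bold is the disease):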
STORY
Overeating is the leading cause of obesity
-It is also found that men are more likely to experience heart attacks than women
-Heart attacks are also the second cause of death among cardiovascular diseases.
-According to a statistics released by the Ministry of Health, cardiovascular diseases are the second leading cause of death in UK after cancer.
-A talk about heart attack is presented by Doctor Lee

The code should produce these results in sentences.csv:
STORY    useful/unuseful
Overeating is the leading cause of obesity    1
It is also found that men are more likely to experience heart attacks than women    1
Heart attacks are also the second cause of death among cardiovascular diseases.    1
According to a statistics released by the Ministry of Health, cardiovascular diseases are the second leading cause of death in UK after cancer.    1
A talk about heart attack is presented by Doctor Lee    0

Answered on 2019-12-25 20:17:30
I wrote something like this and it works. In this code the words to look for come from check_list, so in your case you can build check_list from diseases['Lists'].values.tolist() plus the other strings.
import pandas as pd

data = pd.read_csv("sentences.csv")  # the question's file, with *italic* and **bold** markers in STORY

# '*' is a special character for regex, so swap it for '#' temporarily.
data['STORY'] = data['STORY'].str.replace('*', '#', regex=False)
check_list = ['obesity', 'heart attacks', 'death', 'cause of']  # list of strings that will change the label

def is_contains(df, check_list):
    for check in check_list:
        first_check = f'#{check}#'      # term marked as italic, e.g. *cause of*
        second_check = f'##{check}##'   # term marked as bold, e.g. **obesity**
        df.loc[df['STORY'].str.contains(first_check, na=False), 'useful/unuseful'] = 1
        df.loc[df['STORY'].str.contains(second_check, na=False), 'useful/unuseful'] = 1
    # Rows that matched nothing are still NaN: label them 0
    df['useful/unuseful'] = df['useful/unuseful'].fillna(0).astype(int)
    # Restore the original '*' markers
    df['STORY'] = df['STORY'].str.replace('#', '*', regex=False)

is_contains(data, check_list)

The output is:
STORY useful/unuseful
0 Overeating is the *leading cause of* **obesity** 1
1 -It is also found that men are more likely *to... 1
2 -Heart attacks are also the second *cause of* ... 1
3 -According to a statistics released by the Min... 0
4 -A talk about heart attack is presented by Doc... 0

https://stackoverflow.com/questions/59477532
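
The rule described in the question (label 1 only when a sentence contains both a verb phrase and a disease) can also be checked on plain text, without the emphasis markers. The following is a minimal sketch, not the answerer's method: the helper names `build_pattern` and `label_sentences` are made up for illustration, and both term lists are assumed sample values standing in for the question's phrase list and for the contents of diseases2.csv.

```python
import re

def build_pattern(terms):
    """Compile literal terms into one case-insensitive, whole-word regex."""
    escaped = sorted(
        (re.escape(t.strip()) for t in terms if t.strip()),
        key=len,
        reverse=True,  # try longer terms first, e.g. "heart attacks" before "heart attack"
    )
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def label_sentences(sentences, verb_phrases, diseases):
    """Return 1 for each sentence that has a verb phrase AND a disease, else 0."""
    verb_re = build_pattern(verb_phrases)
    disease_re = build_pattern(diseases)
    return [
        int(bool(verb_re.search(s)) and bool(disease_re.search(s)))
        for s in sentences
    ]

# Assumed sample terms; the real ones would come from the question's phrase
# list and from diseases["Lists"].
verb_phrases = ["cause of", "more likely", "can lead to", "increase the risk"]
diseases = ["obesity", "heart attack", "heart attacks",
            "cardiovascular diseases", "cancer"]

sentences = [
    "Overeating is the leading cause of obesity",
    "It is also found that men are more likely to experience heart attacks than women",
    "Heart attacks are also the second cause of death among cardiovascular diseases.",
    "According to a statistics released by the Ministry of Health, cardiovascular "
    "diseases are the second leading cause of death in UK after cancer.",
    "A talk about heart attack is presented by Doctor Lee",
]

labels = label_sentences(sentences, verb_phrases, diseases)
print(labels)  # [1, 1, 1, 1, 0]
```

With pandas, the same labels could be written back with something like `file["useful/unuseful"] = label_sentences(file["STORY"].fillna("").astype(str), verb_phrases, diseases)`. Requiring both patterns is what keeps the last sentence at 0: it names a disease but contains none of the verb phrases.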