I have the following data:
Sentence
0 Cat is a big lion
1 Dogs are descendants of wolf
2 Elephants are pachyderm
3 Pachyderm animals include rhino, Elephants and hippopotamus
I need to write Python code that looks at the words in the sentences above and sums the score of each word, based on the following separate DataFrame:
Name Score
cat 1
dog 2
wolf 2
lion 3
elephants 5
rhino 4
hippopotamus 5
For example, for row 0 the score is 1 (cat) + 3 (lion) = 4.
I would like to produce output like the following:
Sentence Value
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4
2 Elephants are pachyderm 5
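3 Pachyderm animals include rhino, Elephants and hippopotamus 14

Answered 2018-09-10 19:40:40
You can start with a split- and map-based approach: split each sentence into words, map each word to its score, then sum the scores per original row with a groupby on the first index level.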
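# One lowercase word per row, indexed by (sentence row, word position)
v = df1['Sentence'].str.split(r'[\s.!?,]+', expand=True).stack().str.lower()
df1['Value'] = (
    v.map(df2.set_index('Name')['Score'])  # words without a score become NaN
     .groupby(level=0).sum()               # replaces sum(level=0), removed in pandas 2.0
     .astype(int))                         # NaN is skipped by sum, so the totals are whole numbers
df1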
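The snippet above assumes `df1` and `df2` already exist. As a rough end-to-end sketch, the version below builds both frames from the question's data and uses `explode` in place of `expand=True` plus `stack`, which is a substitution on my part rather than what the answer wrote. It also lists `dogs` instead of `dog` in `df2`, following the note in the output below.

```python
import pandas as pd

# Sentences from the question
df1 = pd.DataFrame({"Sentence": [
    "Cat is a big lion",
    "Dogs are descendants of wolf",
    "Elephants are pachyderm",
    "Pachyderm animals include rhino, Elephants and hippopotamus",
]})

# Scores from the question, with 'dog' changed to 'dogs' (assumption taken
# from the answer's "s/dog/dogs" note, so that "Dogs" finds a match)
df2 = pd.DataFrame({
    "Name": ["cat", "dogs", "wolf", "lion", "elephants", "rhino", "hippopotamus"],
    "Score": [1, 2, 2, 3, 5, 4, 5],
})

lookup = df2.set_index("Name")["Score"]

# One lowercase word per row; the original sentence index is repeated
words = df1["Sentence"].str.lower().str.split(r"[\s.!?,]+").explode()

# Unknown words map to NaN, which sum() skips
df1["Value"] = words.map(lookup).groupby(level=0).sum().astype(int)
print(df1)
```

With the question's original `dog` entry, row 1 would come out as 2 rather than 4, because the exact word `dogs` is not in the lookup.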
Sentence Value
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4 # s/dog/dogs in df2
2 Elephants are pachyderm 5
3 Pachyderm animals include rhino, Elephants and... 14

Answered 2018-09-10 19:50:14
nltk
You may need to download some resources first:
import nltk
nltk.download('punkt')
Then set up the stemmer and the tokenizer:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
Build a handy dictionary that maps each stemmed name to its score:
m = dict(zip(map(ps.stem, scores.Name), scores.Score))
Compute the scores:
def f(s):
    # Tokenize, stem each token, look it up in m, and drop misses (None) before summing
    return sum(filter(None, map(m.get, map(ps.stem, word_tokenize(s)))))
df.assign(Score=[*map(f, df.Sentence)])
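The one-liner in `f` packs several steps together, and the key idea is that the same stemming is applied to both the score names and the sentence words, so `Dogs` and `dog` meet in the middle. The sketch below unrolls that idea in plain Python. To keep it runnable without nltk, `crude_stem` and the `re.findall` tokenizer are hypothetical stand-ins for `PorterStemmer.stem` and `word_tokenize`, and they are far less capable than the real ones.

```python
import re

scores = {"cat": 1, "dog": 2, "wolf": 2, "lion": 3,
          "elephants": 5, "rhino": 4, "hippopotamus": 5}

sentences = [
    "Cat is a big lion",
    "Dogs are descendants of wolf",
    "Elephants are pachyderm",
    "Pachyderm animals include rhino, Elephants and hippopotamus",
]


def crude_stem(word):
    """Hypothetical stand-in for PorterStemmer.stem: lowercase, drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 1 and word.endswith("s") else word


# Stem the lookup keys once, e.g. 'elephants' is stored as 'elephant'
stemmed_scores = {crude_stem(name): score for name, score in scores.items()}


def score_sentence(sentence):
    total = 0
    # Runs of letters stand in for word_tokenize here (an assumption)
    for token in re.findall(r"[A-Za-z]+", sentence):
        total += stemmed_scores.get(crude_stem(token), 0)  # misses count as 0
    return total


values = [score_sentence(s) for s in sentences]
print(values)
```

Because both sides go through the same function, it does not matter that the crude stems are not real words; they only have to agree with each other.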
Sentence Score
0 Cat is a big lion 4
1 Dogs are descendants of wolf 4
2 Elephants are pachyderm 5
3 Pachyderm animals include rhino, Elephants and... 14

Answered 2018-09-10 20:32:47
Try findall combined with the re.I (ignore case) flag:
import re
# Here df holds the sentences and df1 holds the Name/Score table
df.Sentence.str.findall(df1.Name.str.cat(sep='|'), flags=re.I).\
    map(lambda x: sum([df1.loc[df1.Name == str.lower(y), 'Score'].values for y in x])[0])
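Because the pattern is a plain alternation of names, `findall` matches substrings rather than whole words, which is why `Dogs` is counted as `dog` here without changing the score table. A minimal plain-Python sketch of the same lookup, using only the standard `re` module, is below. The dictionary and list are just the question's data retyped, and the `re.escape` call is my own addition for safety.

```python
import re

scores = {"cat": 1, "dog": 2, "wolf": 2, "lion": 3,
          "elephants": 5, "rhino": 4, "hippopotamus": 5}

sentences = [
    "Cat is a big lion",
    "Dogs are descendants of wolf",
    "Elephants are pachyderm",
    "Pachyderm animals include rhino, Elephants and hippopotamus",
]

# Same idea as df1.Name.str.cat(sep='|'): one alternation of all names
pattern = re.compile("|".join(map(re.escape, scores)), flags=re.I)

# Lowercase each hit so it can be used as a key in scores
values = [
    sum(scores[hit.lower()] for hit in pattern.findall(sentence))
    for sentence in sentences
]
print(values)

# Substring matching cuts both ways: 'Dogma' also contains 'dog'
print(pattern.findall("Dogma"))
```

Wrapping each name in `\b` word boundaries would avoid hits inside unrelated words, but then `Dogs` would no longer match `dog`.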
Out[49]:
0 4
1 4
2 5
3 14
Name: Sentence, dtype: int64

https://stackoverflow.com/questions/52264354