I have two DataFrames. DF1 looks like this:
DF1:
index posts type
-----------------------------------------------------------
0 know intj tool use interaction people excuse a... INTJ
1 rap music ehh opp yeah know valid well know fa... INTJ
2 preferably p hd low except wew lad video p min... INTJ
3 drink like wish could drink red wine give head... INTJ
4 space program ah bad deal meing freelance max ... INTJ
... ... ...
106062 stay frustrate world life want take long nap w... INFP
DF2 looks like this:
DF2:
index word emotion value
--------------------------------
0 aback anger 0
1 aback anticipation 0
2 aback disgust 0
3 aback fear 0
4 aback joy 0
... ... ... ..
141535 zoom negative 0
141536 zoom positive 0
141537 zoom sadness 1
Expected result: a new DataFrame with three columns:
DF1.type
type emotions posts_tok
-------------------------------------
0 INTJ [joy,fear] [know, intj, tool, use, interaction, people, e...
1 INTJ O [rap, music, ehh, opp, yeah, know, valid, well...
2 INTJ [sadness] [preferably, p, hd, low, except, wew, lad, vid...
3 INTJ O [drink, like, wish, could, drink, red, wine, g...
4 INTJ O [space, program, ah, bad, deal, meing, freelan...
... ... ... ...
106062 INFP [disgust, anger, fear] [stay, frustrate, world, life, want, take, lon...
My attempt:
common_set=[]
common_emo=[]
#iter over each row in DF1
for key ,valuepost in DF1.iterrows():
    #split the value in the current row
    listvalues=valuepost['posts'].split()
    #iter over the list of the splitted value
    for listvalue in listvalues:
        #iter over each emotion in DF2.emotion
        for _, valueemo in DF2.emotion.items():
            # if emotion word matches with an element of the list of listvalue, append it else append a default value
            if valueemo == listvalue:
                common_emo.append(valueemo)
            else:
                common_emo.append('O')
        common_set.append({'posts_w':listvalue,'emovalue':common_emo,'type':valuepost['type']})
perso_emo_df=pd.DataFrame(common_set)
Result obtained:
It fails.
Request:
Can you suggest an optimized way to get the same result?
Thank you very much.
Posted on 2022-07-07 12:04:33
I defined the two source DataFrames as:
posts type
0 know intj tool interaction people joy fear excuse INTJ
1 rap music opp yeah know valid well know INTJ
2 preferably sadness except wew lad video INTJ
3 drink like wish could drink red wine INTJ
4 space program ah bad deal meing freelance INTJ
5 stay frustrate world take disgust anger fear INFP
and
token emotion value
0 aback anger 1
1 aback anticipation 2
2 aback disgust 3
3 aback fear 4
4 aback joy 5
5 zoom negative 11
6 zoom positive 12
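7 zoom sadness 13
The first step is to create an additional column (Word) containing the posts column converted to a list of words:
DF1['Word'] = DF1.posts.str.split()
Then overwrite the posts column with the same content:
DF1['posts'] = DF1.posts.str.split()
The same content is kept in two columns because one of them is about to be exploded, while the other should stay as it is for further processing.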
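To illustrate why both columns are kept, here is a minimal sketch of what explode does to them. The two-row DF1 below is a made-up stand-in for the real data, used only for illustration.

```python
import pandas as pd

# Toy stand-in for DF1 (two short posts, not the real data)
DF1 = pd.DataFrame({
    "posts": ["know intj joy fear", "rap music"],
    "type": ["INTJ", "INTJ"],
})

# Same two assignments as in the answer: both columns hold the word list
DF1["Word"] = DF1.posts.str.split()
DF1["posts"] = DF1.posts.str.split()

# explode() emits one row per list element of Word, repeats the original
# index label, and leaves the list in posts unchanged on every row
exploded = DF1.explode(column="Word")
print(exploded)
```

Each exploded row still carries the full word list in posts, which is what a later per-group step can read back without reassembling it.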
The next step is to create a "working copy" of DF2, with the emotion column as the index, but renamed to Word:
wrk2 = DF2.set_index('emotion')
wrk2.index.name = 'Word'
The result is:
token value
Word
anger aback 1
anticipation aback 2
disgust aback 3
fear aback 4
joy aback 5
negative zoom 11
positive zoom 12
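sadness zoom 13
Then "explode" DF1 on the Word column, join it with wrk2, and save the result as wrk1 (another temporary DataFrame):
wrk1 = DF1.explode(column='Word').join(wrk2, on='Word')
The result will shortly be grouped by the original index, and each group will be processed by the following function: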
def proc(grp):
    emo = grp.dropna().Word
    row = grp.iloc[0]
    return pd.Series([row.type, emo.tolist() if emo.size > 0 else 'O', row.posts],
        index=['type', 'emotions', 'posts_tok'])
To get the expected result, run:
result = wrk1.groupby(level=0).apply(proc)
For my source data, the result is:
type emotions posts_tok
0 INTJ [joy, fear] [know, intj, tool, interaction, people, joy, f...
1 INTJ O [rap, music, opp, yeah, know, valid, well, know]
2 INTJ [sadness] [preferably, sadness, except, wew, lad, video]
3 INTJ O [drink, like, wish, could, drink, red, wine]
4 INTJ O [space, program, ah, bad, deal, meing, freelance]
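5 INFP [disgust, anger, fear] [stay, frustrate, world, take, disgust, anger,...
Edit: The problem with my initial solution is that the number of rows in the join result is the size of DF1 times (roughly) the average number of words in a post.
So I came up with another solution: apply a function to each row.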
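To make the row-count concern concrete, here is a minimal sketch that compares the number of posts with the number of rows after exploding. The three posts are taken from the sample data above. The variable names are only illustrative.

```python
import pandas as pd

# Three posts from the sample data in this answer
posts = pd.Series([
    "know intj tool interaction people joy fear excuse",
    "rap music opp yeah know valid well know",
    "preferably sadness except wew lad video",
])

words = posts.str.split()            # one list of words per post
n_posts = len(words)                 # rows before exploding
n_exploded = len(words.explode())    # one row per word after exploding
avg_words = words.str.len().mean()   # average number of words per post

# Exploded row count equals number of posts times average words per post
print(n_posts, n_exploded, n_posts * avg_words)
```

With roughly 106,000 posts, as in the question, the exploded frame grows by the same factor, which is why a per-row function can be the lighter option.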
First, convert the posts column from a single string to a list:
DF1.posts = DF1.posts.str.split()
Then build a set of emotions from DF2:
emos = set(DF2.emotion)
Then define a function to be applied to each row:
def rowProc(row, emos, defVal):
    matched = list(set(row.posts) & emos)
    return pd.Series([row.type, matched if len(matched) > 0 else defVal,
        row.posts], index=['type', 'emotions', 'posts_tok'])
To get the result, run:
result = DF1.apply(rowProc, axis=1, args=(emos, 'O'))
https://stackoverflow.com/questions/72894421
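For readers who want to try the row-wise approach end to end, here is a self-contained sketch on a reduced copy of the sample data. It is a variant and not the answer's exact code: the function row_proc_ordered is a hypothetical name, it splits the post itself instead of relying on a prior conversion of the posts column, and it keeps matched emotions in order of first appearance. A set intersection, as used in rowProc, does not guarantee any order.

```python
import pandas as pd

# Reduced copies of the sample frames used in this answer
DF1 = pd.DataFrame({
    "posts": [
        "know intj tool interaction people joy fear excuse",
        "rap music opp yeah know valid well know",
        "preferably sadness except wew lad video",
    ],
    "type": ["INTJ", "INTJ", "INTJ"],
})
DF2 = pd.DataFrame({
    "token": ["aback", "aback", "zoom"],
    "emotion": ["joy", "fear", "sadness"],
    "value": [5, 4, 13],
})

# Set of emotion labels for fast membership tests
emos = set(DF2.emotion)

def row_proc_ordered(row, emos, def_val):
    """Hypothetical variant of rowProc that preserves word order."""
    tokens = row["posts"].split()
    # dict.fromkeys drops duplicates while keeping first-seen order
    matched = list(dict.fromkeys(t for t in tokens if t in emos))
    return pd.Series(
        [row["type"], matched if matched else def_val, tokens],
        index=["type", "emotions", "posts_tok"],
    )

result = DF1.apply(row_proc_ordered, axis=1, args=(emos, "O"))
print(result)
```

The membership test against a set keeps the cost per post proportional to its number of words, regardless of how many rows DF2 has.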