Q: Merging two DataFrames based on an if condition

Stack Overflow user

Asked 2022-07-07 08:14:25

Answers: 1, Views: 56, Followers: 0, Votes: -1

I have two DataFrames. DF1 looks like this:

DF1:

index    posts                                             type
-----------------------------------------------------------
0   know intj tool use interaction people excuse a...   INTJ
1   rap music ehh opp yeah know valid well know fa...   INTJ
2   preferably p hd low except wew lad video p min...   INTJ
3   drink like wish could drink red wine give head...   INTJ
4   space program ah bad deal meing freelance max ...   INTJ
...     ...     ...
106062  stay frustrate world life want take long nap w...   INFP

DF2 looks like this:

DF2:
 index  word    emotion     value
--------------------------------
0   aback   anger   0
1   aback   anticipation    0
2   aback   disgust     0
3   aback   fear    0
4   aback   joy     0
... ...    ...      ..
141535  zoom    negative    0
141536  zoom    positive    0
141537  zoom    sadness     1

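Expected result: a new DataFrame with 3 columns:

type: the type from DF1.type
emotions: a list of emotions from DF2. If DF2.word is contained in DF1.posts_tok, the emotion is entered, otherwise 'O'
posts_tok: the DF1 posts, split into tokens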

    type emotions    posts_tok
-------------------------------------                                        
0   INTJ [joy,fear] [know, intj, tool, use, interaction, people, e...
1   INTJ  O         [rap, music, ehh, opp, yeah, know, valid, well...
2   INTJ [sadness]  [preferably, p, hd, low, except, wew, lad, vid...
3   INTJ  O         [drink, like, wish, could, drink, red, wine, g...
4   INTJ  O         [space, program, ah, bad, deal, meing, freelan...
...     ...     ...     ...
106062 INFP [disgust, anger, fear] [stay, frustrate, world, life, want, take, lon...

My attempt:

common_set=[]
common_emo=[]
#iter over each row in DF1
for key ,valuepost in DF1.iterrows():
    
    #split the value in the current row
    listvalues=valuepost['posts'].split()
    #iter over the list of the splitted value
    for listvalue in listvalues:
        #iter over each emotion in DF2.emotion
        for _, valueemo in DF2.emotion.items():
        # if emotion word matches with an element of the list of listvalue, append it else append a default value
            if valueemo == listvalue:
                common_emo.append(valueemo)
            else:
                common_emo.append('O')
    common_set.append({'posts_w':listvalue,'emovalue':common_emo,'type':valuepost['type']})

perso_emo_df=pd.DataFrame(common_set)

Result obtained:

It fails because it runs out of memory.

Request

Could you suggest an optimized way to get the same result?

Thanks a lot

pandas

dataframe

python

1 Answer

Stack Overflow user

Answered 2022-07-07 12:04:33

I defined the two source DataFrames as:

                                               posts  type
0  know intj tool interaction people joy fear excuse  INTJ
1            rap music opp yeah know valid well know  INTJ
2            preferably sadness except wew lad video  INTJ
3               drink like wish could drink red wine  INTJ
4          space program ah bad deal meing freelance  INTJ
5       stay frustrate world take disgust anger fear  INFP

and

   token       emotion  value
0  aback         anger      1
1  aback  anticipation      2
2  aback       disgust      3
3  aback          fear      4
4  aback           joy      5
5   zoom      negative     11
6   zoom      positive     12
7   zoom       sadness     13

The first step is to create an additional column (Word) containing the posts column converted to a list of words:

DF1['Word'] = DF1.posts.str.split()

and to overwrite the posts column with the same content:

DF1['posts'] = DF1.posts.str.split()

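The reason for keeping the same content in two columns is that one of them will shortly be exploded, while the other should stay as it is for further processing.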
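To illustrate why both columns are kept, here is a minimal sketch on made-up rows (the data and the names `df` and `exploded` are assumptions for illustration, not the asker's data). Exploding Word produces one row per word and repeats the original index, while posts still holds the full list in every row:

```python
import pandas as pd

# Tiny stand-in for DF1 (made-up rows, not the asker's data)
df = pd.DataFrame({
    "posts": ["joy and fear", "red wine"],
    "type": ["INTJ", "INFP"],
})
df["Word"] = df.posts.str.split()   # this column will be exploded
df["posts"] = df.posts.str.split()  # this column keeps the whole list

# One output row per element of Word; the source index is repeated
exploded = df.explode(column="Word")
print(exploded)
```

The repeated index is what later makes it possible to group the exploded rows back into one row per post.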

The next step is to create a "working copy" of DF2 with the emotion column set as the index, but with the index renamed to Word.

wrk2 = DF2.set_index('emotion')
wrk2.index.name = 'Word'

The result is:

              token  value
Word                      
anger         aback      1
anticipation  aback      2
disgust       aback      3
fear          aback      4
joy           aback      5
negative       zoom     11
positive       zoom     12
sadness        zoom     13

Then explode DF1 on the Word column, join it with wrk2, and save the result as wrk1 (another temporary DataFrame):

wrk1 = DF1.explode(column='Word').join(wrk2, on='Word')
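As a rough sketch of what this join produces, the snippet below uses two small made-up frames (`left` and `right` are hypothetical stand-ins for DF1 and wrk2). Words with no match in the right-hand index get NaN in the joined columns, so dropping rows with NaN leaves only the matched words:

```python
import pandas as pd

# Stand-in for DF1 after the split step (made-up data)
left = pd.DataFrame({
    "posts": [["joy", "and", "fear"], ["red", "wine"]],
    "Word": [["joy", "and", "fear"], ["red", "wine"]],
})
# Stand-in for wrk2: the index is named Word, like the key column
right = pd.DataFrame(
    {"token": ["aback", "aback"], "value": [4, 5]},
    index=pd.Index(["fear", "joy"], name="Word"),
)

# Left join of each exploded word against the index of `right`
joined = left.explode(column="Word").join(right, on="Word")

# Rows whose word was not found have NaN in token/value
matched = joined.dropna()
print(joined)
print(matched)
```

Here only the words of the first post survive `dropna()`, which is the property the grouping step relies on.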

wrk1 will then be grouped by its index, and each group will be processed by the following function:

def proc(grp):
    emo = grp.dropna().Word
    row = grp.iloc[0]
    return pd.Series([row.type, emo.tolist() if emo.size > 0 else 'O', row.posts],
        index=['type', 'emotions', 'posts_tok'])

To get the expected result, run:

result = wrk1.groupby(level=0).apply(proc)

For my source data, the result is:

   type                emotions                                          posts_tok
0  INTJ             [joy, fear]  [know, intj, tool, interaction, people, joy, f... 
1  INTJ                       O   [rap, music, opp, yeah, know, valid, well, know] 
2  INTJ               [sadness]     [preferably, sadness, except, wew, lad, video] 
3  INTJ                       O       [drink, like, wish, could, drink, red, wine] 
4  INTJ                       O  [space, program, ah, bad, deal, meing, freelance] 
5  INFP  [disgust, anger, fear]  [stay, frustrate, world, take, disgust, anger,...

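Edit: The problem with my original solution is that the number of rows in the join result is the size of DF1 multiplied by (roughly) the average number of words in a post.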
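A quick way to check this estimate is to count words before exploding anything. The sketch below uses three made-up posts (the `posts` Series is an assumption for illustration):

```python
import pandas as pd

# Three made-up posts standing in for DF1.posts
posts = pd.Series(["joy and fear", "red wine", "stay frustrate world take"])

# Number of words in each post
words_per_post = posts.str.split().str.len()

# Exploding yields one row per word, so the row count is the total word count
exploded_rows = len(posts.str.split().explode())

print(exploded_rows)                        # total number of exploded rows
print(len(posts) * words_per_post.mean())   # size * average words per post
```

If the join key is not unique in the right-hand DataFrame, each matching row is additionally repeated once per duplicate key, so the joined result can grow well beyond this count.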

So I came up with another solution: applying a function to each row.

First, convert the posts column from a single string to a list:

DF1.posts = DF1.posts.str.split()

Then define a set of emotions from DF2:

emos = set(DF2.emotion)

Then define a function that will be applied to each row:

def rowProc(row, emos, defVal):
    matched = list(set(row.posts) & emos)
    return pd.Series([row.type, matched if len(matched) > 0 else defVal,
        row.posts],  index=['type', 'emotions', 'posts_tok'])

To get the result, run:

result = DF1.apply(rowProc, axis=1, args=(emos, 'O'))
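One thing to keep in mind: Python sets are unordered, so `list(set(row.posts) & emos)` does not necessarily return the emotions in the order they appear in the post. If that order matters, a possible variant is sketched below. The helper `match_in_order`, the sample rows, and the hard-coded `emos` set are all assumptions for illustration. It also builds the result columns directly instead of creating one Series per row, which may reduce overhead on a large DF1, although that has not been measured here.

```python
import pandas as pd

DF1 = pd.DataFrame({
    "posts": [
        "know intj tool interaction people joy fear excuse",
        "rap music opp yeah know valid well know",
        "stay frustrate world take disgust anger fear",
    ],
    "type": ["INTJ", "INTJ", "INFP"],
})
# Stand-in for set(DF2.emotion)
emos = {"anger", "anticipation", "disgust", "fear", "joy",
        "negative", "positive", "sadness"}

def match_in_order(words, emos, def_val="O"):
    """Hypothetical helper: matched words in order of first appearance."""
    seen = set()
    matched = []
    for w in words:
        if w in emos and w not in seen:
            seen.add(w)
            matched.append(w)
    # Fall back to the default marker when nothing matched
    return matched if matched else def_val

posts_tok = DF1["posts"].str.split()
result = pd.DataFrame({
    "type": DF1["type"],
    "emotions": pd.Series(
        [match_in_order(ws, emos) for ws in posts_tok], index=DF1.index
    ),
    "posts_tok": posts_tok,
})
print(result)
```

Membership tests against a set take constant time on average, so the work per post stays proportional to the number of words in it.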

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/72894421
