Merging two DataFrames based on an if/on condition

Stack Overflow user
Asked 2022-07-07 08:14:25
1 answer, 56 views, 0 followers, score -1

I have two DataFrames. DF1 looks like this:

DF1:

index    posts                                             type
-----------------------------------------------------------
0   know intj tool use interaction people excuse a...   INTJ
1   rap music ehh opp yeah know valid well know fa...   INTJ
2   preferably p hd low except wew lad video p min...   INTJ
3   drink like wish could drink red wine give head...   INTJ
4   space program ah bad deal meing freelance max ...   INTJ
...     ...     ...
106062  stay frustrate world life want take long nap w...   INFP 

DF2 looks like this:

DF2:
 index  word    emotion     value
--------------------------------
0   aback   anger   0
1   aback   anticipation    0
2   aback   disgust     0
3   aback   fear    0
4   aback   joy     0
... ...    ...      ..
141535  zoom    negative    0
141536  zoom    positive    0
141537  zoom    sadness     1

Expected result: a new DataFrame with 3 columns:

  • type: taken from DF1.type
  • emotions: the list of emotions from DF2 whose DF2.word is contained in DF1.posts_tok, otherwise 'O'
  • posts_tok: the split DF1.posts

    type emotions    posts_tok
-------------------------------------                                        
0   INTJ [joy,fear] [know, intj, tool, use, interaction, people, e...
1   INTJ  O         [rap, music, ehh, opp, yeah, know, valid, well...
2   INTJ [sadness]  [preferably, p, hd, low, except, wew, lad, vid...
3   INTJ  O         [drink, like, wish, could, drink, red, wine, g...
4   INTJ  O         [space, program, ah, bad, deal, meing, freelan...
...     ...     ...     ...
106062 INFP [disgust, anger, fear] [stay, frustrate, world, life, want, take, lon...

My attempt:

common_set=[]
common_emo=[]
#iter over each row in DF1
for key ,valuepost in DF1.iterrows():
    
    #split the value in the current row
    listvalues=valuepost['posts'].split()
    #iter over the list of the splitted value
    for listvalue in listvalues:
        #iter over each emotion in DF2.emotion
        for _, valueemo in DF2.emotion.items():
        # if emotion word matches with an element of the list of listvalue, append it else append a default value
            if valueemo == listvalue:
                common_emo.append(valueemo)
            else:
                common_emo.append('O')
    common_set.append({'posts_w':listvalue,'emovalue':common_emo,'type':valuepost['type']})

perso_emo_df=pd.DataFrame(common_set)

Result obtained:

  • It fails with an out-of-memory error.

Request

Can you suggest an optimized way to get the same result?

Many thanks.

1 Answer

Stack Overflow user

Answered 2022-07-07 12:04:33

I defined the two source DataFrames as:

                                               posts  type
0  know intj tool interaction people joy fear excuse  INTJ
1            rap music opp yeah know valid well know  INTJ
2            preferably sadness except wew lad video  INTJ
3               drink like wish could drink red wine  INTJ
4          space program ah bad deal meing freelance  INTJ
5       stay frustrate world take disgust anger fear  INFP

   token       emotion  value
0  aback         anger      1
1  aback  anticipation      2
2  aback       disgust      3
3  aback          fear      4
4  aback           joy      5
5   zoom      negative     11
6   zoom      positive     12
7   zoom       sadness     13

The first step is to create an additional column (Word) containing the posts column converted to a list of words:

DF1['Word'] = DF1.posts.str.split()

and to overwrite the posts column with the same content:

DF1['posts'] = DF1.posts.str.split()

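The reason for keeping the same content in two columns is that one of them will shortly be exploded, while the other should stay as it is for further processing.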
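To see why one copy is kept intact, here is a minimal sketch of what explode does to a list column. The two rows are copied from the sample data above; the names df, exploded and row_counts are only illustrative and do not appear in the answer.

```python
import pandas as pd

# Toy frame; both rows are taken from the sample DF1 shown above.
df = pd.DataFrame({
    "posts": ["preferably sadness except wew lad video",
              "drink like wish could drink red wine"],
    "type": ["INTJ", "INTJ"],
})
df["Word"] = df.posts.str.split()    # list column that will be exploded
df["posts"] = df.posts.str.split()   # list column that stays intact

# explode emits one row per list element and repeats the original index label
exploded = df.explode("Word")
print(exploded[["Word", "type"]])

# Number of rows produced for each original row (6 and 7 words here)
row_counts = exploded.index.value_counts().sort_index().tolist()
print(row_counts)
```

Every exploded row still carries the full token list in posts, which is what the later grouping step reads back.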

The next step is to create a "working copy" of DF2, with the emotion column set as the index but renamed to Word.

wrk2 = DF2.set_index('emotion')
wrk2.index.name = 'Word'

The result is:

              token  value
Word                      
anger         aback      1
anticipation  aback      2
disgust       aback      3
fear          aback      4
joy           aback      5
negative       zoom     11
positive       zoom     12
sadness        zoom     13

Then explode DF1 on the Word column, join it with wrk2, and save the result as wrk1 (another temporary DataFrame):

wrk1 = DF1.explode(column='Word').join(wrk2, on='Word')

This DataFrame will shortly be grouped, and each group will be processed by the following function:

def proc(grp):
    emo = grp.dropna().Word
    row = grp.iloc[0]
    return pd.Series([row.type, emo.tolist() if emo.size > 0 else 'O', row.posts],
        index=['type', 'emotions', 'posts_tok'])

To get the expected result, run:

result = wrk1.groupby(level=0).apply(proc)

For my source data, the result is:

   type                emotions                                          posts_tok
0  INTJ             [joy, fear]  [know, intj, tool, interaction, people, joy, f... 
1  INTJ                       O   [rap, music, opp, yeah, know, valid, well, know] 
2  INTJ               [sadness]     [preferably, sadness, except, wew, lad, video] 
3  INTJ                       O       [drink, like, wish, could, drink, red, wine] 
4  INTJ                       O  [space, program, ah, bad, deal, meing, freelance] 
5  INFP  [disgust, anger, fear]  [stay, frustrate, world, take, disgust, anger,...
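For convenience, the steps of this first approach can be assembled into one script. This is only a sketch that rebuilds the six-row and eight-row sample frames from the tables above; it uses row["type"] rather than row.type purely as a defensive spelling, and nothing else is changed.

```python
import pandas as pd

# Sample frames rebuilt from the tables shown above
DF1 = pd.DataFrame({
    "posts": [
        "know intj tool interaction people joy fear excuse",
        "rap music opp yeah know valid well know",
        "preferably sadness except wew lad video",
        "drink like wish could drink red wine",
        "space program ah bad deal meing freelance",
        "stay frustrate world take disgust anger fear",
    ],
    "type": ["INTJ"] * 5 + ["INFP"],
})
DF2 = pd.DataFrame({
    "token": ["aback"] * 5 + ["zoom"] * 3,
    "emotion": ["anger", "anticipation", "disgust", "fear", "joy",
                "negative", "positive", "sadness"],
    "value": [1, 2, 3, 4, 5, 11, 12, 13],
})

DF1["Word"] = DF1.posts.str.split()    # copy to explode
DF1["posts"] = DF1.posts.str.split()   # copy to keep as a list

# Working copy of DF2 indexed by emotion, with the index renamed to Word
wrk2 = DF2.set_index("emotion")
wrk2.index.name = "Word"

# One row per word, with token/value filled only where the word is an emotion
wrk1 = DF1.explode(column="Word").join(wrk2, on="Word")

def proc(grp):
    emo = grp.dropna().Word            # words that found a match in wrk2
    row = grp.iloc[0]                  # any row of the group carries type and posts
    return pd.Series(
        [row["type"], emo.tolist() if emo.size > 0 else "O", row["posts"]],
        index=["type", "emotions", "posts_tok"])

result = wrk1.groupby(level=0).apply(proc)
print(result)
```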

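Edit: The problem with my initial solution is that the number of rows in the join result is (roughly) the size of DF1 multiplied by the average number of words per post.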
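The row-count estimate can be checked on a few sample posts. The sketch below also includes a second case that is an assumption rather than something the answer discusses: when the right-hand index contains repeated labels, a join emits one output row per match, which would enlarge the result further. The lexicon rows in that second part are made up for illustration.

```python
import pandas as pd

posts = pd.Series([
    "know intj tool interaction people joy fear excuse",
    "rap music opp yeah know valid well know",
    "preferably sadness except wew lad video",
])
tokens = posts.str.split()
words_per_post = tokens.str.len()

# Rows after explode = number of posts * average words per post = total words
exploded_rows = len(tokens.explode())
print(len(posts), round(words_per_post.mean(), 2), exploded_rows)

# Assumed extra case: duplicate labels in the right-hand index (made-up rows)
lexicon = pd.DataFrame({"token": ["aback", "zoom"],
                        "emotion": ["fear", "fear"]}).set_index("emotion")
left = pd.DataFrame({"Word": ["fear", "joy"]})
joined = left.join(lexicon, on="Word")
print(len(left), len(joined))   # the single "fear" row is emitted twice
```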

So I came up with another solution: applying a function to each row.

First, convert the posts column from a single string to a list:

DF1.posts = DF1.posts.str.split()

Then build a set of emotions from DF2:

emos = set(DF2.emotion)

Then define a function that will be applied to each row:

def rowProc(row, emos, defVal):
    matched = list(set(row.posts) & emos)
    return pd.Series([row.type, matched if len(matched) > 0 else defVal,
        row.posts],  index=['type', 'emotions', 'posts_tok'])

To get the result, run:

result = DF1.apply(rowProc, axis=1, args=(emos, 'O'))
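
One detail worth noting: a set intersection does not guarantee any particular order, so the matched emotions may not come out in the order they occur in the post. If that order matters, a possible variant is sketched below. The function name row_proc_ordered is hypothetical and not part of the answer, and the sample frames are a three-row subset of the data above.

```python
import pandas as pd

DF1 = pd.DataFrame({
    "posts": [
        "know intj tool interaction people joy fear excuse",
        "rap music opp yeah know valid well know",
        "stay frustrate world take disgust anger fear",
    ],
    "type": ["INTJ", "INTJ", "INFP"],
})
DF2 = pd.DataFrame({
    "token": ["aback"] * 5 + ["zoom"] * 3,
    "emotion": ["anger", "anticipation", "disgust", "fear", "joy",
                "negative", "positive", "sadness"],
    "value": [1, 2, 3, 4, 5, 11, 12, 13],
})

DF1["posts"] = DF1.posts.str.split()
emos = set(DF2.emotion)   # set gives constant-time membership tests

def row_proc_ordered(row, emos, def_val):
    # Keep words in post order; dict.fromkeys drops repeats, keeping first occurrences
    matched = list(dict.fromkeys(w for w in row["posts"] if w in emos))
    return pd.Series(
        [row["type"], matched if matched else def_val, row["posts"]],
        index=["type", "emotions", "posts_tok"])

result = DF1.apply(row_proc_ordered, axis=1, args=(emos, "O"))
print(result)
```

Unlike the explode-and-join approach, this never materialises one row per word, so the intermediate data stays at the size of DF1.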

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/72894421
