I have the following data:
import pandas as pd
df = pd.DataFrame({'cnpj': [410000132, 410000132, 4830624000197, 4830624000197, 4830624000197], 'Nome Pessoa': ['EUGENIO LUPORINI NETO', 'JUAN MATIAS SERAGOPIAN', 'EUGENIO LUPORINI NETO', 'SIMONE FANKHAUSER', 'ALEX SOUZA']})
print(df)
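            cnpj             Nome Pessoa
0      410000132   EUGENIO LUPORINI NETO
1      410000132  JUAN MATIAS SERAGOPIAN
2  4830624000197   EUGENIO LUPORINI NETO
3  4830624000197       SIMONE FANKHAUSER
4  4830624000197              ALEX SOUZA

Each cnpj is a company and each Nome Pessoa is a person. For each person, I want to list the other people who appear under any cnpj that person also appears under (preferably without duplicates). In other words, I want to show how people are related, using cnpj as the key, so that df looks like this (or at least close to it):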
            cnpj             Nome Pessoa  Relations
0      410000132   EUGENIO LUPORINI NETO  ['JUAN MATIAS SERAGOPIAN','SIMONE FANKHAUSER','ALEX SOUZA']
1      410000132  JUAN MATIAS SERAGOPIAN  ['EUGENIO LUPORINI NETO']
2  4830624000197   EUGENIO LUPORINI NETO  ['JUAN MATIAS SERAGOPIAN','SIMONE FANKHAUSER','ALEX SOUZA']
3  4830624000197       SIMONE FANKHAUSER  ['EUGENIO LUPORINI NETO','ALEX SOUZA']
4  4830624000197              ALEX SOUZA  ['EUGENIO LUPORINI NETO','SIMONE FANKHAUSER']

For example, df['Relations'][0] = ['JUAN MATIAS SERAGOPIAN','SIMONE FANKHAUSER','ALEX SOUZA'] because JUAN appears under the same cnpj as EUGENIO (410000132), and SIMONE and ALEX appear together with EUGENIO under another cnpj (4830624000197).
I suppose this could be done with groupby, but I'm not sure how to implement it.
Answered 2020-06-17 14:36:17
You can use apply with a query inside it and attach the result to the DataFrame:
def get_relations(row, df):
    row_cnpj = row['cnpj']
    row_name = row['Nome Pessoa']
    # Other people listed under this row's cnpj only
    # (people from the person's other cnpjs are not included)
    query = df.query('cnpj == @row_cnpj and `Nome Pessoa` != @row_name')
    row['Relations'] = query['Nome Pessoa'].values
    return row

df = df.apply(lambda x: get_relations(x, df), axis=1)

Answered 2020-06-17 14:30:44
The following works:
In[0]:
def add_relations(row):
    current_name = row['Nome Pessoa']
    # All cnpjs under which this person appears
    cnpjs = df.loc[df['Nome Pessoa'] == current_name, 'cnpj']
    # Rows for any of those cnpjs, excluding the person themselves
    mask = df['cnpj'].isin(cnpjs) & (df['Nome Pessoa'] != current_name)
    return list(df.loc[mask, 'Nome Pessoa'])

df['Relations'] = df.apply(add_relations, axis=1)
df
Out[0]:
cnpj Nome Pessoa \
0 410000132 EUGENIO LUPORINI NETO
1 410000132 JUAN MATIAS SERAGOPIAN
2 4830624000197 EUGENIO LUPORINI NETO
3 4830624000197 SIMONE FANKHAUSER
4 4830624000197 ALEX SOUZA
Relations
0 [JUAN MATIAS SERAGOPIAN, SIMONE FANKHAUSER, AL...
1 [EUGENIO LUPORINI NETO]
2 [JUAN MATIAS SERAGOPIAN, SIMONE FANKHAUSER, AL...
3 [EUGENIO LUPORINI NETO, ALEX SOUZA]
4 [EUGENIO LUPORINI NETO, SIMONE FANKHAUSER]

This uses apply, so it is not optimal, but depending on how much data you have it may be good enough.
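If the per-row filtering becomes slow, one possible alternative is to build two lookup tables once and reuse them for every person. This is only a sketch and not part of the original answer; the names `people_by_cnpj`, `cnpjs_by_person` and `relations_for` are hypothetical, and the sample data is copied from the question.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'cnpj': [410000132, 410000132, 4830624000197, 4830624000197, 4830624000197],
    'Nome Pessoa': ['EUGENIO LUPORINI NETO', 'JUAN MATIAS SERAGOPIAN',
                    'EUGENIO LUPORINI NETO', 'SIMONE FANKHAUSER', 'ALEX SOUZA'],
})

# Build both lookups in a single pass over the rows.
people_by_cnpj = defaultdict(list)   # cnpj -> people listed under it
cnpjs_by_person = defaultdict(list)  # person -> cnpjs they appear under
for cnpj, name in zip(df['cnpj'], df['Nome Pessoa']):
    people_by_cnpj[cnpj].append(name)
    cnpjs_by_person[name].append(cnpj)


def relations_for(name):
    """Return everyone who shares at least one cnpj with `name`, without duplicates."""
    seen = {name}  # start with the person themselves so they are never added
    result = []
    for cnpj in cnpjs_by_person[name]:
        for other in people_by_cnpj[cnpj]:
            if other not in seen:
                seen.add(other)
                result.append(other)
    return result


df['Relations'] = df['Nome Pessoa'].map(relations_for)
print(df)
```

Tracking already-seen names in a set also covers the "preferably without duplicates" request from the question, for the case where two people share more than one cnpj.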
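Update: I also tried something with groupby and came up with the following, which works as well. It still feels suboptimal, because it uses groupby twice and has a rather ugly list comprehension. I feel there is a better answer, but it is escaping me...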
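# Map each cnpj to its people, and each person to their cnpjs
num_to_name = df.groupby('cnpj')['Nome Pessoa'].apply(list)
name_to_num = df.groupby('Nome Pessoa')['cnpj'].apply(list)
# Temporarily store each person's list of cnpjs
df['Relations'] = df['Nome Pessoa'].map(name_to_num)
# Concatenate the people of all those cnpjs, dropping the person themselves
df['Relations'] = [[x for x in num_to_name.loc[df.loc[i, 'Relations']].values.sum()
                    if x != df.loc[i, 'Nome Pessoa']] for i in df.index]

Thanks to this answer for help with the above.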
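Another possible way to express the same idea, offered here only as a sketch and not as the answer's own method, is to join the DataFrame with itself on cnpj so that each row pairs a person with someone from the same company. The column name `Nome Pessoa_other` comes from the `suffixes` argument chosen below and is an assumption of this example.

```python
import pandas as pd

df = pd.DataFrame({
    'cnpj': [410000132, 410000132, 4830624000197, 4830624000197, 4830624000197],
    'Nome Pessoa': ['EUGENIO LUPORINI NETO', 'JUAN MATIAS SERAGOPIAN',
                    'EUGENIO LUPORINI NETO', 'SIMONE FANKHAUSER', 'ALEX SOUZA'],
})

# Self-join on cnpj: one row per (person, other person) pair within a company.
pairs = df.merge(df, on='cnpj', suffixes=('', '_other'))

# Remove the rows where a person is paired with themselves.
pairs = pairs[pairs['Nome Pessoa'] != pairs['Nome Pessoa_other']]

# Collect the distinct co-occurring names for each person.
relations = pairs.groupby('Nome Pessoa')['Nome Pessoa_other'].unique().apply(list)

# Attach the result to every row of that person.
# A person who shares no cnpj with anyone would get NaN here.
df['Relations'] = df['Nome Pessoa'].map(relations)
print(df)
```

The self-join creates one row per pair of people within each cnpj, so memory use grows quadratically with company size; for companies with very many people the per-row approaches may be the safer choice.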
https://stackoverflow.com/questions/62430290