假设我们有两个数据帧,称为df,这是我们想要修改的数据框架,而数据帧应该是一个reference data框架;
但是df有时会部分地丢失reference数据中的一些行。
import pandas as pd
df = pd.DataFrame({'gr1f': ['A','A','B','A','B','A','B'],
'gr2p': ['CC','CC','CC','CC','CC','CC','CC'],
'gr3a': ['AL','AL','AL','DEL','DEL','DEL','DEL'],
'DP': [ 'l1yrs', 'l10yrs','l10yrs','l1yrs','l1yrs','l10yrs','l10yrs'],
'num_vals1' :[
1.138044999,
1.221786568,
1.7,
1.159030763,
1.2,
1.5,
1.238341765 ],
'num_vals2' :[
0.166637328,
0.04513741,
0.12,
0.141340498,
0.87,
0.033441602,
0.01233441602]})
df

reference_df = pd.DataFrame({
'gr2p': ['CC','CC','CC','CC','DD','DD','DD','DD'],
'gr3a': ['AL','AL','AL','AL','DEL','DEL','DEL','DEL'],
'DP': ['l1yrs', 'l1yrs', 'l10yrs','l10yrs','l1yrs','l1yrs','l10yrs','l10yrs']})
reference_df

要合并这两个数据框架,我使用了右联接Pandas Merging 101
df1 = df.merge(reference_df[['gr2p','gr3a','DP']], on = ['gr2p','gr3a','DP'], how='right').sort_values(by =['gr1f','gr2p','DP']).reset_index(drop=True)
df1产额

我需要的预期输出
gr1f gr2p gr3a DP num_vals1 num_vals2 source
0 A CC AL l10yrs 1.221787 0.045137 df
1 B CC AL l10yrs 1.700000 0.120000 df
2 A CC AL l1yrs 1.138045 0.166637 df
3 B CC AL l1yrs NaN NaN reference_df
4 A CC DEL l10yrs 1.50000 0.033441602 df
5 B CC DEL l10yrs 1.238342 0.012334 df
6 A CC DEL l1yrs 1.15903 0.1413404 df
7 B CC DEL l1yrs 1.20000 0.87 df
8 A DD DEL l10yrs NaN NaN reference_df
9 B DD DEL l10yrs NaN NaN reference_df
10 A DD DEL l1yrs NaN NaN reference_df
11 B DD DEL l1yrs NaN NaN reference_df如何通过使用gr1f和A组合填充缺失的行以及用'l10yrs,l1yrs‘组合填充DP来实现最后一个表。我认为预期的输出正在显示模式,所以这是一个很好的指南。
发布于 2021-10-20 02:11:47
您可以使用indicator选项。另外,您不是在寻找正确的合并,而是在寻找外部合并:
df1 = (df.merge(reference_df[['gr2p','gr3a','DP']]
.assign(gr1f=['A','B']*(len(reference_df)//2)),
on = ['gr1f','gr2p','gr3a','DP'],
how='outer', indicator=True)
.sort_values(['gr2p','gr3a'])
.assign(source=lambda x: x._merge.map({'both':'df', 'right_only':'ref', 'left_only':'df'}))
.reset_index(drop=True)
)输出:
gr1f gr2p gr3a DP num_vals1 num_vals2 _merge source
0 A CC AL l1yrs 1.138045 0.166637 both df
1 A CC AL l10yrs 1.221787 0.045137 both df
2 B CC AL l10yrs 1.700000 0.120000 both df
3 B CC AL l1yrs NaN NaN right_only ref
4 A CC DEL l1yrs 1.159031 0.141340 left_only df
5 B CC DEL l1yrs 1.200000 0.870000 left_only df
6 A CC DEL l10yrs 1.500000 0.033442 left_only df
7 B CC DEL l10yrs 1.238342 0.012334 left_only df
8 A DD DEL l1yrs NaN NaN right_only ref
9 B DD DEL l1yrs NaN NaN right_only ref
10 A DD DEL l10yrs NaN NaN right_only ref
11 B DD DEL l10yrs NaN NaN right_only refhttps://stackoverflow.com/questions/69639327
复制相似问题