我需要比较两个数据集:
DF1
Subj 1 2 3
0 Biotech Cell culture Bioinfo Immunology
1 Zoology Cell culture Immunology NaN
2 Math Trigonometry Algebra NaN
3 Microbio Biotech NaN NaN
4 Physics Optics NaN NaNDF2
Subj 1 2
0 Biotech Bioinfo Immunology
1 Zoology Immunology Botany
2 Microbio NaN NaN
3 Physics Optics Quantumphy
4 Math Trigonometry NaN 如何想要我的结果数据:
Subj 1 2
0 Biotech Bioinfo Immunology
1 Zoology Immunology NaN
2 Math Trigonometry NaN
3 Physics Optics NaN 我不能逐行检查数据集,因为数据集很大。这两个数据集的列数不同,但行数相同。由于行元素的顺序也不同,所以我不能简单地使用merge()。我尝试了比较函数,但它要么删除所有公共元素,要么形成包含这两个元素的dataframe。我似乎不能只挑出共同的元素。
发布于 2022-11-15 22:28:57
这里有一种方法
列的Understanding:数变化,两个DF中的值不在同一列下
# Stack both the DFs, after setting Subj as index
# this results in changing a wide format to long format
# concat the two DF to forma new DF
df3=pd.concat([df.set_index('Subj').stack().reset_index().rename(columns={0:'val'}),
df2.set_index('Subj').stack().reset_index().rename(columns={0:'val'})],
).reset_index()
# to find the same topic under a subject if it exists in two DFs
# the join will have duplicate rows
# so find the duplicated rows for Subj and Topic (val column)
# group the duplicated rows and aggregate to a comma separated values
# finally split on comma to create new columns
out=(df3[df3.duplicated(subset=['Subj','val'])]
.groupby('Subj')['val']
.agg(','.join)
.str
.split(',',expand=True).reset_index())
out Subj 0 1
0 Biotech Bioinfo Immunology
1 Math Trigonometry None
2 Physics Optics None
3 Zoology Immunology None发布于 2022-11-15 19:00:32
您可以匹配列,然后将subject列设置为索引,同时合并dataframes:
match=df2.columns.intersection(df1.columns).tolist()
df2.merge(df1,on=match, how='left').reindex(df2.columns,axis=1).set_index('Subj').dropna(how='all')返回:
1 2
Subj
Biotech Bioinfo Immunology
Zoology Immunology NaN
Math Trigonometry NaN
Physics Optics NaNhttps://stackoverflow.com/questions/74450386
复制相似问题