我有一个数据文件: df1
SAP_Name SAP_Class SAP_Sec
Avi 5 Avi
Rison 6 A
Slesh 7 B
San 8 C
Sud 7 B df2:
Name_Fi Class std
Avi 5 fgh
Rison 6 Rij
Slesh 7 jkh我正在尝试获取标头的映射,这些值包含在两个数据文件的列中。
当使用以下代码时:
dfs1 = {col1: df1[col1].drop_duplicates() for col1 in df1.columns}
set2 = {col2: set(df2[col2]) for col2 in df2.columns}
d = {}
for col2, v2 in set2.items():
for col1, v1 in dfs1.items():
cond = set(v2).issubset(v1)
if cond:
d[col2] = col1
print(d)我的结果是:
{'Name_Fi': 'SAP_Name', 'Class': 'SAP_Class'}期望产出:
{'Name_Fi': ['SAP_Name','SAP_Sec'], 'Class': 'SAP_Class'}基本上,只要在df1中有匹配的值,就可以打印每个匹配的列标题。伤寒能帮上忙吗?
发布于 2018-07-30 09:14:57
我认为需要在intersection之间使用set,因此为了性能起见,在dict comprehension中将字典的两个值转换为set:
from collections import defaultdict
dfs1 = {col1: set(df1[col1].drop_duplicates()) for col1 in df1.columns}
set2 = {col2: set(df2[col2]) for col2 in df2.columns}
d = defaultdict(list)
for col2, v2 in set2.items():
for col1, v1 in dfs1.items():
cond = v2.intersection(v1)
if cond:
d[col2].append(col1)
print(d)
defaultdict(<class 'list'>, {'Name_Fi': ['SAP_Name', 'SAP_Sec'], 'Class': ['SAP_Class']})发布于 2018-07-30 09:15:24
这里不需要issubset,因为只有当一列中的值是另一列值的子集时才能工作,例如,df2['Name_Fi']的所有值都包含在df1['SAP_Sec']中。相反,您可以使用set.intersection或语法糖&来检查是否存在重叠。
您可以将这个想法与collections.defaultdict结合起来,提取您想要的结果:
from collections import defaultdict
d = defaultdict(list)
for col2, v2 in set2.items():
for col1, v1 in dfs1.items():
cond = set(v2) & set(v1)
if cond:
d[col2].append(col1)
print(d)
defaultdict(<class 'list'>, {'Name_Fi': ['SAP_Name', 'SAP_Sec'],
'Class': ['SAP_Class']})更有效地,您可以在理解中计算set对象,以避免嵌套循环中的显式转换:
from collections import defaultdict
dfs1 = {col1: set(df1[col1]) for col1 in df1}
set2 = {col2: set(df2[col2]) for col2 in df2}
d = defaultdict(list)
for col2, v2 in set2.items():
for col1, v1 in dfs1.items():
if v2 & v1:
d[col2].append(col1)
print(d)对于较大的数组,您可能会发现使用pd.Series.unique + any和生成器表达式更有效:
import pandas as pd
import numpy as np
np.random.seed(0)
A = pd.Series(np.random.randint(0, 1000, 10000)).unique()
B = pd.Series(np.random.randint(0, 100000, 10000)).unique()
%timeit np.in1d(B, A).any() # 1.1 ms per loop
%timeit set(A) & set(B) # 1.46 ms per loop
%timeit any(i in B for i in A) # 102 µs per loophttps://stackoverflow.com/questions/51590328
复制相似问题