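I have a pandas DataFrame:
>>df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11","",""], ["Z11","Z11",""]], columns=list('ABC'))
>>df_freq
     A    B    C
0  Z11  Z11  X11
1  Y11
2  Z11  Z11
I want to make sure each row contains only unique values. Removed values can be replaced with zero or an empty string, so the result should look like this:
     A  B    C
0  Z11  0  X11
1  Y11
2  Z11  0
My DataFrame is large, with hundreds of columns and thousands of rows. The goal is to count the unique values in the DataFrame. I do this by converting the DataFrame to a matrix and applying
>>np.unique(mat.astype(str), return_counts=True)
However, some rows contain the same value more than once, and I want to remove those repeats before applying np.unique(). I want to keep only the unique values in each row.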
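To make the problem concrete, here is a minimal sketch of the counting step on the small frame above. It assumes that `mat` is simply the frame's underlying array (obtained here with `to_numpy()`); the name `raw_counts` is only for illustration.

```python
import numpy as np
import pandas as pd

df_freq = pd.DataFrame(
    [["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]],
    columns=list("ABC"),
)

# Assumption: "mat" is just the DataFrame's values as a NumPy array.
mat = df_freq.to_numpy()

# np.unique flattens the array and counts every occurrence.
values, counts = np.unique(mat.astype(str), return_counts=True)
raw_counts = {str(v): int(c) for v, c in zip(values, counts)}
print(raw_counts)
# {'': 3, 'X11': 1, 'Y11': 1, 'Z11': 4}
```

Z11 is counted four times even though it appears in only two rows, which is why the repeats within a row have to be removed first.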
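Answered 2017-05-10 17:24:02
Use a combination of astype(bool) and duplicated. astype(bool) is False for empty strings, so blanks are never masked:
mask = df_freq.apply(pd.Series.duplicated, axis=1) & df_freq.astype(bool)
df_freq.mask(mask, 0)
     A  B    C
0  Z11  0  X11
1  Y11
2  Z11  0
Answered 2017-05-10 17:36:37
Here is a vectorized NumPy approach: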
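def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]
    a = df.values
    # Sort each row so that equal values sit next to each other
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    # Inverse permutation, to map the mask back to the original column order
    idx_reversed = idx.argsort(1)
    # True where a sorted entry equals its left neighbour
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    # Leave empty strings untouched
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '' )
    # Assumes df.values is a writable view of df, so df is modified in place
    a[final_mask] = 0
Sample run: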
In [80]: df_freq
Out[80]:
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11
In [81]: reset_rowwise_dups(df_freq)
In [82]: df_freq
Out[82]:
     A  B    C    D
0  Z11  0  X11    0
1  Y11            0
2  Z11  0       X11
Runtime test
# Proposed earlier in this post
def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]
    a = df.values
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    idx_reversed = idx.argsort(1)
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '' )
    a[final_mask] = 0
# @piRSquared's soln using pandas apply
def apply_based(df):
    mask = df.apply(pd.Series.duplicated, axis=1) & df.astype(bool)
    return df.mask(mask, 0)
Timings:
In [151]: df_freq = pd.DataFrame([["Z11", "Z11", "X11", "Z11"], \
...: ["Y11","","", "Y11"],["Z11","Z11","","X11"]], columns=list('ABCD'))
In [152]: df_freq
Out[152]:
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11
In [153]: df = pd.concat([df_freq]*10000,axis=0)
In [154]: df.index = range(df.shape[0])
In [155]: %timeit apply_based(df)
1 loops, best of 3: 3.35 s per loop
In [156]: %timeit reset_rowwise_dups(df)
100 loops, best of 3: 12.7 ms per loop
Answered 2017-05-10 17:26:54
def replaceDuplicateData(nestedList):
    for row in range(len(nestedList)):
        uniqueDataRow = []
        for col in range(len(nestedList[row])):
            if nestedList[row][col] not in uniqueDataRow:
                uniqueDataRow.append(nestedList[row][col])
            else:
                nestedList[row][col] = 0
    return nestedList
nestedList = [["Z11", "Z11", "X11"], ["Y11","",""], ["Z11","Z11",""]]
print(replaceDuplicateData(nestedList))
Basically, you can use the function above to eliminate duplicates in the matrix.
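Note that the loop above also replaces repeated empty strings with 0 and modifies the input list in place. If blanks should stay blank, as in the expected output from the question, a variant along the following lines might work. This is only a sketch: the function name `replace_row_duplicates` and its `fill` and `blank` parameters are made up for illustration, and it assumes the values are hashable so that a set can be used for the membership test.

```python
def replace_row_duplicates(rows, fill=0, blank=""):
    """Return a copy of rows with repeats inside each row replaced by fill.

    Blank entries are left as they are, even when they repeat.
    """
    result = []
    for row in rows:
        seen = set()  # values already encountered in this row
        new_row = []
        for value in row:
            if value != blank and value in seen:
                new_row.append(fill)
            else:
                seen.add(value)
                new_row.append(value)
        result.append(new_row)
    return result


nested = [["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]]
deduped = replace_row_duplicates(nested)
print(deduped)
# [['Z11', 0, 'X11'], ['Y11', '', ''], ['Z11', 0, '']]
```

Because it builds a new list, the original `nested` data stays unchanged, which can be convenient when the raw values are still needed afterwards.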
https://stackoverflow.com/questions/43898903