我有两个长度不等的数据,其中第一个数据行将根据第二个数据帧的范围进行过滤。要获得更好的I/O上下文,请参阅以下文章:Efficient solution of dataframe range filtering based on another ranges
例如:
M = (x,y) = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = (m,n) = [(5,7), (15,16), (15,18), (21,25), (5,25), (5,15)]
Case-1:
M = [(10,20)]
E = [(5,7)]
out: M = [(10,20)] (no change, because out of E range)
Case-2:
M = [(10,20)]
E = [(15,16)]
out: M = [(10,14),(17,20)] (split (10,20) into 2 rows to remove E range inside it)
Case-3:
M = [(10,20)]
E = [(21,25)]
out: M = [(10,20)] (no change, because out of E range)
Case-4:
M = [(10,20)]
E = [(5,25)]
out: M = [] (delete because totally inclusive within range of E)
Case-5:
M = [(10,20)]
E = [(5,15)]
out: M = [(16,20)] (because (16,20) isn't E range inclusive)
Case-6:
M = [(10,20)]
E = [(13,20)]
out: M = [(10,12)] (because (10,12) isn't E range inclusive)针对上述情况,我制定了以下算法:
M = (x,y)
E = (m,n)
if (m<=x):
if(y<=n):
delete the row
elif (x<=n):
(start, end) = (n+1,y)
else:
continue
else:
if(y>=n):
(start, end) = (x,m-1)
(start, end) = (n-1,y)
elif (y>=m):
(start, end) = (x,m-1)
else:
continue但我想利用NumPy和熊猫的结合来实现它:
df1 = pd.read_csv('a.tsv', sep='\t') #main dataframe which I want to filter
temp_bed = bedfile # 2nd dataframe based on which I need to filter
# for array broadcasting
m = temp_bed['first.start'].to_numpy()[:, None]
n = temp_bed['first.end'].to_numpy()[:, None]
# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []
while offset < len(df1):
x = df1['first.start'].to_numpy()[offset:offset+chunk_size] #main
y = df1['first.end'].to_numpy()[offset:offset+chunk_size]
mask.append(
# necessary logical conditions #####
# but the problem is with splitting the rows or ranges
((m <= x) & (n >= y)).any(axis=0)
)
offset += chunk_size
import numpy as np
mask = np.hstack(mask)
df1[mask]有人能给我一个基于上述条件的拆分、删除或忽略数据行的有效解决方案吗?
发布于 2022-02-27 21:16:40
你的案子可以用Numpy和Pandas一起解决。
为了得到所有情况的结果,我将M和E相加一对:
M = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = [( 5, 7), (15,16), (15,18), (21,25), ( 5,25), ( 5,15), (13,20)]要获得结果,可以作为列表和对序列(包括连续数字的范围)运行:
for m1, m2, e1, e2 in np.hstack([M, E]):
s = pd.Index(np.arange(m1, m2 + 1)).difference(pd.Index(
np.arange(e1, e2 + 1))).to_series()
rng = s.groupby(s.subtract(s.shift()).gt(1).cumsum()).apply(
lambda grp: (grp.iloc[0], grp.iloc[-1]))
print(f'{m1:2}, {m2:2}, {e1:2}, {e2:2}\n{s.tolist()}\n{rng.values}\n')对于我得到的上述源数据:
10, 20, 5, 7
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 15, 16
[10, 11, 12, 13, 14, 17, 18, 19, 20]
[(10, 14) (17, 20)]
10, 20, 15, 18
[10, 11, 12, 13, 14, 19, 20]
[(10, 14) (19, 20)]
10, 20, 21, 25
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 5, 25
[]
[]
10, 20, 5, 15
[16, 17, 18, 19, 20]
[(16, 20)]
10, 20, 13, 20
[10, 11, 12]
[(10, 12)]每个3行序列包含:
https://stackoverflow.com/questions/71280600
复制相似问题