首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >基于距离滤波的数据行分割或删除的有效方法

基于距离滤波的数据行分割或删除的有效方法
EN

Stack Overflow用户
提问于 2022-02-26 22:00:36
回答 1查看 41关注 0票数 0

我有两个长度不等的数据,其中第一个数据行将根据第二个数据帧的范围进行过滤。要获得更好的I/O上下文,请参阅以下文章:Efficient solution of dataframe range filtering based on another ranges

例如:

代码语言:javascript
复制
M = (x,y) = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = (m,n) = [(5,7), (15,16), (15,18), (21,25), (5,25), (5,15)]

Case-1: 
    M = [(10,20)]
    E = [(5,7)]
out: M = [(10,20)] (no change, because out of E range)

Case-2: 
    M = [(10,20)]
    E = [(15,16)]
out: M = [(10,14),(17,20)] (split (10,20) into 2 rows to remove E range inside it)

Case-3: 
    M = [(10,20)]
    E = [(21,25)]
out: M = [(10,20)] (no change, because out of E range)

Case-4: 
    M = [(10,20)]
    E = [(5,25)]
out: M = [] (delete because totally inclusive within range of E)

Case-5: 
    M = [(10,20)]
    E = [(5,15)]
out: M = [(16,20)] (because (16,20) isn't E range inclusive)

Case-6: 
    M = [(10,20)]
    E = [(13,20)]
out: M = [(10,12)] (because (10,12) isn't E range inclusive)

针对上述情况,我制定了以下算法:

代码语言:javascript
复制
M = (x,y) 
E = (m,n) 
if (m<=x): 
    if(y<=n):
        delete the row
    elif (x<=n):
        (start, end) = (n+1,y)
    else:
        continue
else:
    if(y>=n):
        (start, end) = (x,m-1)
        (start, end) = (n-1,y)
    elif (y>=m):
        (start, end) = (x,m-1)
    else:
        continue

但我想利用NumPy和熊猫的结合来实现它:

代码语言:javascript
复制
df1 = pd.read_csv('a.tsv', sep='\t') #main dataframe which I want to filter
temp_bed = bedfile # 2nd dataframe based on which I need to filter
# for array broadcasting
m = temp_bed['first.start'].to_numpy()[:, None]
n = temp_bed['first.end'].to_numpy()[:, None]

# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []

while offset < len(df1):
    x = df1['first.start'].to_numpy()[offset:offset+chunk_size] #main
    y = df1['first.end'].to_numpy()[offset:offset+chunk_size]
    mask.append(
# necessary logical conditions #####
# but the problem is with splitting the rows or ranges
        ((m <= x) & (n >= y)).any(axis=0)
    )    

    offset += chunk_size
import numpy as np
mask = np.hstack(mask)
df1[mask]

有人能给我一个基于上述条件的拆分、删除或忽略数据行的有效解决方案吗?

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-02-27 21:16:40

你的案子可以用Numpy和Pandas一起解决。

为了得到所有情况的结果,我将M和E相加一对:

代码语言:javascript
复制
M = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = [( 5, 7), (15,16), (15,18), (21,25), ( 5,25), ( 5,15), (13,20)]

要获得结果,可以作为列表和对序列(包括连续数字的范围)运行:

代码语言:javascript
复制
for m1, m2, e1, e2 in np.hstack([M, E]):
    s = pd.Index(np.arange(m1, m2 + 1)).difference(pd.Index(
        np.arange(e1, e2 + 1))).to_series()
    rng = s.groupby(s.subtract(s.shift()).gt(1).cumsum()).apply(
        lambda grp: (grp.iloc[0], grp.iloc[-1]))
    print(f'{m1:2}, {m2:2},  {e1:2}, {e2:2}\n{s.tolist()}\n{rng.values}\n')

对于我得到的上述源数据:

代码语言:javascript
复制
10, 20,   5,  7
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]

10, 20,  15, 16
[10, 11, 12, 13, 14, 17, 18, 19, 20]
[(10, 14) (17, 20)]

10, 20,  15, 18
[10, 11, 12, 13, 14, 19, 20]
[(10, 14) (19, 20)]

10, 20,  21, 25
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]

10, 20,   5, 25
[]
[]

10, 20,   5, 15
[16, 17, 18, 19, 20]
[(16, 20)]

10, 20,  13, 20
[10, 11, 12]
[(10, 12)]

每个3行序列包含:

  • 原始数据(来自M的一对数据和来自E的一对数据),
  • 表示结果为数字列表,
  • 为结果的序列。
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/71280600

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档