文章/答案/技术大牛

发布

社区首页 >问答首页 >用Pandas解析床文件

问用Pandas解析床文件
EN

Code Review用户

提问于 2020-08-11 07:13:43

回答 1查看 1.2K关注 0票数 6

对于count=1s和起始和结尾之间的部分，合并重叠位置并输出计数的中值。

输入

chr         start   stop   strand   count
chr1        0       13320   -       1
chr1        13320   13321   -       2
chr1        13321   13328   -       1
chr1        13328   13342   -       2
chr1        13342   13343   -       18
chr1        13343   13344   -       36
chr1        13344   13345   -       18
chr1        13345   13346   -       6
chr1        13346   16923   -       1
chr1        16923   16942   -       3
chr1        16942   16943   -       2

输出

chr1        13320   13321    2
chr1        13328   13346   18
chr1        16923   16943   2.5

第二个值：

Start 13328 --这是因为表中的第4个值有start 13328.，这是第二个count=1之后的行。
Stop 13346 -这是因为表中的第8个值有stop 13346. 13346.，这是第三个count=1之前的行。
数18 -这是第4和第8组之间的数的中位数。

这是我的密码。

from pathlib import Path
import pandas as pd
file = Path("bed_file.bed")
# load with pandas
df = pd.read_csv(file, sep='\t', header=None)

# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

# create new dataframe
newDF = pd.DataFrame(columns=['chr','start', 'stop', 'count'])
# last position
end = df.index[-1]

# parse dataframe
for idx, elem in enumerate(li):
    if elem != li[-1]: 
        next_elem = li[(idx + 1) % len(li)] # next element where count=1
        start = df.iloc[elem]['stop'] # start position 
        stop = df.iloc[next_elem-1]['stop'] # stop position
        if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
            count = df.iloc[elem+1]['count']
            #print(f"start={start}\tstop={stop}\tcount={count}")
        else:
            count = df.iloc[elem+1:next_elem]['count'].median()
            #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr' : df.loc[0,'chr'],
            'start' : start,
            'stop' : stop,
            'count' : count
            
        },ignore_index=True)
    else: # last element in the list
        start = df.iloc[elem]['stop']
        stop = df.iloc[end]['stop']
        count = df.iloc[elem+1:end+1]['count'].median()
        #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr' : df.loc[0,'chr'],
            'start' : start,
            'stop' : stop,
            'count' : count
        },ignore_index=True)

有更好的方法吗？我的代码是Pythonic吗？

python

pandas

回答 1

Code Review用户

发布于 2020-08-11 16:56:02

我将首先对您的代码进行一些评论，然后我将向您展示如何处理这个问题。

在请求代码评审#print(f"start={start}\tstop={stop}\tcount={count}")之前，应该删除注释掉的代码。
许多评论并没有增加价值。# last position本身并不意味着什么。你为什么想要最后一个职位？为什么代码解释得不够好呢？
通常，循环中一个分支在开始或结束时只被取一次的if/ end可以被移除。您可以更少地迭代，并显式地处理这种情况。您可以添加一个前哨值，这样您就不必检查迭代器是否在末尾。您可以使用可用的库或内置函数，这将为您处理这种情况。

# load with pandas
df = pd.read_csv(file, sep='\t', header=None)

# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

如果头被剪短了len(df.columns) < len(header)，那么第一件要被切断的就是列df['count']。然后，通过使用它，假设它立即存在。到底是哪一个？它是否总是存在，还是有时会没有足够的列？错误始终存在，代码就会变成

# load with pandas
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

...

for idx, elem in enumerate(li):

如果您使用的是熊猫(或numpy)，那么在库和Python之间来回移动数据通常不是最好的方法。您将失去库的大部分效率，而代码通常会变得可读性差得多。

不要使用像li这样的名字。它不会给读者任何信息。如果您有一个索引列表，您将使用该列表做什么？这样的名字会好得多。

更多地使用熊猫，重命名会给你带来类似的东西

splitting_indices = df.index[df['count'] == 1].tolist()

for idx, elem in enumerate(splitting_indices):

if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
    count = df.iloc[elem+1]['count']
    #print(f"start={start}\tstop={stop}\tcount={count}")
else:
    count = df.iloc[elem+1:next_elem]['count'].median()

在从dataframe获取数据的过程中找到这个逻辑并不容易。这是核心逻辑，应该被视为核心逻辑。至少把它放在一个函数中。

def extract_median(df, elem, next_elem):
    if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
        count = df.iloc[elem+1]['count']
    else:
        count = df.iloc[elem+1:next_elem]['count'].median()
    return count

现在，更明显的是，这一评论是假的。您可以计算单个元素列表的中位数。那么，为什么我们要用这个特殊的大小写呢?即使df.iloc[elem+1:next_elem]比elem+1大一个，next_elem也能工作。

def extract_median(df, elem, next_elem):
    return df.iloc[elem+1:next_elem]['count'].median()

现在我们可以看到，函数可能是不必要的。

我要采取的方法是尽可能长时间地使用熊猫。没有循环。没有收费员。因为我不需要循环，所以索引可能也不需要，所以我可以限制iloc和df.index的使用。

首先，读取数据

df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)

     chr  start   stop strand  count
0   chr1      0  13320      -      1
1   chr1  13320  13321      -      2
2   chr1  13321  13328      -      1
3   chr1  13328  13342      -      2
4   chr1  13342  13343      -     18
5   chr1  13343  13344      -     36
6   chr1  13344  13345      -     18
7   chr1  13345  13346      -      6
8   chr1  13346  16923      -      1
9   chr1  16923  16942      -      3
10  chr1  16942  16943      -      2

然后，找出每一排兴趣。这将是无处不在的count不是1。

df['count'] != 1

0     False
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True

我希望将所有连续的真正行组合在一起。按照列值对连续行进行分组的常用方法是

记录一下记录。
将列中的每个值与下一个值进行比较。
如果他们是一样的，什么都不要做。
如果它们不同，则将1添加到运行的计数中。
将计算结果与该值相关联。
理货组。

用代码

mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()

    count   mask  tally
0       1  False      1
1       2   True      2
2       1  False      3
3       2   True      4
4      18   True      4
5      36   True      4
6      18   True      4
7       6   True      4
8       1  False      5
9       3   True      6
10      2   True      6

分组然后给

df.groupby(tally).groups

{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1], dtype='int64'),
 3: Int64Index([2], dtype='int64'),
 4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
 5: Int64Index([8], dtype='int64'),
 6: Int64Index([9, 10], dtype='int64')}

因为您只需要计数不为1的行，所以我们可以重用掩码来过滤掉它们。

df[mask].groupby(tally).groups

{2: Int64Index([1], dtype='int64'),
 4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
 6: Int64Index([9, 10], dtype='int64')}

最后，从一只石斑鱼身上可以快速地得到中值。

df[mask].groupby(tally).median()

         start     stop  count
count                         
2      13320.0  13321.0    2.0
4      13343.0  13344.0   18.0
6      16932.5  16942.5    2.5

最后，代码要短得多。

df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
df[mask].groupby(tally).median()

票数 4

页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://codereview.stackexchange.com/questions/247757

复制

相似问题

问用Pandas解析床文件
EN

回答 1

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用Pandas解析床文件EN

回答 1

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用Pandas解析床文件
EN