Q: Group rows where the value of a column is within a range in a pandas df

Stack Overflow user

Asked 2018-09-20 09:30:47

I have a pandas df:

number  sample  chrom1  start   chrom2  end
1   s1  1   0   2   1500
2   s1  2   10  2   50
19  s2  3   3098318 3   3125700
19  s3  3   3098720 3   3125870
20  s4  3   3125694 3   3126976
20  s1  3   3125694 3   3126976
20  s1  3   3125695 3   3126976
20  s5  3   3125700 3   3126976
21  s3  3   3125870 3   3134920
22  s2  3   3126976 3   3135039
24  s5  3   17286051    3   17311472
25  s2  3   17286052    3   17294628
26  s4  3   17286052    3   17311472
26  s1  3   17286052    3   17311472
27  s3  3   17286405    3   17294550
28  s4  3   17293197    3   17294628
28  s1  3   17293197    3   17294628
28  s5  3   17293199    3   17294628
29  s2  3   17294628    3   17311472

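I'm trying to group together rows that have different numbers, but where the start is within +/- 10 AND the end is also within +/- 10 on the same chromosomes.

In this example I want to find these two rows: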

24  s5  3   17286051    3   17311472
26  s4  3   17286052    3   17311472

Both have the same chrom1 [3] and chrom2 [3], and their start and end values are within +/- 10 of each other. I want to group them under the same number:

24  s5  3   17286051    3   17311472
24  s4  3   17286052    3   17311472 # Change the number to the first seen in this series

Here is what I've tried:

import pandas as pd
from collections import defaultdict


def parse_vars(inFile):

    df = pd.read_csv(inFile, delimiter="\t")
    df = df[['number', 'chrom1', 'start', 'chrom2', 'end']]

    vars = {}

    seen_l = defaultdict(lambda: defaultdict(dict)) # To track the `starts`
    seen_r = defaultdict(lambda: defaultdict(dict)) # To track the `ends`

    for index in df.index:
        event = df.loc[index, 'number']
        c1 = df.loc[index, 'chrom1']
        b1 = int(df.loc[index, 'start'])
        c2 = df.loc[index, 'chrom2']
        b2 = int(df.loc[index, 'end'])

        print [event, c1, b1, c2, b2]

        vars[event] = [c1, b1, c2, b2]

        # Iterate over windows +/- 10
        for i, j in zip( range(b1-10, b1+10), range(b2-10, b2+10) ):
            # if : 
            # i in seen_l[c1] AND
            # j in seen_r[c2] AND
            # the 'number' for these two instances is the same: 

            if i in seen_l[c1] and j in seen_r[c2] and seen_l[c1][i] == seen_r[c2][j]:
                print seen_l[c1][i], seen_r[c2][j]
                if seen_l[c1][i] != event: print"Seen: %s %s in event %s %s" % (event, [c1, b1, c2, b2], seen_l[c1][i], vars[seen_l[c1][i]])

        seen_l[c1][b1] = event
        seen_r[c2][b2] = event

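The problem I'm having is that seen_l[3][17286052] exists for both numbers 25 and 26. Each assignment to seen_l[c1][b1] overwrites the previous one, and because the corresponding seen_r events (seen_r[3][17294628] = 25, seen_r[3][17311472] = 26) are not equal, I'm unable to join these rows together.

Is there a way to use a list of start values as a nested key for the seen_l dict?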
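One way to read the idea above is to keep a list of row ids under each coordinate rather than a single number, so that a later row sharing the same start does not overwrite an earlier one. The following is only a minimal sketch under that assumption; the function name find_matches, the slack parameter, and the tuple layout of rows are hypothetical and not part of the original code.

```python
from collections import defaultdict


def find_matches(rows, slack=10):
    """Return sorted (earlier_number, later_number) pairs for rows whose start
    and end both lie within `slack` of an earlier row on the same chromosomes.

    `rows` is an iterable of (number, chrom1, start, chrom2, end) tuples
    (a hypothetical layout chosen for this sketch).
    """
    seen_l = defaultdict(lambda: defaultdict(list))  # chrom1 -> start -> [row ids]
    seen_r = defaultdict(lambda: defaultdict(list))  # chrom2 -> end -> [row ids]
    numbers = []   # numbers[row_id] is the event number of that row
    pairs = set()

    for row_id, (number, c1, b1, c2, b2) in enumerate(rows):
        numbers.append(number)
        near_l, near_r = set(), set()
        for offset in range(-slack, slack + 1):  # inclusive +/- slack window
            near_l.update(seen_l[c1].get(b1 + offset, []))
            near_r.update(seen_r[c2].get(b2 + offset, []))
        # A row id present in both sets matched on start AND end.
        for other in near_l & near_r:
            if numbers[other] != number:
                pairs.add((numbers[other], number))
        # Append instead of assigning, so shared coordinates keep every row.
        seen_l[c1][b1].append(row_id)
        seen_r[c2][b2].append(row_id)
    return sorted(pairs)


rows = [
    (24, 3, 17286051, 3, 17311472),
    (25, 3, 17286052, 3, 17294628),
    (26, 3, 17286052, 3, 17311472),
]
print(find_matches(rows))  # [(24, 26)]
```

Tracking row ids rather than event numbers avoids pairing the start of one row with the end of a different row that happens to carry the same number.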

python

python-2.7

pandas

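1 Answer

Stack Overflow user

Answered 2020-04-22 09:10:52

Interval overlaps are easy in pyranges. Most of the code below separates the starts and the ends into two different dfs. These are then joined based on an interval overlap of +/- 10: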

from io import StringIO
import pandas as pd
import pyranges as pr

c = """number  sample  chrom1  start   chrom2  end
1   s1  1   0   2   1500
2   s1  2   10  2   50
19  s2  3   3098318 3   3125700
19  s3  3   3098720 3   3125870
20  s4  3   3125694 3   3126976
20  s1  3   3125694 3   3126976
20  s1  3   3125695 3   3126976
20  s5  3   3125700 3   3126976
21  s3  3   3125870 3   3134920
22  s2  3   3126976 3   3135039
24  s5  3   17286051    3   17311472
25  s2  3   17286052    3   17294628
26  s4  3   17286052    3   17311472
26  s1  3   17286052    3   17311472
27  s3  3   17286405    3   17294550
28  s4  3   17293197    3   17294628
28  s1  3   17293197    3   17294628
28  s5  3   17293199    3   17294628
29  s2  3   17294628    3   17311472"""

df = pd.read_table(StringIO(c), sep=r"\s+")  # raw string avoids an invalid-escape warning

df1 = df[["chrom1", "start", "number", "sample"]]
df1.insert(2, "end", df.start + 1)

df2 = df[["chrom2", "end", "number", "sample"]]
df2.insert(2, "start", df.end - 1)

names = ["Chromosome", "Start", "End", "number", "sample"]
df1.columns = names
df2.columns = names

gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)

j = gr1.join(gr2, slack=10)
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# | Chromosome   | Start     | End       | number    | sample     | Start_b   | End_b     | number_b   | sample_b   |
# | (category)   | (int32)   | (int32)   | (int64)   | (object)   | (int32)   | (int32)   | (int64)    | (object)   |
# |--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------|
# | 3            | 3125694   | 3125695   | 20        | s4         | 3125700   | 3125699   | 19         | s2         |
# | 3            | 3125694   | 3125695   | 20        | s1         | 3125700   | 3125699   | 19         | s2         |
# | 3            | 3125695   | 3125696   | 20        | s1         | 3125700   | 3125699   | 19         | s2         |
# | 3            | 3125700   | 3125701   | 20        | s5         | 3125700   | 3125699   | 19         | s2         |
# | ...          | ...       | ...       | ...       | ...        | ...       | ...       | ...        | ...        |
# | 3            | 17294628  | 17294629  | 29        | s2         | 17294628  | 17294627  | 25         | s2         |
# | 3            | 17294628  | 17294629  | 29        | s2         | 17294628  | 17294627  | 28         | s5         |
# | 3            | 17294628  | 17294629  | 29        | s2         | 17294628  | 17294627  | 28         | s1         |
# | 3            | 17294628  | 17294629  | 29        | s2         | 17294628  | 17294627  | 28         | s4         |
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# Unstranded PyRanges object has 13 rows and 9 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.

# to get the data as a pandas df:
jdf = j.df
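
For comparison, the renumbering described in the question (use the first number seen in a series) could also be sketched in plain pandas without pyranges. This is only an illustration under stated assumptions: the helper regroup and the small demo frame are hypothetical, and each row is compared only against the first row of every group, so matches are not chained from row to row.

```python
import pandas as pd


def regroup(df, slack=10):
    """Return a copy of `df` in which each row takes the `number` of the first
    earlier group whose start and end are both within `slack` on the same
    chrom1/chrom2 pair. Rows without such a group keep their own number."""
    reps = []          # (chrom1, start, chrom2, end, number) of each group's first row
    new_numbers = []
    for row in df.itertuples(index=False):
        for c1, start, c2, end, number in reps:
            if (row.chrom1 == c1 and row.chrom2 == c2
                    and abs(row.start - start) <= slack
                    and abs(row.end - end) <= slack):
                new_numbers.append(number)
                break
        else:
            # No existing group matched: this row starts a new group.
            reps.append((row.chrom1, row.start, row.chrom2, row.end, row.number))
            new_numbers.append(row.number)
    out = df.copy()
    out["number"] = new_numbers
    return out


demo = pd.DataFrame({
    "number": [24, 25, 26, 26],
    "sample": ["s5", "s2", "s4", "s1"],
    "chrom1": [3, 3, 3, 3],
    "start": [17286051, 17286052, 17286052, 17286052],
    "chrom2": [3, 3, 3, 3],
    "end": [17311472, 17294628, 17311472, 17311472],
})
print(regroup(demo)["number"].tolist())  # [24, 25, 24, 24]
```

The loop is quadratic in the worst case, which is acceptable for small tables; for large inputs an interval library such as pyranges is likely the better choice.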

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/52421855
