Finding overlaps between multiple nested dictionaries

Stack Overflow user
Asked on 2022-04-07 07:29:53
Answers: 2, Views: 70, Followers: 0, Votes: 3

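In Python, I have a large dictionary that contains smaller nested dictionaries.

This dictionary may hold up to 7 nested dictionaries, with up to 100 exons. There are always overlaps between some of the exons. Here is a relatively small example: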

{
    0: {
        "exon_count": 4,
        "length": 29832,
        "exon": {
            0: {"start": 77, "end": 323},
            1: {"start": 1245, "end": 1507},
            2: {"start": 6225, "end": 6598},
            3: {"start": 29186, "end": 29909},
        },
    },
    1: {
        "exon_count": 3,
        "length": 6688,
        "exon": {
            0: {"start": 0, "end": 323},
            1: {"start": 1245, "end": 1507},
            2: {"start": 6225, "end": 6688},
        },
    },
    2: {
        "exon_count": 4,
        "length": 6688,
        "exon": {
            0: {"start": 0, "end": 323},
            1: {"start": 487, "end": 971},
            2: {"start": 1245, "end": 1507},
            3: {"start": 6225, "end": 6688},
        },
    },
}

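I'm having trouble finding a way to extract all overlaps between the exons, that is, only the start and end coordinates that occur in every nested dictionary.

These are the overlaps in this example: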

77-323
1245-1507
6225-6598

I thought about using a basic for loop, but I don't think it would work well here. Please help!

2 Answers

Stack Overflow user

Accepted answer

Posted on 2022-04-07 08:02:10

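You can put all the integers in those ranges into a set, intersect all the sets, and finally convert the resulting set back into a list of ranges. In theory this is not the most efficient approach, because its cost depends on the size of the ranges (end - start). But if they are all roughly the size of those in your example, it won't matter.

Here is an implementation with some explanatory comments and doctests.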

def overlaps(exons_dict):
    '''
    Computes overlap between all exons in the dictionary. Returns it as a list
    of (start, end) tuples, with start and end inclusive. EXAMPLE in the
    doctest below stands for the dictionary from the question.

    >>> overlaps(EXAMPLE)
    [(77, 323), (1245, 1507), (6225, 6598)]
    '''
    # Handle the case of no exons, otherwise there is no first.
    if not exons_dict:
        return []
    # Create an iterator over all exons.
    exons = iter(exons_dict.values())
    # Initialize the overlap set from the first exon.
    overlap = exon_values(next(exons))
    # Intersect current overlap with each successive exon.
    for exon in exons:
        overlap.intersection_update(exon_values(exon))
    # Turn the overlap set back into a list of ranges.
    return values_to_ranges(overlap)

def exon_values(exon):
    '''
    Given an exon, converts all values in all its ranges into a set.
    '''
    values = set()
    for row in exon["exon"].values():
        for i in range(row["start"], row["end"] + 1):
            values.add(i)
    return values

def values_to_ranges(values):
    '''
    Turns a set into a list of (start, end) tuples, with start and end
    inclusive.

    >>> values_to_ranges({7, 6, 1, 8, 3, 2})
    [(1, 3), (6, 8)]
    >>> values_to_ranges({4, 3})
    [(3, 4)]
    >>> values_to_ranges(set())
    []
    '''
    # Sort the values into an ascending list.
    sorted_values = sorted(values)
    # Extract ranges.
    ranges = []
    for value in sorted_values:
        # Does current value belong to latest range?
        if ranges and value == ranges[-1][1] + 1:
            # Yes: extend current range.
            ranges[-1] = (ranges[-1][0], value)
        else:
            # No: create a new range.
            ranges.append((value, value))
    return ranges
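
The efficiency caveat mentioned in this answer can be sidestepped by intersecting the intervals directly instead of expanding them into sets of integers. The following is only a sketch and not part of the original answer: the names `intersect_interval_lists` and `overlaps_by_intervals` are hypothetical, ends are treated as inclusive, and each entry's exons are assumed not to overlap one another (as in the question's example).

```python
from functools import reduce


def intersect_interval_lists(a, b):
    """Intersect two sorted lists of inclusive (start, end) tuples."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        # The candidate overlap of the two current intervals.
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def overlaps_by_intervals(exons_dict):
    """Return the ranges covered by every entry, as (start, end) tuples."""
    # One sorted list of (start, end) tuples per top-level entry.
    interval_lists = [
        sorted((row["start"], row["end"]) for row in entry["exon"].values())
        for entry in exons_dict.values()
    ]
    if not interval_lists:
        return []
    return reduce(intersect_interval_lists, interval_lists)
```

The work here grows with the number of exons rather than with their lengths. One difference from the set-based version: two directly adjacent ranges such as (1, 3) and (4, 6) stay separate here, whereas `values_to_ranges` would merge them into (1, 6).
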
Votes: 2

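Stack Overflow user

Posted on 2022-04-07 09:02:10

Traverse the dictionary recursively until the exon key is reached, then append all the "intervals" as pairs. The overlaps are found by computing the intersection of every possible pair (the set -> range -> interval approach is taken from Thomas; otherwise an if-else search would be needed to find the minimum and maximum).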

import itertools as it

data = {}  # placeholder: paste the dictionary from the question here

def exon_overlaps(data: dict):
    overlaps = []

    for k in data:
        if isinstance(next_d := data[k], dict):
            if k != 'exon':
                overlaps.extend(exon_overlaps(next_d))
            else:
                overlaps.extend((interval['start'], interval['end']) for interval in next_d.values())
                return overlaps
    return overlaps


intervals = exon_overlaps(data)

# step 2: find intersections
overlaps = []
for (a, b), (c, d) in it.combinations(intervals, 2):
    if (a, b) != (c, d):
        if (i := set.intersection(set(range(a, b+1)), set(range(c, d+1)))) != set():
            if (interval := f'{min(i)}-{max(i)}') not in overlaps:
                overlaps.append(interval)

print(overlaps)

Output

['77-323', '6225-6598']

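Note: the output differs from the one in the question because the question does not explain how duplicate intervals should be handled consistently.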
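
To make the note concrete, here is a small sketch of how the treatment of identical intervals changes the pairwise result. It is an illustration only: `pairwise_overlaps` and its `skip_identical` flag are hypothetical names, the interval list is the question's example flattened by hand, and plain `max`/`min` arithmetic stands in for the set-based intersection used in the answer.

```python
from itertools import combinations

# (start, end) pairs from the question's example, in traversal order.
intervals = [
    (77, 323), (1245, 1507), (6225, 6598), (29186, 29909),  # entry 0
    (0, 323), (1245, 1507), (6225, 6688),                   # entry 1
    (0, 323), (487, 971), (1245, 1507), (6225, 6688),       # entry 2
]


def pairwise_overlaps(intervals, skip_identical):
    """Collect the distinct overlaps of every pair of inclusive intervals."""
    found = []
    for (a, b), (c, d) in combinations(intervals, 2):
        # Optionally ignore pairs whose two intervals are exactly equal.
        if skip_identical and (a, b) == (c, d):
            continue
        start, end = max(a, c), min(b, d)
        if start <= end and (start, end) not in found:
            found.append((start, end))
    return found


print(pairwise_overlaps(intervals, skip_identical=True))
print(pairwise_overlaps(intervals, skip_identical=False))
```

With identical pairs skipped, 1245-1507 is lost because it appears unchanged in every entry. With them included it comes back, but so do 0-323 and 6225-6688, which are shared by only two of the three entries. A pairwise comparison therefore answers a different question from "present in every nested dictionary".
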

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/71778016
