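In Python, I have a large dictionary that contains smaller nested dictionaries.
The dictionary can hold up to 7 nested dictionaries, each with up to 100 exons. Some exons always overlap between them. Here is a relatively small example: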
{
    0: {
        "exon_count": 4,
        "length": 29832,
        "exon": {
            0: {"start": 77, "end": 323},
            1: {"start": 1245, "end": 1507},
            2: {"start": 6225, "end": 6598},
            3: {"start": 29186, "end": 29909},
        },
    },
    1: {
        "exon_count": 3,
        "length": 6688,
        "exon": {
            0: {"start": 0, "end": 323},
            1: {"start": 1245, "end": 1507},
            2: {"start": 6225, "end": 6688},
        },
    },
    2: {
        "exon_count": 4,
        "length": 6688,
        "exon": {
            0: {"start": 0, "end": 323},
            1: {"start": 487, "end": 971},
            2: {"start": 1245, "end": 1507},
            3: {"start": 6225, "end": 6688},
        },
    },
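}
I am struggling to find a way to extract all the overlaps between the exons, that is, only the start and end coordinates that occur in every nested dictionary.
The overlaps in this example are:
77-323
1245-1507
6225-6598
I thought about using a basic for loop, but I don't think it would work well here. Please help!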
Answered 2022-04-07 08:02:10
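You can put all the integers in these ranges into a set, intersect all the sets, and finally convert the resulting set back into a list of ranges. In theory this is not the most efficient approach, because its cost depends on the size of the ranges (end - start). But if they are all about the same size as in your example, that won't matter.
Here is an implementation with some explanatory comments and doctests.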
# EXAMPLE in the doctest below is assumed to be the dictionary from the question.
def overlaps(exons_dict):
    '''
    Computes overlap between all exons in the dictionary. Returns it as a list
    of (start, end) tuples, with start and end inclusive.

    >>> overlaps(EXAMPLE)
    [(77, 323), (1245, 1507), (6225, 6598)]
    '''
    # Handle the case of no exons, otherwise there is no first.
    if not exons_dict:
        return []
    # Create an iterator over all exons.
    exons = iter(exons_dict.values())
    # Initialize the overlap set from the first exon.
    overlap = exon_values(next(exons))
    # Intersect current overlap with each successive exon.
    for exon in exons:
        overlap.intersection_update(exon_values(exon))
    # Turn the overlap set back into a list of ranges.
    return values_to_ranges(overlap)


def exon_values(exon):
    '''
    Given an exon, converts all values in all its ranges into a set.
    '''
    values = set()
    for row in exon["exon"].values():
        # end is inclusive, hence the + 1.
        for i in range(row["start"], row["end"] + 1):
            values.add(i)
    return values


def values_to_ranges(values):
    '''
    Turns a set into a list of (start, end) tuples, with start and end
    inclusive.

    >>> values_to_ranges({7, 6, 1, 8, 3, 2})
    [(1, 3), (6, 8)]
    >>> values_to_ranges({4, 3})
    [(3, 4)]
    >>> values_to_ranges(set())
    []
    '''
    # Sort the values into an ascending list.
    sorted_values = sorted(values)
    # Extract ranges.
    ranges = []
    for value in sorted_values:
        # Does current value belong to latest range?
        if ranges and value == ranges[-1][1] + 1:
            # Yes: extend current range.
            ranges[-1] = (ranges[-1][0], value)
        else:
            # No: create a new range.
            ranges.append((value, value))
    return ranges
Answered 2022-04-07 09:02:10
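This uses recursion to walk the dictionary until the exon key is reached, then collects every interval as a (start, end) pair. The overlaps are found by computing the intersection of every possible pair of intervals (the set -> range -> interval approach is taken from Thomas; the alternative would be an if-else search for the minimum and maximum).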
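import itertools as it

data = ...  # placeholder: substitute the dictionary from the question


# step 1: collect every (start, end) pair found under an 'exon' key
def exon_overlaps(data: dict):
    overlaps = []
    for k in data:
        if isinstance(next_d := data[k], dict):
            if k != 'exon':
                # Not at the exon level yet: recurse into the nested dict.
                overlaps.extend(exon_overlaps(next_d))
            else:
                overlaps.extend((interval['start'], interval['end']) for interval in next_d.values())
                return overlaps
    return overlaps


intervals = exon_overlaps(data)

# step 2: find intersections
overlaps = []
for (a, b), (c, d) in it.combinations(intervals, 2):
    # Identical intervals are skipped, so exact duplicates are not reported.
    if (a, b) != (c, d):
        if (i := set.intersection(set(range(a, b+1)), set(range(c, d+1)))) != set():
            if (interval := f'{min(i)}-{max(i)}') not in overlaps:
                overlaps.append(interval)

print(overlaps)
Output
['77-323', '6225-6598']
Note: the output differs from the one in the question, because the question does not explain how duplicate intervals should be handled consistently.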
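Both answers above expand every range into a set of integers, so their cost grows with the length of the exons. As a minimal sketch of an alternative, the interval lists can be intersected directly with a two-pointer sweep. The function names intersect_two and interval_overlaps are hypothetical, and the sketch assumes that start and end are inclusive and that the exons within a single entry do not overlap each other.

```python
def intersect_two(a, b):
    """Intersect two sorted lists of non-overlapping, inclusive (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        # The candidate overlap of the two current intervals.
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # Advance past whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def interval_overlaps(exons_dict):
    """Return the regions covered by every entry, as inclusive (start, end) tuples."""
    common = None
    for entry in exons_dict.values():
        # Assumption: exons inside one entry are disjoint, so sorting is enough.
        intervals = sorted((e["start"], e["end"]) for e in entry["exon"].values())
        common = intervals if common is None else intersect_two(common, intervals)
    return common or []
```

Called on the dictionary from the question, interval_overlaps returns [(77, 323), (1245, 1507), (6225, 6598)], the same three ranges listed in the question. The work per pair of entries is proportional to the number of exons rather than to their lengths.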
https://stackoverflow.com/questions/71778016