Inverting a list of indices

Stack Overflow user
Asked on 2016-07-02 17:40:07
Answers: 7 · Views: 408 · Followers: 0 · Votes: 1

I have a list of indices, for example:

a = [
    [2],
    [0, 1, 3, 2],
    [1],
    [0, 3]
    ]

I would now like to "invert" this list: the number 0 appears at indices 1 and 3 of a, so b[0] is [1, 3], and so on:

b = [
    [1, 3],
    [1, 2],
    [0, 1],
    [1, 3]
    ]

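Any suggestions on how to do this quickly? (The lists I am dealing with may be large.)

Bonus: I know that every index appears exactly twice in a (as in the example above).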

7 Answers

Stack Overflow user

Accepted answer

Posted on 2016-07-02 17:54:55

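This code does not rely on each number appearing exactly twice. It is also quite simple, and it avoids the overhead of building a dictionary and then copying the results out of it: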

a = [
        [2],
        [0, 1, 3, 2],
        [1],
        [0, 3]
    ]

b = []

for i, nums in enumerate(a):

    # For each number found at this index
    for num in nums:


        # If needed, extend b to cover the new needed range
        b += [[] for _ in range(num + 1 - len(b))]

        # Store the index
        b[num].append(i)

print(b)

# Output:
# [[1, 3], [1, 2], [0, 1], [1, 3]]
Votes: 5

Stack Overflow user

Posted on 2016-07-02 17:43:32

Use a dictionary to collect the inverted indices, with enumerate() generating the index of each entry of a:

inverted = {}
for index, numbers in enumerate(a):
    for number in numbers:
        inverted.setdefault(number, []).append(index)

b = [inverted.get(i, []) for i in range(max(inverted) + 1)]

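The dictionary gives you efficient random access for adding the inverted entries, but it does mean you have to account for indices that may be missing from the inverted mapping. That is why the last line loops over range(max(inverted) + 1): it covers every index from 0 through the maximum.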

Demo:

>>> a = [
...     [2],
...     [0, 1, 3, 2],
...     [1],
...     [0, 3]
...     ]
>>> inverted = {}
>>> for index, numbers in enumerate(a):
...     for number in numbers:
...         inverted.setdefault(number, []).append(index)
...
>>> [inverted.get(i, []) for i in range(max(inverted) + 1)]
[[1, 3], [1, 2], [0, 1], [1, 3]]
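
The demo input has no gaps, so the missing-index case never shows up in it. As a minimal sketch, the same approach can be wrapped in a function (the name invert_with_gaps is hypothetical, and the guard for empty input is an addition not present in the answer) and run on an input where the values 1 and 2 never occur:

```python
def invert_with_gaps(a):
    """Dictionary-based inversion; values that never occur get an empty list."""
    inverted = {}
    for index, numbers in enumerate(a):
        for number in numbers:
            inverted.setdefault(number, []).append(index)
    if not inverted:  # max() would raise ValueError on an empty dictionary
        return []
    return [inverted.get(i, []) for i in range(max(inverted) + 1)]


# Values 1 and 2 never occur, so their slots stay empty.
print(invert_with_gaps([[0], [3]]))  # [[0], [], [], [1]]
```

Without the .get(i, []) fallback, the final comprehension would raise a KeyError for the missing values.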
Votes: 5

Stack Overflow user

Posted on 2016-07-02 17:49:29

Assuming each index appears exactly twice, the following code works:

from itertools import chain

a = [[2],
     [0, 1, 3, 2],
     [1],
     [0, 3]]

b = (max(chain(*a)) + 1) * [None]

for i, lst in enumerate(a):
    for j in lst:
        if not b[j]:
            b[j] = [i, None]
        else:
            b[j][1] = i

As @smarx pointed out, if we further assume that len(a) gives the range of the values, as in the example, the solution above can be simplified to:

a = [[2],
     [0, 1, 3, 2],
     [1],
     [0, 3]]

b = len(a) * [None]

for i, lst in enumerate(a):
    for j in lst:
        if not b[j]:
            b[j] = [i, None]
        else:
            b[j][1] = i
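
A note on pre-allocation: multiplying a list that holds a mutable element repeats references to one and the same object. Filling the slots with None and rebinding each one later, as the first snippet of this answer does, avoids the problem. The following sketch (variable names are illustrative only) shows the difference:

```python
n = 3

# Three references to ONE inner list: a change through any slot shows in all.
shared = n * [[]]
shared[0].append("x")
print(shared)       # [['x'], ['x'], ['x']]

# Three distinct inner lists: slots are independent.
independent = [[] for _ in range(n)]
independent[0].append("x")
print(independent)  # [['x'], [], []]
```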

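Edit: comparison of the solutions.

For large arrays, using append is not optimal, since it may reallocate memory as the lists grow. Traversing the array a twice could therefore be faster.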
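
To make the two-pass idea concrete, here is a minimal sketch of a variant that is not among the posted answers: one pass counts how often each value occurs, and a second pass writes every index into a slot reserved in advance. The function name invert_two_pass is hypothetical, and whether it is actually faster is an assumption that would have to be measured.

```python
def invert_two_pass(a):
    """Two passes over a: count occurrences, then fill pre-sized slots."""
    # Pass 1: count how often each value occurs.
    counts = []
    for nums in a:
        for num in nums:
            if num >= len(counts):
                counts.extend([0] * (num + 1 - len(counts)))
            counts[num] += 1

    # Pre-size every inner list so that none of them has to grow later.
    b = [[None] * c for c in counts]
    filled = [0] * len(counts)  # next free position for each value

    # Pass 2: write each index into its reserved position.
    for i, nums in enumerate(a):
        for num in nums:
            b[num][filled[num]] = i
            filled[num] += 1
    return b


a = [[2], [0, 1, 3, 2], [1], [0, 3]]
print(invert_two_pass(a))  # [[1, 3], [1, 2], [0, 1], [1, 3]]
```

Unlike the solution in this answer, the sketch does not depend on every value appearing exactly twice.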

To test this, I created a function gen_list that generates a list according to the assumptions in the question. The code is as follows:

# This answer's solution
def solution1(a):
    from itertools import chain

    b = (max(chain(*a)) + 1)* [None]

    for i, lst in enumerate(a):
        for j in lst:
            if not b[j]:
                b[j] = [i, None]
            else:
                b[j][1] = i

    return b


# smarx's solution
def solution2(a):
    b = []

    for i, nums in enumerate(a):

        # For each number found at this index
        for num in nums:

            # If needed, extend b to cover the new needed range
            for _ in range(num + 1 - len(b)):
                b.append([])

            # Store the index
            b[num].append(i)

    return b


# Martijn Pieters's solution
def solution3(a):
    inverted = {}
    for index, numbers in enumerate(a):
        for number in numbers:
            inverted.setdefault(number, []).append(index)

    return [inverted.get(i, []) for i in range(max(inverted) + 1)]


# eugene y's solution
def solution4(a):
    b = []    
    for i, lst in enumerate(a):
        for j in lst:
            if j >= len(b):
                b += [[] for _ in range(j - len(b) + 1)]
            b[j].append(i)

    return b


def gen_list(n):
    from numpy.random import choice
    lst = []
    for _ in range(n):
        lst.append([])
    for i in range(n):
        lst[choice(n)].append(i)
        lst[choice(n)].append(i)
    return lst
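
Before timing, it may be worth confirming that the generated input really satisfies the "exactly twice" assumption. The sketch below uses a standard-library stand-in for gen_list (the names gen_list_stdlib and appears_exactly_twice are hypothetical, and random replaces numpy.random.choice only to keep the example dependency-free):

```python
import random
from collections import Counter
from itertools import chain


def gen_list_stdlib(n, seed=None):
    """Place each of the values 0..n-1 into two randomly chosen sublists."""
    rng = random.Random(seed)
    lst = [[] for _ in range(n)]
    for i in range(n):
        lst[rng.randrange(n)].append(i)
        lst[rng.randrange(n)].append(i)
    return lst


def appears_exactly_twice(a):
    """Return True if every value that occurs in a occurs exactly two times."""
    counts = Counter(chain.from_iterable(a))
    return all(c == 2 for c in counts.values())


a = gen_list_stdlib(10, seed=0)
print(appears_exactly_twice(a))  # True
```

Note that both copies of a value can land in the same sublist, so an inverted entry such as [4, 4] is possible with this generator.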

Then, test the speed of the solutions:

In [1]: a = gen_list(10)

In [2]: %timeit solution1(a)
The slowest run took 8.68 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 9.45 µs per loop

In [3]: %timeit solution2(a)
The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 14.5 µs per loop

In [4]: %timeit solution3(a)
100000 loops, best of 3: 12.2 µs per loop

In [5]: %timeit solution4(a)
The slowest run took 5.69 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 10.3 µs per loop

In [6]: a = gen_list(100)

In [7]: %timeit solution1(a)
10000 loops, best of 3: 70.5 µs per loop

In [8]: %timeit solution2(a)
10000 loops, best of 3: 135 µs per loop

In [9]: %timeit solution3(a)
The slowest run took 5.28 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 115 µs per loop

In [10]: %timeit solution4(a)
The slowest run took 6.75 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 76.6 µs per loop
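
The measurements above rely on IPython's %timeit. Outside IPython, a similar comparison could be sketched with the standard timeit module. The helper best_time and the stand-in function invert are hypothetical names, and the numbers will differ from machine to machine:

```python
import timeit


def invert(a):
    """Stand-in for any of the solutions above (list-extending variant)."""
    b = []
    for i, nums in enumerate(a):
        for num in nums:
            if num >= len(b):
                b.extend([] for _ in range(num + 1 - len(b)))
            b[num].append(i)
    return b


def best_time(func, data, repeat=5, number=1000):
    """Best per-call time in seconds over `repeat` batches of `number` calls."""
    timer = timeit.Timer(lambda: func(data))
    return min(timer.repeat(repeat=repeat, number=number)) / number


a = [[2], [0, 1, 3, 2], [1], [0, 3]]
print(f"{best_time(invert, a) * 1e6:.2f} µs per call")
```

Taking the minimum over several batches mirrors the "best of 3" figure that %timeit reports.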
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/38162862
