文章/答案/技术大牛

发布

社区首页 >问答首页 >快速消除大列表中的“循环复制”(python)

问快速消除大列表中的“循环复制”(python)
EN

Stack Overflow用户

提问于 2020-06-10 04:26:52

回答 2查看 107关注 0票数 0

我有这个(python)列表

my_list =[狗“，”猫“，”垫“，”乐趣“，”鲍勃“，”猫“，”潘“，”乐趣“，”狗“，”本“，”垫“，”老鼠“，

“猫”、“垫”、“有趣”、“狗”、“垫”、“乐趣”、“狗”、“猫”、“乐趣”、“狗”、“猫”、“垫”，

“老鼠”，“狗”，“本”，“垫子”，“狗”，“垫子”，“猫”，“乐趣”，……

]

my_list有200704个元素

请注意这里

my_list =‘狗’，‘猫’，‘垫’，‘乐趣’

狗->猫->垫->有趣->狗

my_list3 =“猫”，“垫”，“乐趣”，“狗”

猫->垫->有趣->狗->猫

my_list4 = 'mat'，‘乐趣’，‘狗’，‘猫’

席->乐趣->狗->猫->垫

my_list5 =‘乐趣’，‘狗’，‘猫’，‘垫子’

有趣->狗->猫->垫子->乐趣

环游世界，它们都一样。所以它们应该被标记为副本。

注意：

my_list =‘狗’，‘猫’，‘垫’，‘乐趣’

my_list7 =‘狗’，‘垫’，‘猫’，‘乐趣’

这些不应标记重复，因为它是循环的，它们是不同的。

类似地，

my_list2 =“狗”，“本”，“垫”，“老鼠”

my_list6 =“老鼠”，“狗”，“本”，“垫子”

它们应该被标记为副本。

def remove_circular_duplicates(my_list):
    # the quicker and more elegent logic here

    # the function should identify that my_list[0], my_list[3], my_list[4] and my_list[5] are circular duplicates
    # keep only my_list[0] and delete the rest 3
    # same for my_list[2] and my_list[6] and so on

    return (my_list_with_no_circular_duplicates)

我的尝试：

这是可行的，但，需要超过3个小时来完成200704元素。

这也不是一种优雅的方式。(请原谅我的级别)

t=my_list
tLen=len(t)
while i<tLen:
    c=c+1
    if c>2000:
        # this is just to keep you informed of the progress
        print(f'{i} of {tLen} finished ..')
        c=0
    if (finalT[i][4]=='unmarked'):
        # make 0-1-2-3 -> 1-2-3-0 and check any duplicates
        x0,x1,x2,x3 = t[i][1],t[i][2],t[i][3],t[i][0]
        # make 0-1-2-3 -> 2-3-0-1 and check any duplicates
        y0,y1,y2,y3 = t[i][2],t[i][3],t[i][0],t[i][1]
        # make 0-1-2-3 -> 3-0-1-2 and check any duplicates
        z0,z1,z2,z3 = t[i][3],t[i][0],t[i][1],t[i][2]
        while j<tLen:
            if (finalT[j][4]=='unmarked' and j!=i):
                #j!=i skips checking the same (self) element
                tString=t[j][0]+t[j][1]+t[j][2]+t[j][3]
                if (x0+x1+x2+x3 == tString) or (y0+y1+y2+y3 == tString) or (z0+z1+z2+z3 == tString):
                    # duplicate found, mark it as 'duplicate'
                    finalT[j][4]='duplicate'
                tString=''
            j=j+1
        finalT[i][4] = 'original'
        j=0
    i=i+1
# make list of only those marked as 'original'
i=0
ultimateT = []
while i<tLen:
    if finalT[i][4] == 'original':
        ultimateT.append(finalT[i])
    i=i+1
# strip the 'oritinal' mark and keep only the quad
i=0
ultimateTLen=len(ultimateT)
while i<ultimateTLen:
    ultimateT[i].remove('original')
    i=i+1
my_list_with_no_curcular_duplicates = ultimateT

print (f'\n\nDONE!!  \nStarted at: {start_time}\nEnded at {datetime.datetime.now()}')
return my_list_with_no_circular_duplicates

我想要的是一个更快的方法来做同样的事情。

Tnx提前。

list

duplicates

logic

circular-dependency

python

回答 2

Stack Overflow用户

回答已采纳

发布于 2020-06-10 04:55:30

您的实现是一个n平方算法，这意味着对于一个大数据集，实现时间将急剧增长。20万平方是一个很大的数字。您需要将其转换为阶n或n-log(n)算法。要做到这一点，您需要对数据进行预处理，这样您就可以检查列表中是否也有一个循环等价的项，而不必搜索列表。要做到这一点，就可以将每个条目放入一个表单中，在不需要遍历列表的情况下对它们进行比较。我建议您旋转每个条目，以便它首先具有字母顺序的第一项。例如，将“狗”、“猫”、“垫”、“乐趣”改为“猫”、“猫”、“有趣”、“狗”。这是一个order n操作，用于处理列表的每个元素一次。

然后，使用通用格式，您可以选择每个条目是否是唯一的。我会用一套。对于每个项目，检查项目是否在一个集合中，如果不是，它是唯一的，应该添加到集合中。如果项目已经在集合中，那么已经找到了一个等价的项，并且可以删除该项。检查一个项是否在一个集合中是Python中的一个固定时间操作。它通过使用哈希表来实现这一点，以便索引查找项，而不需要搜索。其结果是，这也是一个order操作，通过每个条目进行检查。总的来说，算法是n阶的，比你所做的要快得多。

票数 1

Stack Overflow用户

发布于 2020-06-10 12:17:48

@BradBudlong

布拉德·布隆的回答是对的。以下是相同的实现结果。

我的方法(在问题中)：

时间：~274分钟

结果: len(my_list_without_circular_duplicates) >> 50176

布拉德·布隆的方法：

时间：~12秒(好极了!)

结果: len(my_list_without_circular_duplicates) >> 50176

以下是Budlong方法的实现：

# extract all individual words like 'cat', 'rat', 'fun' and put in a list without duplicates 
all_non_duplicate_words_from_my_list = {.. the appropriate code here}
# and sort them alphabetically
alphabetically_sorted_words = sorted(all_non_duplicate_words_from_my_list)

# mark all as 'unsorted'
all_q_marked=[]
for i in my_list:
    all_q_marked.append([i,'unsorted'])

# format my_list- in Brad's words,
# rotate each entry so that it has the alphabetically first item first. 
# For example change ['dog','cat','mat','fun'] to ['cat','mat','fun','dog'] 
for w in alphabetically_sorted_words:
    print(f'{w} in progress ..')
    for q in all_q_marked:
        if q[1]=='unsorted':
            # check if the word exist in the quad
            if w in q[0]:
                # word exist, then rotate this quad to put that word in first place
                # rotation_count=q[0].index(w) -- alternate method lines
                quad=q[0]
                for j in range(4):
                    quad=quad[-1:] + quad[:-1]
                    if quad[0]==w:
                        q[0]=quad
                        break
                # mark as sorted
                q[1]='sorted'

# strip the 'sorted' mark and keep only the quad
i=0
formatted_my_list=[]
while i<len(all_q_marked):
    formatted_my_list.append(all_q_marked[i][0])
    i=i+1

# finally remove duplicate lists in the list
my_list_without_circular_duplicates = [list(t) for t in set(tuple(element) for element in formatted_my_list)]
print (my_list_without_circular_duplicates)

请注意，尽管它仍然使用整个alphabetically_sorted_words (200704)迭代和处理all_q_marked (201)，但随着all_q_marked中的元素被标记为“排序”，处理所需的时间也会成倍减少。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/62295863

复制

相似问题

问快速消除大列表中的“循环复制”(python)
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问快速消除大列表中的“循环复制”(python)EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问快速消除大列表中的“循环复制”(python)
EN