Quickly removing "circular duplicates" from a large list (Python)

def remove_circular_duplicates(my_list):
    # the quicker and more elegant logic here
    # the function should identify that my_list[0], my_list[3], my_list[4] and my_list[5] are circular duplicates
    # keep only my_list[0] and delete the other 3
    # same for my_list[2] and my_list[6], and so on
    return my_list_with_no_circular_duplicates

import datetime

def remove_circular_duplicates(my_list):
    # NOTE: the original post omitted the initial values of start_time, finalT,
    # i, j and c; the initializations below are assumed
    start_time = datetime.datetime.now()
    t = my_list
    tLen = len(t)
    finalT = [list(q) + ['unmarked'] for q in t]  # each quad plus a status mark at index 4
    i = j = c = 0
    while i < tLen:
        c = c + 1
        if c > 2000:
            # this is just to keep you informed of the progress
            print(f'{i} of {tLen} finished ..')
            c = 0
        if finalT[i][4] == 'unmarked':
            # make 0-1-2-3 -> 1-2-3-0 and check any duplicates
            x0, x1, x2, x3 = t[i][1], t[i][2], t[i][3], t[i][0]
            # make 0-1-2-3 -> 2-3-0-1 and check any duplicates
            y0, y1, y2, y3 = t[i][2], t[i][3], t[i][0], t[i][1]
            # make 0-1-2-3 -> 3-0-1-2 and check any duplicates
            z0, z1, z2, z3 = t[i][3], t[i][0], t[i][1], t[i][2]
            while j < tLen:
                if finalT[j][4] == 'unmarked' and j != i:  # j != i skips checking the same (self) element
                    tString = t[j][0] + t[j][1] + t[j][2] + t[j][3]
                    if (x0+x1+x2+x3 == tString) or (y0+y1+y2+y3 == tString) or (z0+z1+z2+z3 == tString):
                        # duplicate found, mark it as 'duplicate'
                        finalT[j][4] = 'duplicate'
                    tString = ''
                j = j + 1
            finalT[i][4] = 'original'
            j = 0
        i = i + 1
    # make list of only those marked as 'original'
    i = 0
    ultimateT = []
    while i < tLen:
        if finalT[i][4] == 'original':
            ultimateT.append(finalT[i])
        i = i + 1
    # strip the 'original' mark and keep only the quad
    i = 0
    ultimateTLen = len(ultimateT)
    while i < ultimateTLen:
        ultimateT[i].remove('original')
        i = i + 1
    my_list_with_no_circular_duplicates = ultimateT
    print(f'\n\nDONE!! \nStarted at: {start_time}\nEnded at {datetime.datetime.now()}')
    return my_list_with_no_circular_duplicates

2 answers

User

#1 · edited 2024-05-19 14:14:21

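@BradBudlong
Brad Budlong's answer is correct. Here are the results of implementing that approach.

My method (given in the question):
Time taken: ~274 minutes
Result: len(my_list_without_circular_duplicates) >> 50176

Brad Budlong's method:
Time taken: ~12 seconds (great!)
Result: len(my_list_without_circular_duplicates) >> 50176

Here is my implementation of Brad Budlong's method: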

# extract all individual words like 'cat', 'rat', 'fun' and put in a list without duplicates 
all_non_duplicate_words_from_my_list = {.. the appropriate code here}
# and sort them alphabetically
alphabetically_sorted_words = sorted(all_non_duplicate_words_from_my_list)

# mark all as 'unsorted'
all_q_marked=[]
for i in my_list:
    all_q_marked.append([i,'unsorted'])

# format my_list- in Brad's words,
# rotate each entry so that it has the alphabetically first item first. 
# For example change ['dog','cat','mat','fun'] to ['cat','mat','fun','dog'] 
for w in alphabetically_sorted_words:
    print(f'{w} in progress ..')
    for q in all_q_marked:
        if q[1]=='unsorted':
            # check whether the word exists in the quad
            if w in q[0]:
                # the word exists, so rotate this quad to put that word first
                # alternative: rotation_count = q[0].index(w)
                quad=q[0]
                for j in range(4):
                    quad=quad[-1:] + quad[:-1]
                    if quad[0]==w:
                        q[0]=quad
                        break
                # mark as sorted
                q[1]='sorted'

# strip the 'sorted' mark and keep only the quad
formatted_my_list = [q[0] for q in all_q_marked]

# finally remove duplicate lists in the list
my_list_without_circular_duplicates = [list(t) for t in set(tuple(element) for element in formatted_my_list)]
print (my_list_without_circular_duplicates)

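Note that although this still iterates over alphabetically_sorted_words (201 words) and scans all of all_q_marked (200704 entries) for each word, entries already marked 'sorted' are skipped, so the work per word drops sharply as more entries get marked.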
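
The pass over every word can probably be avoided by rotating each quad on its own, along the lines of the `index` alternative mentioned in the comment above. A minimal sketch, assuming every quad is a list of strings and its smallest word occurs only once; `rotate_to_smallest` and the sample data are hypothetical, not from the original post:

```python
def rotate_to_smallest(quad):
    """Rotate quad so that its alphabetically smallest word comes first."""
    k = quad.index(min(quad))  # position of the smallest word (first occurrence)
    return quad[k:] + quad[:k]


# hypothetical sample data: the second quad is a rotation of the first
sample = [['dog', 'cat', 'mat', 'fun'], ['fun', 'dog', 'cat', 'mat']]
formatted = [rotate_to_smallest(q) for q in sample]
print(formatted)
```

Each quad is touched once, so no 'sorted'/'unsorted' marks and no separate word list are needed for this step.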

User

#2 · edited 2024-05-19 14:14:21

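Your implementation is an n-squared algorithm, which means the running time grows dramatically for a large data set. 200,000 squared is a very large number. You need to convert this to an order n or n log(n) algorithm. To do that, you need to preprocess the data so that you can check whether a circularly equivalent item is already in the list without searching through the list. That means putting each entry into a form in which entries can be compared without iterating over the list. I would recommend rotating each entry so that its alphabetically first item comes first. For example, change ['dog','cat','mat','fun'] to ['cat','mat','fun','dog']. This is an order n operation, since it processes each element of the list once.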
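
One way to sketch this common form, under the assumption that each entry is a non-empty list of strings: take the lexicographically smallest of all rotations. This puts the alphabetically first word first and, as an addition not spelled out in the answer, also gives a consistent result when that word appears more than once in an entry. The name `canonical_rotation` is hypothetical.

```python
def canonical_rotation(quad):
    """Return the lexicographically smallest rotation of quad as a tuple."""
    n = len(quad)
    # build every rotation and keep the smallest one
    return min(tuple(quad[i:] + quad[:i]) for i in range(n))


print(canonical_rotation(['dog', 'cat', 'mat', 'fun']))
# a repeated smallest word still maps both rotations to the same form
print(canonical_rotation(['b', 'a', 'c', 'a']) == canonical_rotation(['a', 'b', 'a', 'c']))
```

Returning a tuple rather than a list keeps the result hashable, so it can later be stored in a set.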

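Then, with all entries in a common format, you have several options for determining whether each entry is unique. I would use a set. For each item, check whether it is in the set; if it is not, it is unique and should be added to the set. If the item is already in the set, an equivalent item has already been found and this one can be removed. Checking whether an item is in a set is a constant-time operation (on average) in Python. It uses a hash table to index directly to the item instead of searching for it. As a result, going through and checking every entry is an order n operation as well. Overall the algorithm is order n and will be dramatically faster than what you were doing.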
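
The set-based check described above might look like the following sketch. It assumes every entry is a non-empty list of strings, keeps the first occurrence of each group in its original form, and uses the smallest rotation as the common format (one possible choice, not necessarily the one the asker used). The sample data is hypothetical.

```python
def remove_circular_duplicates(my_list):
    """Keep the first entry of every group of circularly equivalent entries."""
    seen = set()   # canonical forms encountered so far
    unique = []
    for quad in my_list:
        # common format: the lexicographically smallest rotation, as a hashable tuple
        key = min(tuple(quad[i:] + quad[:i]) for i in range(len(quad)))
        if key not in seen:          # hash lookup, no scan of the list
            seen.add(key)
            unique.append(quad)
    return unique


# hypothetical sample data: entries 2 and 3 are rotations of entry 0
sample = [
    ['dog', 'cat', 'mat', 'fun'],
    ['rat', 'bat', 'hat', 'sat'],
    ['fun', 'dog', 'cat', 'mat'],
    ['cat', 'mat', 'fun', 'dog'],
]
print(remove_circular_duplicates(sample))
```

Unlike building a set of tuples and converting back to lists, this variant preserves the input order of the surviving entries.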
