Error when dropping rows by index inside a nested loop

import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.
with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas(pd) data frame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content. Note, these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect dropping of duplicates.

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747:  # Trying to prevent an overstep keyError here
                b = df['text'][indd+1]
                if similar(a,b) >= 0.80:
                    df.drop((indd+1), inplace=True)
        print(str(ind) + "Completed")  # Debugging statement, tells us which iterations have completed

duplicates(df)

1 Answer

User

Answer 1 · Posted on 2024-06-23 19:41:41

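One solution, suggested by @KazuyaHatta, is itertools.combinations(). However, the way I am using it (there may be a better way) is O(n^2). In this case, with roughly 27,000 tuples, there are close to 357,714,378 combinations to iterate over, which takes far too long.

The code is as follows: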

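import itertools
from difflib import SequenceMatcher

def similar(a, b):
    # Assumed helper (not shown in the original post): similarity ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

# Create a set of the dropped tuples and run this code on bizon overnight.
def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()  # index labels of rows judged to be near-duplicates
    for i, j in itertools.combinations(df.index, 2):
        # Skip pairs that involve a row already marked for dropping.
        if i in excludes or j in excludes:
            continue
        if similar(df['text'][i], df['text'][j]) > 0.8:
            excludes.add(j)  # keep the earlier row, mark the later one
            print("Dropped: " + str((i, j)))
            print(len(excludes))
    return excludes

duplicates(df)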
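
To see where the pair count quoted above comes from, here is a small sketch. The row count of 26,748 is an inference, not something stated outright in the thread: it assumes index labels 0 through 26747, as the guard in the question's code suggests.

```python
import itertools
import math

# Assumed row count: index labels 0..26747, inferred from the question's guard.
n_rows = 26748

# Number of unordered pairs: n * (n - 1) / 2.
n_pairs = math.comb(n_rows, 2)
print(n_pairs)

# On a small index, combinations() yields each unordered pair exactly once,
# so (i, j) and (j, i) never both appear.
small = list(itertools.combinations(range(4), 2))
print(small)
```

Because each unordered pair is produced only once, there should be no need to record a pair in both orders; tracking the index of the row to drop is enough.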

My next step, as @KazuyaHatta described, is to try doing the drop with a mask.
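
A minimal sketch of what a mask-based drop might look like, assuming the same similarity test as above. This is not the approach from the thread itself: the function name `drop_near_duplicates`, the `similar` helper, and the three-row sample frame are all hypothetical, made up purely for illustration. The idea is to mark rows in a boolean array while iterating and filter once at the end, so the frame is never modified mid-loop (which appears to be what leads to the KeyError in the question).

```python
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

def similar(a, b):
    # Assumed helper: SequenceMatcher similarity ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def drop_near_duplicates(df, threshold=0.8):
    # Hypothetical function: keeps the first row of each near-duplicate group.
    texts = df['text'].tolist()
    keep = np.ones(len(texts), dtype=bool)  # True means the row survives
    for i in range(len(texts)):
        if not keep[i]:
            continue  # row already marked as a duplicate of an earlier one
        for j in range(i + 1, len(texts)):
            if keep[j] and similar(texts[i], texts[j]) > threshold:
                keep[j] = False
    # Filter once, after the loops, instead of dropping rows during iteration.
    return df.loc[keep].reset_index(drop=True)

# Toy data, invented for this example only.
sample = pd.DataFrame({'text': [
    'the quick brown fox jumps over the lazy dog',
    'the quick brown fox jumps over the lazy dog!',
    'a completely different sentence about esports',
]})
result = drop_near_duplicates(sample)
print(result)
```

This is still O(n^2) in the worst case, but rows already marked as duplicates are skipped, so the number of comparisons shrinks as duplicates are found.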

Note: unfortunately, I cannot post a sample of the dataset.
