I'm trying to remove duplicates from a list of dictionaries, but only based on duplicate text values.
For example, I want to remove the duplicates from this list of tweets:
{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://example.com/EcSHCAm9Nn', 'id': 634092907243393024L}
{'text': 'RT Iran deal opponents now have their "death panels" lie, and it\'s a whopper http://example.com/ntECOXorvK via @voxdotcom #IranDeal', 'id': 634068454207791104L}
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://example.com/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers …', 'id': 633631425279991812L}
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://example.com/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers …', 'id': 633495091584323584L}
{'text': "RT : Iran Deal's Surprising Supporters: https://example.com/pUG7vht0fE catoletters: Iran Deal's Surprising Supporters: http://example.com/dhdylTNgoG", 'id': 633083989180448768L}
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://example.com/PVHuVTyuAG RonPaul: Iran Deal'… https://example.com/sTBhL12llF", 'id': 632525323733729280L}
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://example.com/PVHuVTyuAG RonPaul: Iran Deal'… https://example.com/sTBhL12llF", 'id': 632385798277595137L}
{'text': "RT : Iran Deal's Surprising Supporters: https://example.com/hOUCmreHKA catoletters: Iran Deal's Surprising Supporters: http://example.com/bJSLhd9dqA", 'id': 632370745088323584L}
{'text': '#News #RT Iran deal debate devolves into clash over Jewish stereotypes and survival - W... http://example.com/foU0Sz6Jej http://example.com/WvcaNkMcu3', 'id': 631952088981868544L}
{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}
so that I end up with this:
{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://example.com/EcSHCAm9Nn', 'id': 634092907243393024L}
{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}
So far, the answers I've found are mostly based on "plain" dictionaries, where the duplicate keys/values are identical. In my case it's a list of dictionaries: because of retweets, the text values are identical, but the corresponding tweet ids are different.
Here is the complete code. Any hints on writing the tweets to a CSV file in a more efficient way (one that makes removing the duplicates easier) are more than welcome.
import csv
import codecs
from TwitterSearch import TwitterSearchOrder, TwitterUserOrder, TwitterSearchException, TwitterSearch

tweet_text_id = []

try:
    tso = TwitterSearchOrder()
    tso.set_keywords(["Iran Deal"])
    tso.set_language('en')
    tso.set_include_entities(False)

    ts = TwitterSearch(
        consumer_key="aaaaa",
        consumer_secret="bbbbb",
        access_token="cccc",
        access_token_secret="dddd"
    )

    for tweet in ts.search_tweets_iterable(tso):
        tweet_text_id.append({'id': tweet['id'], 'text': tweet['text'].encode('utf8')})

    fieldnames = ['id', 'text']
    tweet_file = open('tweets.csv', 'wb')
    csvwriter = csv.DictWriter(tweet_file, delimiter=',', fieldnames=fieldnames)
    csvwriter.writerow(dict((fn, fn) for fn in fieldnames))
    for row in tweet_text_id:
        csvwriter.writerow(row)
    tweet_file.close()
except TwitterSearchException as e:
    print(e)
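Since the duplicates share an identical text value, one way to drop them before writing the CSV is to keep a set of texts already seen and only keep the first tweet for each text. A minimal sketch (the function name and sample data are illustrative, not from the original code):

```python
def dedupe_by_text(tweets):
    """Keep only the first tweet seen for each distinct 'text' value."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet['text'] not in seen:
            seen.add(tweet['text'])
            unique.append(tweet)
    return unique

tweets = [
    {'id': 1, 'text': 'Iran deal quietly picks up some GOP backers'},
    {'id': 2, 'text': 'Iran deal quietly picks up some GOP backers'},
    {'id': 3, 'text': 'Dear Conservatives: comprehend, if you can'},
]
print(dedupe_by_text(tweets))  # keeps ids 1 and 3; id 2 is an exact text duplicate
```

Note that this only removes exact duplicates; retweets that differ by a shortened URL (as in the sample data above) still need a fuzzy comparison.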
I made a module that filters out duplicate instances and also removes hashtags along the way.
Just import it in your script and run it to filter the tweets!
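The module itself isn't shown here, so the following is only a hypothetical sketch of what such a filter might do: strip hashtags with a regex, then drop tweets whose cleaned text has already been seen. The name `filter_tweets` and the sample data are assumptions for illustration.

```python
import re

def filter_tweets(tweets):
    """Strip hashtags from each tweet's text, then drop exact duplicates."""
    seen = set()
    result = []
    for tweet in tweets:
        # remove hashtags such as "#IranDeal" (and any trailing whitespace)
        cleaned = re.sub(r'#\w+\s*', '', tweet['text']).strip()
        if cleaned not in seen:
            seen.add(cleaned)
            result.append({'id': tweet['id'], 'text': cleaned})
    return result

tweets = [
    {'id': 1, 'text': 'Iran deal #IranDeal'},
    {'id': 2, 'text': 'Iran deal'},  # duplicate once the hashtag above is stripped
]
print(filter_tweets(tweets))  # only id 1 survives, with text 'Iran deal'
```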
You could try comparing tweets by the "edit distance" between them. Below is my hack that uses fuzzywuzzy [1] to compare the tweets:
输出:
[1]https://github.com/seatgeek/fuzzywuzzy
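The original fuzzywuzzy snippet and its output aren't reproduced above. As a comparable sketch using only the standard library, `difflib.SequenceMatcher` computes a similarity ratio much like fuzzywuzzy's `fuzz.ratio`; the helper names, threshold value, and sample data below are assumptions:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two strings on a 0-100 scale, like fuzz.ratio."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

def dedupe_fuzzy(tweets, threshold=90):
    """Keep a tweet only if its text is not too similar to any already-kept text."""
    unique = []
    for tweet in tweets:
        if all(similarity(tweet['text'], kept['text']) < threshold for kept in unique):
            unique.append(tweet)
    return unique

tweets = [
    {'id': 1, 'text': 'Iran deal quietly picks up some GOP backers via https://x/aaaa'},
    {'id': 2, 'text': 'Iran deal quietly picks up some GOP backers via https://x/bbbb'},
    {'id': 3, 'text': 'Dear Conservatives: comprehend, if you can'},
]
print(dedupe_fuzzy(tweets))  # ids 1 and 3: id 2 differs only in the shortened URL
```

This catches retweets that differ only in their shortened URLs, which an exact-match set would miss; the right threshold depends on your data.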