<p>You could try comparing tweets by the "edit distance" between them. Here is my quick hack that uses fuzzywuzzy [1] to do that:</p>
<pre><code>from fuzzywuzzy import fuzz


def clean_tweet(tweet):
    """very crude. You can improve on this!"""
    # Note: this modifies the tweet dict in place.
    tweet['text'] = tweet['text'].replace("RT :", "")
    return tweet


def is_unique(tweet, seen_tweets, threshold):
    # A tweet counts as a duplicate if it scores above the threshold (0-100)
    # against any tweet that has already been kept.
    for seen_tweet in seen_tweets:
        ratio = fuzz.ratio(tweet['text'], seen_tweet['text'])
        if ratio > threshold:
            return False
    return True


def dedup(tweets, threshold=50):
    deduped = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        if is_unique(cleaned, deduped, threshold):
            deduped.append(cleaned)
    return deduped


if __name__ == "__main__":
    DUP_THRESHOLD = 30
    tweets = [
        {'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024},
        {'text': 'RT Iran deal opponents now have their "death panels" lie, and it\'s a whopper http://t.co/ntECOXorvK via @voxdotcom #IranDeal', 'id': 634068454207791104},
        {'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812},
        {'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633495091584323584},
        {'text': "RT : Iran Deal's Surprising Supporters: https://t.co/pUG7vht0fE catoletters: Iran Deal's Surprising Supporters: http://t.co/dhdylTNgoG", 'id': 633083989180448768},
        {'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632525323733729280},
        {'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632385798277595137},
        {'text': "RT : Iran Deal's Surprising Supporters: https://t.co/hOUCmreHKA catoletters: Iran Deal's Surprising Supporters: http://t.co/bJSLhd9dqA", 'id': 632370745088323584},
        {'text': '#News #RT Iran deal debate devolves into clash over Jewish stereotypes and survival - W... http://t.co/foU0Sz6Jej http://t.co/WvcaNkMcu3', 'id': 631952088981868544},
        {'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184},
    ]
    deduped = dedup(tweets, threshold=DUP_THRESHOLD)
    print(deduped)
</code></pre>
<p>Output:</p>
<pre><code>[
{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024L},
{'text': ' Iran deal quietly picks up some GOP backers via https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812L}
]
</code></pre>
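<p>The docstring in <code>clean_tweet</code> invites improvement, so here is one possible direction, offered as a rough sketch rather than a tested recipe. Retweets of the same text often differ only in their shortened t.co link, which lowers the similarity score. The helper names <code>normalize</code> and <code>similarity</code>, both regular expressions, and the two trimmed sample strings are my own assumptions. To keep the snippet dependency-free, <code>similarity</code> uses the standard library's <code>difflib.SequenceMatcher</code> as a stand-in for <code>fuzz.ratio</code>; the two may not return identical scores.</p>

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns: anything that looks like a link, and a bare "RT" marker.
URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"\bRT\b\s*:?")


def normalize(text):
    """Drop URLs and retweet markers, then collapse whitespace and lowercase."""
    text = URL_RE.sub("", text)
    text = RT_RE.sub("", text)
    return " ".join(text.split()).lower()


def similarity(a, b):
    """0-100 score from the stdlib; a stand-in for fuzz.ratio, not the same function."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Two of the sample tweets (trailing ellipsis bytes trimmed); only the link differs.
a = ("RT : Iran deal quietly picks up some GOP backers via "
     "https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers")
b = ("RT : Iran deal quietly picks up some GOP backers via "
     "https://t.co/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers")

raw_score = similarity(a, b)
clean_score = similarity(normalize(a), normalize(b))
print(raw_score, clean_score)
```

<p>After normalization the two texts are identical, so the cleaned score reaches 100 while the raw score stays below it. With cleaning like this you can probably afford a much higher threshold than 30, which should cut down on unrelated tweets being dropped as duplicates.</p>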
<p>[1]<a href="https://github.com/seatgeek/fuzzywuzzy" rel="nofollow">https://github.com/seatgeek/fuzzywuzzy</a></p>