How to remove hashtags, @user mentions, and links from a tweet using regular expressions

Posted 2024-07-05 14:22:34

I need to preprocess tweets using Python. What regular expressions would remove, respectively, all hashtags, @user mentions, and links from a tweet?

For example:

  1. Original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
    • Processed tweet: I really love that shirt at Macy
  2. Original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
    • Processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
  3. Original tweet: I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
    • Processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn

I only need the meaningful words in each tweet. I don't need the usernames, the links, or any punctuation.


3 answers

This will work for your examples. If your tweets contain links in the middle of the text, it will fail.

result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)

Edit:

It also works with links inside the tweet, as long as they are separated by spaces.
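
As a minimal sketch of how the one-liner above behaves, the snippet below wraps it in a hypothetical helper, `strip_entities` (the name and the whitespace cleanup are additions for illustration, not part of the answer). Note that `#\S*` removes the whole hashtag token, so `#Macy.` disappears rather than becoming `Macy`, and other punctuation is left in place.

```python
import re

# Pattern from the answer above: @mentions, #hashtags, and tokens that start
# with "http" and are followed somewhere later on the line by "://".
ENTITY_RE = re.compile(r"(?:@\S*|#\S*|http(?=.*://)\S*)")

def strip_entities(tweet):
    """Hypothetical helper: drop matched entities, then collapse whitespace."""
    return " ".join(ENTITY_RE.sub("", tweet).split())

print(strip_entities("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
# I really love that shirt at
print(strip_entities("I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn)"))
# I am at Starbucks (7419 3rd ave, at 75th, Brooklyn)
```

If the hashtag word or punctuation-free output is needed, this pattern alone is not enough.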

Just use an API, though. Why reinvent the wheel?

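The following example is a close approximation. Unfortunately, there is no right way to do this with regular expressions alone. The regex below simply strips URLs (not just http ones), punctuation, usernames, and any other non-alphanumeric characters. It also separates the words with a single space. If you want to parse tweets the way you intend, you need more intelligence in the system: some kind of predictive, self-learning algorithm, given that there is no standard tweet feed format.

Here is what I am proposing.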

' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())

Here are the results for your examples:

>>> import re
>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>> 
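
To make the three alternatives in that pattern easier to read, here is a sketch that spells the same regex out with `re.VERBOSE`. The names `TWEET_NOISE` and `clean_tweet` are hypothetical, chosen only for this illustration; the pattern itself is the one from the answer (with `\/` written as `/`, which matches the same text).

```python
import re

# Same three alternatives as the one-liner, written in verbose mode.
# Unescaped whitespace in the pattern is ignored in this mode, but the
# space inside the character class is still significant.
TWEET_NOISE = re.compile(
    r"""
      (@[A-Za-z0-9]+)      # an @username made of ASCII letters and digits
    | ([^0-9A-Za-z \t])    # any single character that is not alphanumeric, space, or tab
    | (\w+://\S+)          # a URL with any scheme, up to the next whitespace
    """,
    re.VERBOSE,
)

def clean_tweet(tweet):
    """Hypothetical wrapper: replace noise with spaces, then squeeze whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", tweet).split())

print(clean_tweet("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
# I really love that shirt at Macy
print(clean_tweet("@some_user hi"))
# user hi
```

The second call shows one limitation that the breakdown makes visible: the username class has no underscore, so `@some_user` leaves `user` behind.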

Here are some examples where it is not perfect:

>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # After adding # to the first alternative, the hashtag word is dropped too, which is a bit better
>>> ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> # See how miserably it performed?
>>> 

A bit late, but this solution prevents punctuation mistakes such as #hashtag1,#hashtag2 (with no space between them), and the implementation is very simple.

import re
import string

def strip_links(text):
    # Match http/https URLs; the scheme may be followed by "//" or by backslashes.
    link_regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        # link[0] is the outermost group, i.e. the full URL
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    entity_prefixes = ['@', '#']
    # Turn every punctuation character except '@' and '#' into a space,
    # so that "#hashtag1,#hashtag2" splits into two separate tokens.
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    # Keep only the words that do not start with '@' or '#'.
    words = []
    for word in text.split():
        if word[0] not in entity_prefixes:
            words.append(word)
    return ' '.join(words)


tests = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]
for t in tests:
    print(strip_all_entities(strip_links(t)))

# Output:
# I really love that shirt at
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
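
The output above drops `Macy` along with its `#`, whereas the expected result in the question keeps the word. As a rough sketch of one way to choose between the two behaviors, the hypothetical function below (`strip_keep_hashtag_words` and its `drop_prefixes` parameter are names invented here) uses the same split-on-punctuation idea. It assumes a simplified URL pattern, `https?://\S+`, rather than the regex from the answer.

```python
import re
import string

# Simplified URL pattern for this sketch (an assumption, not the answer's regex).
URL_RE = re.compile(r"https?://\S+")

def strip_keep_hashtag_words(text, drop_prefixes=("@",)):
    """Hypothetical variant: drop URLs and words starting with a drop prefix.

    With the default prefixes, '#' is treated as ordinary punctuation,
    so the word after it is kept.
    """
    text = URL_RE.sub(" ", text)
    for ch in string.punctuation:
        if ch not in drop_prefixes:
            text = text.replace(ch, " ")
    return " ".join(w for w in text.split() if w[0] not in drop_prefixes)

print(strip_keep_hashtag_words("@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"))
# I really love that shirt at Macy
print(strip_keep_hashtag_words("#hashtag1,#hashtag2 nice day", drop_prefixes=("@", "#")))
# nice day
```

The second call reproduces the answer's behavior for hashtags written without a space between them: the comma becomes a space first, so both tokens are recognized and removed.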
