How can I extract all shortened URLs from a tweet?

Posted 2024-09-30 14:17:46

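I want to extract the shortened URLs from tweets. These URLs follow a standard format: http://t.co (details here)

To do this I used the regex below. It works fine when I test it against tweet text that is simply stored as a string.

Note: I am using https://shortnedurl/string instead of a real shortened URL, because StackOverflow does not allow such URLs to be posted here.

Sample code: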

import re

tweet = "Grim discovery in the USS McCain collision probe https://shortnedurl.com @MattRiversCNN reports #TheLead"

# Raw string so the backslashes reach the regex engine unchanged
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                  tweet)
for url in urls:
    print("printing urls", url)

Output of this code:

printing urls https://shortnedurl.com

However, when I read the tweet from Twitter through the Twitter API and run the same regex on it, I get the following unwanted output.

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string</a></span>
printing urls https://twitter.com/MattRiversCNN
printing urls https://twitter.com/search?q=%23TheLead

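It looks as if the regex is also picking up parts of the surrounding HTML tags.

How can I handle this? I only want to read the http://t.co URLs.

Update 1: I tried https?://t.co/\S*, but I still get the following noisy URLs: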

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

I don't know why the same URL is found a second time, with a trailing </a></span>.

With https?://t.co/\S+ I get an invalid URL, because it merges the two URLs above into one:

printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

Update 2: The tweet text looks different from what I expected:

    Grim discovery in the USS McCain collision probe 
<span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span> <span class="username"><a 
href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>
     reports <span class="tag"><a href="https://twitter.com/search?
    q=%23TheLead">#TheLead</a></span>
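
A likely explanation for the noisy matches, offered as a sketch rather than a confirmed diagnosis: inside the character class `[$-_@.&+]`, the sequence `$-_` is read as a range from `$` to `_`, and that range happens to include `<`, `>` and `/`. The `html` string below is a hypothetical fragment modelled on the markup above.

```python
import re

# The character class from the question's pattern. Inside [...], "$-_" is a
# range from "$" (0x24) to "_" (0x5F), not three separate characters.
char_class = re.compile(r'[$-_@.&+]')

# Show which characters the class accepts.
for ch in ['<', '>', '/', '"', ' ']:
    print(repr(ch), bool(char_class.fullmatch(ch)))

# Hypothetical fragment modelled on the markup shown above.
html = '<a href="https://shortenedurl">https://shortenedurl</a></span> <span'
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
matches = re.findall(pattern, html)
print(matches)
```

The first match stops at the closing double quote of the `href` attribute, which is outside the range. The second match runs through `</a></span>` and only stops at the next space.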


2 Answers

Answer 1 · edited 2024-09-30 14:17:46

If I understand you correctly, just put the string you want to match into the regular expression, like this:

https?://shortnedurl\.com/\S*
# match http:// or https://
# shortnedurl.com/ literally (the dot is escaped)
# followed by zero or more non-whitespace characters

See a demo on regex101.com.
For your particular case:
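
A minimal sketch of how that pattern could be applied to the t.co case in Python. This is an illustration, not the answerer's original snippet; the sample tweet text and the t.co path `AbC123xyz9` are invented.

```python
import re

# Hypothetical plain-text tweet; the t.co path is invented for illustration.
tweet = ("Grim discovery in the USS McCain collision probe "
         "https://t.co/AbC123xyz9 @MattRiversCNN reports #TheLead")

# Escape the dot so it matches a literal "." rather than any character.
tco_pattern = re.compile(r'https?://t\.co/\S*')

urls = tco_pattern.findall(tweet)
for url in urls:
    print("printing urls", url)
```

This works as long as the input is plain text where each URL is followed by whitespace or the end of the string.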

Answer 2 · edited 2024-09-30 14:17:46

You can use the regular expression

https?://t\.co/\S+
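
When the tweet text arrives as HTML, as in Update 2 of the question, `\S+` can run past the URL into the markup because no whitespace separates them. The sketch below shows two possible workarounds, under the assumption that a t.co path consists only of letters and digits. The `html` string, the path `AbC123xyz9`, and the class name `TcoLinkParser` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical rendered tweet modelled on the markup in Update 2.
html = ('Grim discovery in the USS McCain collision probe '
        '<span class="link"><a href="https://t.co/AbC123xyz9">'
        'https://t.co/AbC123xyz9</a></span> '
        '<span class="username"><a href="https://twitter.com/MattRiversCNN">'
        '@MattRiversCNN</a></span> reports')

# Option 1: allow only letters and digits after the slash (an assumption
# about t.co paths), then drop duplicates while keeping first-seen order.
found = re.findall(r'https?://t\.co/[A-Za-z0-9]+', html)
unique_urls = list(dict.fromkeys(found))


# Option 2: parse the markup and keep only href values on the t.co host.
class TcoLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and re.match(r'https?://t\.co/', value):
                self.links.append(value)


parser = TcoLinkParser()
parser.feed(html)
parser.close()

print(unique_urls)
print(parser.links)
```

Option 1 is shorter but depends on the assumed path alphabet. Option 2 ignores the link text entirely and reads only the `href` attributes, so it does not care what characters follow the URL in the markup.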
