Using BeautifulSoup to extract data from inside the title tag?

User

Reply 1 · edited 2024-10-06 12:06:43

After parsing the HTML:

data = BeautifulSoup(h,"html.parser")

Find the title as follows:

title = data.title.string

Then find the two quotation marks in the title. There are many ways to do this; I would use a regular expression:

import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))

You never need to search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them into the actual characters they represent.
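
A minimal sketch of that point, assuming a shortened, made-up title rather than the real page. The `html` string is hypothetical; only the `BeautifulSoup` constructor and `.title.string` come from the replies here.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup; the entity &quot; stands for a double quote.
html = "<title>Someone on Twitter: &quot;Hello, world&quot;</title>"

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string

# The parser has already decoded the entities, so the string holds real quotes.
print(title)
print("&quot;" in title)  # False
print('"' in title)       # True
```

This is why the regular expression above looks for a literal `"` rather than the entity.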

Edit:

A regex that does not capture the quotation marks would be:

re.search(r'(?<=").*(?=")', title)
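
To compare the two patterns side by side, here is a small sketch on a made-up title string (the `title` value is an assumption, not the real page title). The lookbehind `(?<=")` and lookahead `(?=")` require a quote on each side without including it in the match.

```python
import re

title = 'Someone on Twitter: "Hello, world"'  # hypothetical title text

with_quotes = re.search(r'".*"', title)
without_quotes = re.search(r'(?<=").*(?=")', title)

print(with_quotes.group(0))     # "Hello, world"
print(without_quotes.group(0))  # Hello, world
```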

User

Reply 2 · edited 2024-10-06 12:06:43

Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:

import re
import urllib.request  # a bare "import urllib" does not load the request submodule
from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"

r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
quote = re.match(r'^.*\"(.*)\"', title)
print(quote.group(1))

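What happens here is that, after fetching the page source and finding the title, we apply a regular expression to the title to extract the text inside the quotes.

We tell the regular expression to match any number of characters from the start of the string (^.*) up to the opening quote (\"), and then to capture the text between it and the closing quote (the second \").

Then we print the captured text by telling Python to print the first captured group (the part between the parentheses in the regex).

More information on regex match objects in Python: https://docs.python.org/3/library/re.html#match-objects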
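
One caveat with this approach is that `re.match` returns `None` when the title contains no quoted text, so calling `.group(1)` directly would raise an `AttributeError`. The sketch below wraps the same pattern in a helper; the name `extract_quoted` and the sample strings are hypothetical, and no network request is made.

```python
import re

def extract_quoted(title):
    """Return the text between the last pair of double quotes, or None."""
    # ^.* is greedy, so it consumes as much as it can before the opening quote;
    # the group then captures whatever sits between the final two quotes.
    match = re.match(r'^.*"(.*)"', title)
    return match.group(1) if match else None

print(extract_quoted('Someone on Twitter: "Hello, world"'))  # Hello, world
print(extract_quoted('No quotes here'))                      # None
```

Because of the greedy prefix, a tweet that itself contains double quotes would yield only the text between its last two quotes, so this pattern suits titles with exactly one quoted span.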

User

Reply 3 · edited 2024-10-06 12:06:43

Just split the text on the colon:

In [1]:  h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>"""

In [2]: from bs4 import BeautifulSoup

In [3]: soup  = BeautifulSoup(h, "lxml")

In [4]: print(soup.title.text.split(": ", 1)[1])
 "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

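If the surrounding quotes are not wanted, the split result can be stripped as well. This is a small sketch on a made-up title (the `title` value is an assumption); it also shows why `maxsplit=1` matters when the tweet itself contains a colon.

```python
title = 'Someone on Twitter: "Note: hello"'  # hypothetical title text

# maxsplit=1 splits only at the first ": ", so colons inside the tweet survive.
quoted = title.split(": ", 1)[1]
print(quoted)             # "Note: hello"
print(quoted.strip('"'))  # Note: hello
```
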
Actually, looking at the page, you do not need to split at all: the text is in the p tag inside div.js-tweet-text-container:

^{pr2}$

So you can get the same result either way.
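
For illustration, selecting that p tag might look like the sketch below. The markup is a hypothetical, heavily simplified stand-in for the real page, whose structure may differ; `select_one` is Beautiful Soup's CSS-selector lookup that returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; the real page is far larger
# and its structure may have changed since the answer was written.
html = """
<div class="js-tweet-text-container">
  <p>Hello, world</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selector: the first <p> inside a div with class js-tweet-text-container.
tweet = soup.select_one("div.js-tweet-text-container p")
print(tweet.get_text(strip=True) if tweet else None)  # Hello, world
```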
