
Extracting the text inside a tag from an HTML document

Posted 2024-09-29 23:28:04


I have an HTML document like this one: https://dropmefiles.com/wezmb. I need to extract the text between <span id="1"> and </span>, but I don't know how. This is the code I have tried so far:

from bs4 import BeautifulSoup

with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
    for a in soup.find_all('span'):
        print(a.string)

But this prints the text of every span tag in the document. How can I extract only the text between <span id="1"> and </span> in Python?


1 answer

User

#1 · Posted 2024-09-29 23:28:04

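What you need is the .contents attribute, which is described in the Beautiful Soup documentation. It holds the list of a tag's direct children.

Use the following to find the span <span id="1"> ... </span> and print its children: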

for x in soup.find(id="1").contents:
    print(x)

Or

x = soup.find(id="1").contents[0]  # find() returns the first match, and only one element has id 1; [0] is its first child
print(x)

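Either version prints an empty line, then 10, then another empty line. That is because the string really is laid out that way in the HTML: as you can see in the file, 10 sits on a line of its own inside the span.
The exact string is therefore '\n10\n'.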
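
The behaviour described above can be reproduced without the original file. The following is only a minimal sketch: the `html` string is a hypothetical stand-in for 10_01.htm (the real markup is not shown in the post), and it assumes the span with id 1 holds nothing but the text 10 on its own line.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for 10_01.htm; the real file's markup is an assumption.
html = """<html><body>
<span id="1">
10
</span>
<span id="2">
20
</span>
</body></html>"""

soup = BeautifulSoup(html, features="html.parser")

span = soup.find(id="1")   # first element whose id attribute is "1"
children = span.contents   # list of the span's direct children
first_text = children[0]   # in this sample: the span's only child, a text node

# repr() makes the surrounding newlines visible instead of printing blank lines.
print(repr(str(first_text)))
```

Printing with repr() is just a debugging convenience here; it shows the newlines explicitly, which makes it easier to see why a plain print() produces blank lines around the number.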

If you only want x = '10' instead of x = '\n10\n', you can do x = x[1:-1], because '\n' is a single character. Hope this helps.
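
As a possible alternative to slicing, str.strip() removes leading and trailing whitespace regardless of how many characters there are. The sketch below compares the two on the string from this answer and on a hypothetical variant with Windows line endings, which is an assumption and not something taken from the original file.

```python
x = "\n10\n"             # the string from the answer above

sliced = x[1:-1]         # drops exactly one character from each end
stripped = x.strip()     # drops all leading and trailing whitespace

# Hypothetical variant: the same text with Windows line endings (\r\n).
y = "\r\n10\r\n"
y_sliced = y[1:-1]       # leaves a stray '\n' in front and '\r' behind
y_stripped = y.strip()   # still just the digits

print(repr(sliced), repr(stripped))
print(repr(y_sliced), repr(y_stripped))
```

Slicing is fine when the surrounding whitespace is known to be exactly one character on each side; strip() is the more forgiving choice when the layout of the HTML may vary.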
