Using BeautifulSoup decompose() to remove multiple unwanted tags

Published 2024-10-01 15:38:41


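I am trying to clean up some HTML so that all I am left with is the relevant text. The code below strips the superscript tags with the first function defined. I would like to do the same thing for the 'h4', 'h1', 'a' and 'li' tags before calling .get_text().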

import requests
from bs4 import BeautifulSoup
url = "https://www.biblegateway.com/passage/?search=Luke+14%3A12-14&version=NIV"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

def supClean(verseWithSup):
    for sup in verseWithSup:
        verseWithSup.sup.decompose()
    return verseWithSup

def verseExtract(soup):
    verseName = soup.find(class_="passage-display-bcv").get_text()
    verseWithSup = soup.find(class_="passage-text")
    verseBody = supClean(verseWithSup).get_text()
    return verseName, verseBody

verseName, verseBody = (verseExtract(soup))

print(verseName)
print(verseBody)

I currently get:

Luke 14:12-14New International Version (NIV) Then Jesus said to his host, “When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid. But when you give a banquet, invite the poor, the crippled, the lame, the blind, 14 and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.” Cross references:Luke 14:13 : ver 21 Luke 14:14 : Ac 24:15

New International Version (NIV) Holy Bible, New International Version®, NIV® Copyright ©1973, 1978, 1984, 2011 by Biblica, Inc.® Used by permission. All rights reserved worldwide.

But I only want:

Then Jesus said to his host, “When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid. But when you give a banquet, invite the poor, the crippled, the lame, the blind, 14 and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.”

This is what the HTML looks like:

[HTML snippet missing from the source]

2 answers

Although the answer provided above works, I ended up with this code to do what I wanted in bs4.

for item in soup.select("sup, div.publisher-info-bottom.with-single"):
    item.decompose()
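
As a self-contained sketch of the same idea, the snippet below removes several tag types in one pass before calling get_text(). The HTML string and the strip_tags helper are made up for illustration and are not taken from the Bible Gateway page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div class="passage-text">
  <h1>Luke 14</h1>
  <h4>Heading</h4>
  <p><sup>12</sup>Then Jesus said to his host, <a href="#fn">note</a></p>
  <ul><li>Cross reference</li></ul>
</div>
"""


def strip_tags(node, tag_names):
    """Remove every descendant of `node` whose tag name is in `tag_names`."""
    # find_all returns a list of matches, so decomposing inside the loop
    # does not disturb the iteration.
    for tag in node.find_all(tag_names):
        tag.decompose()
    return node


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
passage = soup.find(class_="passage-text")
strip_tags(passage, ["h1", "h4", "a", "li", "sup"])

# Collapse the leftover whitespace into single spaces.
cleaned = " ".join(passage.get_text().split())
print(cleaned)
```

Passing a list to find_all matches any of the listed tag names, which is roughly what the comma-separated selector in soup.select does above.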

I then got the text with the following and formatted it the way I wanted.

[code block missing from the source]

Give this a try. Let me know if you also want to get rid of the verse numbers.

from bs4 import BeautifulSoup
import requests

link = "https://www.biblegateway.com/passage/?search=Luke+14%3A12-14&version=NIV"

soup = BeautifulSoup(requests.get(link).text, "lxml")
# Select every element whose id starts with "en-NIV-".
for item in soup.select("[id^='en-NIV-']"):
    print(item.text.strip())

Output:

[output missing from the source]
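
The [id^='en-NIV-'] selector matches any element whose id attribute starts with en-NIV-. A minimal sketch on made-up markup (the ids and text below are illustrative, not copied from the page) shows what it picks up and why the verse numbers are still present at this stage:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the id prefix mirrors the selector in the answer.
SAMPLE_HTML = """
<p>
  <span id="en-NIV-1" class="text"><sup>12</sup>First verse.</span>
  <span id="en-NIV-2" class="text"><sup>13</sup>Second verse.</span>
  <span id="footer-1">Publisher info</span>
</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The prefix selector skips the span whose id does not start with "en-NIV-".
verses = [item.get_text().strip() for item in soup.select("[id^='en-NIV-']")]
print(verses)
```

The sup text is part of each matched element, so the numbers remain until the sup tags are removed.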

Or, to get rid of the verse numbers, you can try the following:

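import requests
from lxml.html import fromstring

link = "https://www.biblegateway.com/passage/?search=Luke+14%3A12-14&version=NIV"
root = fromstring(requests.get(link).text)
for item in root.cssselect("[id^='en-NIV-'],.woj"):
    # Drop the child elements; drop_tree() keeps the text that follows each one.
    # list(item) takes a snapshot so removing children does not disturb the loop.
    for data in list(item):
        data.drop_tree()
    print(item.text_content())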

Result:

Then Jesus said to his host, 
“When you give a luncheon or dinner, do not invite your friends, your brothers or sisters, your relatives, or your rich neighbors; if you do, they may invite you back and so you will be repaid.
But when you give a banquet, invite the poor, the crippled, the lame, the blind,
and you will be blessed. Although they cannot repay you, you will be repaid at the resurrection of the righteous.”
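
The lxml version relies on drop_tree() removing an element while keeping its tail text, which is why the words after each verse number survive. A small sketch on made-up markup, assuming lxml is installed:

```python
from lxml.html import fromstring

# Hypothetical single-verse fragment with an inline verse number.
verse = fromstring(
    "<p>But when you give a banquet, <sup>14</sup>and you will be blessed.</p>"
)

# Snapshot the children before dropping them; drop_tree() merges each
# child's tail text back into the parent instead of discarding it.
for child in list(verse):
    child.drop_tree()

text = verse.text_content()
print(text)
```

Calling verse.remove(child) instead would discard the tail text along with the element, so the second half of the sentence would be lost.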
