Extracting a custom XML tag

Posted 2024-10-02 22:36:55


Below is the content of one item tag from an XML file. How can I extract the media:content tag using BeautifulSoup?

<item>
            <title>How Kerala is preparing for monsoon amid the COVID-19 pandemic</title>
            <link/>https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007
                  <description>Usually, Kerala begins its procedure for monsoon preparedness by January. This year, however, the officials got busy with preparing for a health crisis instead. “Kerala works six months and fights the monsoon in the other six months,” says Sekhar Kuriakose, member secretary of the Kerala State Disaster Management Authority (KSDMA). Usually, Kerala begins its monsoon preparedness by January, even before the India Meteorological Department (IMD) makes its first long-range forecast for southwe...</description>
            <pubdate>Thu, 21 May 2020 10:30:00 GMT</pubdate>
            <guid>https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007</guid>
            <media:content medium="image" url="https://www.thenewsminute.com/sites/default/files/Kerala-rain-trivandrum-1200.jpg" width="600"></media:content>
</item>

1 answer
User
Answer 1 · posted 2024-10-02 22:36:55

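Your problem is probably how BS4 handles namespaces with the parser backend you are using. The "xml" backend is namespace-aware and treats media as a namespace prefix, while "lxml" parses the input as HTML and keeps media:content as a literal tag name. Specifying "lxml" instead of "xml" therefore lets you use find() and find_all() the way you would expect in this case.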

Here t is a string holding the XML you provided:

from bs4 import BeautifulSoup
# "xml" selects the namespace-aware XML parser
soup = BeautifulSoup(t, "xml")
print(soup.find_all("media:content"))

produces

[]

With the lxml parser, however, it is able to find the element:

# "lxml" selects the HTML parser, which keeps the prefixed tag name
soup = BeautifulSoup(t, "lxml")
print(soup.find_all("media:content"))

produces

[<media:content medium="image" (...)></media:content>]
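Since the goal is usually the image URL rather than the tag itself, here is a minimal sketch of reading the url attribute from each match. It assumes the bs4 and lxml packages are installed; the helper name media_urls is hypothetical, and t is a trimmed copy of the item from the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <item> from the question (only the fields needed here).
t = """
<item>
    <title>How Kerala is preparing for monsoon amid the COVID-19 pandemic</title>
    <media:content medium="image" url="https://www.thenewsminute.com/sites/default/files/Kerala-rain-trivandrum-1200.jpg" width="600"></media:content>
</item>
"""


def media_urls(xml_text):
    """Hypothetical helper: collect the url attribute of every media:content tag."""
    # The "lxml" backend parses the text as HTML, so the tag name stays "media:content".
    soup = BeautifulSoup(xml_text, "lxml")
    return [
        tag["url"]
        for tag in soup.find_all("media:content")
        if tag.has_attr("url")  # skip tags that carry no url attribute
    ]


urls = media_urls(t)
print(urls)
```

Other attributes such as medium or width can be read the same way, for example tag["width"] or tag.get("width") if the attribute may be missing.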
