Converting an XML-formatted website into a DataFrame

Published 2024-10-01 05:01:55


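I am trying to convert the following website into a DataFrame so that I can work with the data: https://www.ifsqn.com/forum/index.php/rss/forums/4-food-safety-quality-discussion/

Everywhere I look online, I only find how to convert an XML file into a DataFrame. I tried the following, but it does not work because this is not an XML file. I can handle the pandas part myself, but first I need the data to work with.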

import requests
import xml.etree.ElementTree as ET

headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get("https://www.ifsqn.com/forum/index.php/rss/forums/4-food-safety-quality-discussion/",headers=headers)

c = r.content

root = ET.parse(r).getroot()

print(root)

Which steps am I missing to turn the XML into a readable format so that I can load the data into a DataFrame?

Any input is much appreciated.


1 Answer
User
#1 · Posted 2024-10-01 05:01:55

The XML you want to parse is RSS. Because RSS has a specific format, you can use a Python library that parses RSS feeds (feedparser, for example).

import feedparser
import pandas as pd

parsed_rss = feedparser.parse('https://www.ifsqn.com/forum/index.php/rss/forums/4-food-safety-quality-discussion/')

pd.DataFrame(parsed_rss['entries'])
                                                title                                       title_detail  ...                                                 id guidislink
0                      Monitored vs Verifying Records  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
1   Is it necessary to follow the new ISO 22000 to...  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
2                      usda inspector tagging product  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
3                              Chocolate Liquor Discs  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
4                              Multi-Pack Beef Sticks  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
..                                                ...                                                ...  ...                                                ...        ...
95  HACCP Pan for super critical fluid extraction ...  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
96               Illegal Drugs Pictured on Food Label  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
97    BRC metal can packaging compliance requirements  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
98  Codex Decision tree in ISO 22000:2018 - Clause...  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
99           BRC clause 4.3.4 - Battery Charging area  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False

[100 rows x 10 columns]

Another approach is to parse the XML yourself into a structure that can be used to construct a DataFrame (example here).
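As a rough sketch of that manual approach, the standard-library ElementTree module can turn each RSS <item> into a dictionary, and a list of such dictionaries is something pd.DataFrame accepts directly. The sample feed, the helper names rss_to_records and records_to_dataframe, and the example.com links are all hypothetical and only stand in for the real response bytes (r.content in the question).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the bytes in r.content.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample forum</title>
    <item>
      <title>First topic</title>
      <link>https://example.com/topic/1</link>
      <pubDate>Mon, 30 Sep 2024 10:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second topic</title>
      <link>https://example.com/topic/2</link>
      <pubDate>Mon, 30 Sep 2024 11:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def rss_to_records(xml_bytes):
    """Return one dict per <item>, mapping each child tag to its text."""
    root = ET.fromstring(xml_bytes)
    records = []
    for item in root.iter("item"):
        # Namespaced children (if any) keep their "{namespace}tag" form as keys.
        records.append({child.tag: (child.text or "").strip() for child in item})
    return records


def records_to_dataframe(records):
    """Build a DataFrame with one row per record (requires pandas)."""
    import pandas as pd  # imported here so the parsing step works without pandas

    return pd.DataFrame(records)


records = rss_to_records(SAMPLE_RSS)
print(records[0]["title"])
```

With the real feed, the same helper could be called as rss_to_records(r.content), and records_to_dataframe(records) would then give one row per item with a column for each child tag. Unlike feedparser, this sketch does no date parsing or feed-format normalization.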

Edit:

I now see that, in the following line, you passed r (the Response object) rather than c (its raw content):

root = ET.parse(r).getroot()
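A minimal sketch of parsing the response bytes rather than the Response object. ET.parse expects a file name or file object, so bytes held in memory go to ET.fromstring, or are wrapped in io.BytesIO first. The inline byte string below is a hypothetical stand-in for c = r.content so the snippet runs without network access.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for c = r.content (the raw bytes of the response).
c = b"<rss version='2.0'><channel><title>Sample forum</title></channel></rss>"

# ET.fromstring parses XML held in memory and returns the root element directly.
root = ET.fromstring(c)

# ET.parse expects a file name or file object, so the bytes are wrapped in BytesIO.
root_from_file = ET.parse(io.BytesIO(c)).getroot()

print(root.tag)
print(root.find("channel/title").text)
```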
