Extracting data from XML with Python

Published 2024-09-29 21:30:02

I am trying to get values from this web page:

<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://tempuri.org/">
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-01T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28671555</Value>
<ValueDetail>4415</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-02T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28675970</Value>
<ValueDetail>4279</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-03T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28680249</Value>
<ValueDetail>3975</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-04T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28684224</Value>
<ValueDetail>4236</ValueDetail>
</vwHistoryDetail>
</ArrayOfVwHistoryDetail>

I tested with the following code:

import xml.etree.ElementTree as ET
from urllib import request


url = "http://SomeSite/WebService.asmx/LoadVariableHistory?username=USERNAME&password=PASSWORD&variableName=CBT2_G_PRM_FB2&startDateTime=2020-12-01&endDateTime=2020-12-02&sampling=3"

print ("Obter: ", url)
html = request.urlopen(url)
data = html.read()
print("Obtido: ",len(data),"caracteres")

tree = ET.fromstring(data)
results = tree.findall('Value')
for i in results:
  print(i)

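For security reasons I have hidden the full URL. What am I doing wrong that I do not get the values? I need to finish this part so that I can build a dictionary of DateTime:Value pairs.

Thanks in advance.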


Tags: org tree http url data datetime value www
3 answers

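There are several issues with your current implementation:

  • Your XML contains a default namespace, xmlns="http://tempuri.org/", so you must define a prefix for it in order to select nodes by name. findall accepts a namespaces argument for this purpose.
  • Your path expression assumes Value is a child of the root. Use the double-slash path .// instead, because Value is a descendant of the root, not a direct child.
  • You need to read the text attribute of the loop variable. Otherwise you get <Element ... > objects, which are rarely useful as a final result.

Consider this adjustment: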

tree = ET.fromstring(data)
nmsp = {'doc': 'http://tempuri.org/'}                         # NAMESPACE PREFIX ASSIGNMENT
results = tree.findall('.//doc:Value', namespaces = nmsp)     # NAMESPACE PREFIX USE WITH './/' PATH 
for i in results:
  print(i.text)                                               # RETRIEVE TEXT VALUE

# 28671555
# 28675970
# 28680249
# 28684224
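
To see why a bare findall('Value') comes back empty, it can help to print the tags ElementTree actually stores. The sketch below uses a shortened copy of the XML from the question (an assumption for illustration; the real response comes from the web service) and shows that every tag carries the namespace URI in braces.

```python
import xml.etree.ElementTree as ET

# Shortened sample of the response from the question, for illustration only.
SAMPLE = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

root = ET.fromstring(SAMPLE)

# The default namespace becomes part of every stored tag name.
print(root.tag)            # {http://tempuri.org/}ArrayOfVwHistoryDetail

# A bare name does not match a namespaced element, even with './/'.
unqualified = root.findall('.//Value')
print(len(unqualified))    # 0

# The fully qualified form matches without any prefix mapping.
qualified = root.findall('.//{http://tempuri.org/}Value')
print([el.text for el in qualified])   # ['28671555']
```

Passing a prefix mapping through the namespaces argument is shorthand for this fully qualified form.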

Better yet, return a list of dictionaries containing Value and its siblings (the split call removes the default namespace from the dict keys):

data_list_of_dicts = [{i.tag.split('}')[-1]: i.text for i in hd} 
                        for hd in tree.findall('.//doc:vwHistoryDetail', namespaces = nmsp)]

print(data_list_of_dicts)
# [{'idVariable': '2561', 'DateTime': '2020-12-01T00:00:00', 'idPeriodType': '1', 'Value': '28671555', 'ValueDetail': '4415'}, 
#  {'idVariable': '2561', 'DateTime': '2020-12-02T00:00:00', 'idPeriodType': '1', 'Value': '28675970', 'ValueDetail': '4279'}, 
#  {'idVariable': '2561', 'DateTime': '2020-12-03T00:00:00', 'idPeriodType': '1', 'Value': '28680249', 'ValueDetail': '3975'}, 
#  {'idVariable': '2561', 'DateTime': '2020-12-04T00:00:00', 'idPeriodType': '1', 'Value': '28684224', 'ValueDetail': '4236'}]

For a dictionary of values keyed by time:

time_value_dict = {hd.find('doc:DateTime', namespaces=nmsp).text: 
                   hd.find('doc:Value', namespaces=nmsp).text 
                      for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)}

print(time_value_dict)
# {'2020-12-01T00:00:00': '28671555', 
#  '2020-12-02T00:00:00': '28675970', 
#  '2020-12-03T00:00:00': '28680249', 
#  '2020-12-04T00:00:00': '28684224'}
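
If typed keys and values are more convenient than strings, the same lookup can be wrapped in a small function. This is only a sketch: parse_history and the shortened SAMPLE are names invented here, and it assumes DateTime is always in ISO 8601 form and Value is always an integer, as in the data shown in the question.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened sample of the response from the question, for illustration only.
SAMPLE = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

NS = {'doc': 'http://tempuri.org/'}


def parse_history(xml_text):
    """Return {datetime: int} built from each vwHistoryDetail record."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.findall('doc:vwHistoryDetail', NS):
        stamp = record.findtext('doc:DateTime', namespaces=NS)
        value = record.findtext('doc:Value', namespaces=NS)
        if stamp is None or value is None:
            continue  # skip incomplete records instead of raising
        result[datetime.fromisoformat(stamp)] = int(value)
    return result


history = parse_history(SAMPLE)
print(history)
```

findtext returns None for a missing child, which avoids the AttributeError that find(...).text would raise on an incomplete record.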


tree = ET.fromstring(data)
ns = {'doc': 'http://tempuri.org/'}   # the document's default namespace
for detail in tree.findall('doc:vwHistoryDetail', ns):
  v = detail.find('doc:Value', ns).text
  print(v)

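It is better to loop over the record elements and extract their children than to fetch the children directly, because Value may be a tag that is reused in other parts of the document.

See below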
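
The difference shows up as soon as the tag appears elsewhere. In the sketch below the Summary element is hypothetical, invented only to illustrate the point; it is not part of the response shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <Summary> is invented for illustration only.
SAMPLE = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <Summary><Value>99</Value></Summary>
  <vwHistoryDetail><Value>28671555</Value></vwHistoryDetail>
  <vwHistoryDetail><Value>28675970</Value></vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

NS = {'doc': 'http://tempuri.org/'}
root = ET.fromstring(SAMPLE)

# A document-wide search also picks up the unrelated Summary value.
everywhere = [el.text for el in root.findall('.//doc:Value', NS)]
print(everywhere)   # ['99', '28671555', '28675970']

# Looping over the records first keeps only the values that belong to them.
per_record = [rec.findtext('doc:Value', namespaces=NS)
              for rec in root.findall('doc:vwHistoryDetail', NS)]
print(per_record)   # ['28671555', '28675970']
```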

import xml.etree.ElementTree as ET
import re

#
xml = '''<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                 xmlns="http://tempuri.org/">
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-01T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28671555</Value>
      <ValueDetail>4415</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-02T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28675970</Value>
      <ValueDetail>4279</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-03T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28680249</Value>
      <ValueDetail>3975</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-04T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28684224</Value>
      <ValueDetail>4236</ValueDetail>
   </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''
# Strip the default namespace declaration so that plain tag names match
xml = re.sub(' xmlns="[^"]+"', '', xml, count=1)
root = ET.fromstring(xml)
data = {v.find('DateTime').text: v.find('Value').text for v in root.findall('.//vwHistoryDetail')}
print(data)

Output

{'2020-12-01T00:00:00': '28671555', '2020-12-02T00:00:00': '28675970', '2020-12-03T00:00:00': '28680249', '2020-12-04T00:00:00': '28684224'}
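
As an alternative to editing the XML text with a regular expression, ElementTree's path syntax accepts a {*} namespace wildcard. The sketch below assumes Python 3.8 or later, where that wildcard is supported, and uses a shortened copy of the sample XML.

```python
import xml.etree.ElementTree as ET

# Shortened sample of the response from the question, for illustration only.
SAMPLE = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

root = ET.fromstring(SAMPLE)

# '{*}name' matches the tag 'name' in any namespace (or in none).
data = {rec.find('{*}DateTime').text: rec.find('{*}Value').text
        for rec in root.findall('{*}vwHistoryDetail')}
print(data)
# {'2020-12-01T00:00:00': '28671555', '2020-12-02T00:00:00': '28675970'}
```

This leaves the document untouched, so it keeps working if the namespace declaration is formatted differently from what the regular expression expects.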
