Extracting an XML file (TED Europe) into a DataFrame

Posted 2024-10-04 07:34:00


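I have an XML file from TED Europe (for example: TED Europa XML Files (Login required)). The XML file contains public procurement contracts.

My question is how to parse the XML file into a DataFrame. So far I have tried to do this with the ElementTree package.

However, since I am still a beginner, I am having trouble extracting the information, because the relevant text is only wrapped in "P" tags.

How can I extract this information for the English version, so that each "TI_MARK" becomes a column header and the content of the matching "TXT_MARK" and its "P" tags becomes the row? Further rows will later be filled with information from other public procurement XML files.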

<FORM_SECTION>
  <OTH_NOT LG="DA" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
  <OTH_NOT LG="DE" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
  <OTH_NOT LG="EN" VERSION="R2.0.8.S03.E01" CATEGORY="ORIGINAL">
        <FD_OTH_NOT>
          <TI_DOC>
            <P>BE-Brussels: IPA - Improved implementation of animal health, food safety and phytosanitary legislation and corresponding information systems</P>
          </TI_DOC>
          <STI_DOC>
            <P>Location — The former Yugoslav Republic of Macedonia</P>
          </STI_DOC>
          <STI_DOC>
            <P>SERVICE CONTRACT NOTICE</P>
          </STI_DOC>
          <CONTENTS>
            <GR_SEQ>
              <TI_GRSEQ>
                <BLK_BTX/>
              </TI_GRSEQ>
              <BLK_BTX_SEQ>
                <MARK_LIST>
                  <MLI_OCCUR NO_SEQ="001">
                    <NO_MARK>1.</NO_MARK>
                    <TI_MARK>Publication reference</TI_MARK>
                    <TXT_MARK>
                      <P>EuropeAid/139253/DH/SER/MK</P>
                    </TXT_MARK>
                  </MLI_OCCUR>
                  <MLI_OCCUR NO_SEQ="002">
                    <NO_MARK>2.</NO_MARK>
                    <TI_MARK>Procedure</TI_MARK>
                    <TXT_MARK>
                      <P>Restricted</P>
                    </TXT_MARK>
                  </MLI_OCCUR>

My code so far:

import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

tree = ET.parse('196658_2018.xml')

# Print the tree object
print(tree)

# Alternative: tree = ET.ElementTree(file='196658_2018.xml')
root = tree.getroot()

# Print the root element
print(root)

# Print the text of every document title paragraph
for element in root.findall('{ted/R2.0.8.S03/publication}FORM_SECTION/{ted/R2.0.8.S03/publication}OTH_NOT/{ted/R2.0.8.S03/publication}FD_OTH_NOT/{ted/R2.0.8.S03/publication}TI_DOC/{ted/R2.0.8.S03/publication}P'):
    print(element.text)

Strangely, the extraction only works if I add {ted/R2.0.8.S03/publication} to every element of the path.
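The prefix is needed because the document declares a default XML namespace, and ElementTree stores every tag as `{namespace}tag`. A minimal sketch of a shorter way to write the same path, using a prefix map passed to `findall`. The sample XML below is a trimmed, hypothetical stand-in for the real file, the root element name `TED_EXPORT` and the `ted` prefix are assumptions, and the namespace string is copied as it appears in the question:

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample; the root element name is an assumption.
SAMPLE = """<TED_EXPORT xmlns="ted/R2.0.8.S03/publication">
  <FORM_SECTION>
    <OTH_NOT LG="EN" CATEGORY="ORIGINAL">
      <FD_OTH_NOT>
        <TI_DOC>
          <P>BE-Brussels: IPA</P>
        </TI_DOC>
      </FD_OTH_NOT>
    </OTH_NOT>
  </FORM_SECTION>
</TED_EXPORT>"""

root = ET.fromstring(SAMPLE)

# The default namespace ends up inside every tag name.
print(root.tag)

# Map an arbitrary prefix (here "ted") to the namespace once,
# then use the prefix in the path instead of the full {uri}.
ns = {"ted": "ted/R2.0.8.S03/publication"}
path = "ted:FORM_SECTION/ted:OTH_NOT/ted:FD_OTH_NOT/ted:TI_DOC/ted:P"
titles = [p.text for p in root.findall(path, ns)]
print(titles)
```

On Python 3.8 or later, `{*}TI_DOC` can also be used in a path to match a tag in any namespace, which may be convenient if the schema version in the namespace changes between files.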

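Next, I am having trouble writing a function that covers all the paths that hold information and appends the results to a DataFrame. Ideally, only the English version should be extracted.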
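One possible approach, sketched under assumptions: select the `OTH_NOT` element whose `LG` attribute is `EN`, then walk its `MLI_OCCUR` items and use each `TI_MARK` as a key and the joined `P` texts of `TXT_MARK` as the value. The function name `notice_to_row`, the `ted` prefix, the root element name and the sample XML (including the German entry) are all made up for illustration:

```python
import xml.etree.ElementTree as ET

NS = {"ted": "ted/R2.0.8.S03/publication"}

# Trimmed, hypothetical sample with one German and one English notice.
SAMPLE = """<TED_EXPORT xmlns="ted/R2.0.8.S03/publication">
  <FORM_SECTION>
    <OTH_NOT LG="DE" CATEGORY="TRANSLATION">
      <MLI_OCCUR NO_SEQ="001">
        <TI_MARK>Referenz</TI_MARK>
        <TXT_MARK><P>EuropeAid/139253/DH/SER/MK</P></TXT_MARK>
      </MLI_OCCUR>
    </OTH_NOT>
    <OTH_NOT LG="EN" CATEGORY="ORIGINAL">
      <MLI_OCCUR NO_SEQ="001">
        <TI_MARK>Publication reference</TI_MARK>
        <TXT_MARK><P>EuropeAid/139253/DH/SER/MK</P></TXT_MARK>
      </MLI_OCCUR>
      <MLI_OCCUR NO_SEQ="002">
        <TI_MARK>Procedure</TI_MARK>
        <TXT_MARK><P>Restricted</P></TXT_MARK>
      </MLI_OCCUR>
    </OTH_NOT>
  </FORM_SECTION>
</TED_EXPORT>"""


def notice_to_row(root, lang="EN"):
    """Return {TI_MARK text: TXT_MARK text} for the notice in the given language."""
    row = {}
    notice = root.find(f".//ted:OTH_NOT[@LG='{lang}']", NS)
    if notice is None:
        return row  # no notice in that language
    for item in notice.iterfind(".//ted:MLI_OCCUR", NS):
        title = item.findtext("ted:TI_MARK", default="", namespaces=NS).strip()
        # A TXT_MARK may hold several P elements; join them with newlines.
        paragraphs = [
            "".join(p.itertext()).strip()
            for p in item.iterfind("ted:TXT_MARK/ted:P", NS)
        ]
        if title:
            row[title] = "\n".join(paragraphs)
    return row


row = notice_to_row(ET.fromstring(SAMPLE))
print(row)
```

Collecting one such dict per file in a list and passing the list to `pd.DataFrame` should give one row per notice, with the `TI_MARK` titles as columns and missing values where a notice lacks an item.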

For another part of the XML file, I used a function like this:

from lxml import etree
import pandas as pd

def parse_xml_fields(file, base_tag, tag_list, final_list):
    """Append one dict per base_tag element, holding the text of each tag in tag_list."""
    tree = etree.parse(file)
    for node in tree.findall(".//{}".format(base_tag)):
        item = {}
        for tag in tag_list:
            child = node.find(".//{}".format(tag))
            # Skip tags that are missing or have no text
            if child is not None and child.text:
                item[tag] = child.text.strip()
        final_list.append(item)

# My variables
field_list = ["{ted/R2.0.8.S03/publication}TI_CY", "{ted/R2.0.8.S03/publication}TI_TOWN", "{ted/R2.0.8.S03/publication}TI_TEXT"]
entities_list = []

parse_xml_fields("196658_2018.xml", "{ted/R2.0.8.S03/publication}ML_TI_DOC", field_list, entities_list)

df = pd.DataFrame(entities_list, columns=field_list)
print(df)

# Better column names
df.columns = ['Country', 'Town', 'Text']

df.to_csv("TED_Europa_List.csv", sep=',', encoding='utf-8')

For that section, however, the paths and tags are easier to tell apart, because the tags are already named after their content.

