解析从XML到sqli的层次信息

2024-09-30 01:23:29 发布

您现在位置:Python中文网/ 问答频道 /正文

我发布的脚本会很长,所以我请求您耐心等待。我相信对于那些熟悉XML结构复杂性的人来说,这是很容易解决的。我有一个高度分层的XML文件。我很乐意在此基础上制作sqlite数据库。到目前为止,我有两个文件从具有这种结构的XML文件中提取同级信息(注意,工作频率从0到4或5不等;节点也可以是空的):

<program>
    <id>28798031</id>
    <programID>12345</programID>
    <orchestra>New York Philarmonic</orchestra>
    <season>1842-43</season>
    <concertInfo>
        <eventType>Subscription Season</eventType>
        <Location>Manhattan, NY</Location>
        <Venue>Apollo Rooms</Venue>
        <Date>1842-12-07T05:00:00Z</Date>
        <Time>8:00PM</Time>
    </concertInfo>
    <worksInfo>
        <work ID="52446*">
            <composerName>Beethoven,  Ludwig  van</composerName>
            <workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle>
            <conductorName>Hill, Ureli Corelli</conductorName>
        </work>
        <work ID="8834*4">
            <composerName>Weber,  Carl  Maria Von</composerName>
            <workTitle>OBERON</workTitle>
            <movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement>
            <conductorName>Timm, Henry C.</conductorName>
            <soloists>
                <soloist>
                    <soloistName>Otto, Antoinette</soloistName>
                    <soloistInstrument>Soprano</soloistInstrument>
                    <soloistRoles>S</soloistRoles>
                </soloist>
            </soloists>
        </work>
    </worksInfo>
</program>

我有两个脚本从同级节点(相同的层次结构级别)提取数据。在

首先

^{pr2}$

import sqlite3
import xml.etree.cElementTree

def insert_work(db_conn, workID, composerName, workTitle, movement, conductorName):
curs = db_conn.cursor()
curs.execute("insert into work values (?,?,?,?,?)", (workID, composerName, workTitle, movement, conductorName))
db_conn.commit()

def work_data_from_element(element):
workID = element.get("ID")
if workID != None:
    workID = workID
else:
    workID = ''
composerName = element.find("composerName")
if composerName != None:
    composerName = composerName.text
else:
    composerName = ''
workTitle = element.find("workTitle")
if workTitle != None:
    workTitle = workTitle.text
else:
    workTitle = ''
movement = element.find("movement")
if movement != None:
    movement = movement.text
else:
    movement = ''
conductorName = element.find("conductorName")
if conductorName != None:
    conductorName = conductorName.text
else:
    conductorName = ''
return workID, composerName, workTitle, movement, conductorName


## add the main loop to get all the work information from the XML file
if __name__ == "__main__":
conn = sqlite3.connect("work.sqlite3")
programs = xml.etree.cElementTree.parse("complete.xml")
work = programs.findall("program/worksInfo/work")
for index, element in enumerate(work):
    workID, composerName, workTitle, movement, conductorName = work_data_from_element(element)
    insert_work(conn, workID, composerName, workTitle, movement, conductorName)

我的问题是——如何把这两者联系起来,这样我就可以知道programID、season、workID和conductorName等等,所有这些都在一起?谢谢您!!在


Tags: 文件textnoneifxmlelementfindconn
1条回答
网友
1楼 · 发布于 2024-09-30 01:23:29

无论何时处理需要为二维格式(如数据库表)展平的嵌套分层XML文件,请考虑使用XSLT解决方案。作为信息,XSLT是一种专门用于转换XML文件的编程语言,与大多数通用语言一样,Python可以处理XSLT1.0脚本,特别是使用lxml模块。在

因此,对于programworkSQLite表,可以考虑运行以下XSLT脚本来转换源XML文件,然后使用XPath(XSLT的同级)进行解析,以便在一个Python脚本中导入数据库。顺便说一句,XSLT是一个格式良好的XML文件,因此可以像任何其他XML一样从文件或字符串加载。XSLT的优点还在于,可以对可能存在/可能不存在的元素进行硬编码,从而为缺少的节点返回空文本。在

结构上的一些变化:

  1. 连接<concerntInfo>的所有子级(可能用作单独的表?)在
  2. 程序解析中删除了<worksInfo>,因为它反映在work表中
  3. 将相应的程序<id>添加到work解析中以用作表中的外键

编程XSLT(另存为.xsl文件,在Python中调用)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <program>
      <xsl:apply-templates select="program"/>
    </program>
  </xsl:template>

  <xsl:template match="program">
    <id><xsl:value-of select="id"/></id>
    <programID><xsl:value-of select="programID"/></programID>
    <orchestra><xsl:value-of select="orchestra"/></orchestra>
    <season><xsl:value-of select="season"/></season>
    <concertInfo><xsl:value-of select="concat(concertInfo/eventType, ' ',
                   concertInfo/Location, ' ', concertInfo/Venue, ' ', 
                   concertInfo/Date, ' ', concertInfo/Time)"/></concertInfo>
  </xsl:template>

</xsl:transform>

工作XSLT(另存为.xsl,在Python中调用)

^{pr2}$

Python脚本

import lxml.etree as ET
import sqlite3

def insert_program(db_conn, id, programID, orchestra, season, concertInfo):
    curs = db_conn.cursor()
    curs.execute("insert into program values (?,?,?,?,?,?)", (id, programID, orchestra, season, concertInfo))
    db_conn.commit()

def program_data_from_element(element):
    id = element.find("id").text
    programID = element.find("programID").text
    orchestra = element.find("orchestra").text
    season = element.find("season").text
    concertInfo = element.find("concertInfo").text

    return id, programID, orchestra, season, concertInfo

def insert_work(db_conn, programID, workID, composerName, workTitle, movement, conductorName):
    curs = db_conn.cursor()
    curs.execute("insert into work values (?,?,?,?,?)", (programID, workID, composerName, workTitle, movement, conductorName))
    db_conn.commit()

def work_data_from_element(element):
    programID = element.find("programID").text
    workID = element.find("workID").text
    composerName = element.find("composerName").text
    workTitle = element.find("workTitle").text
    movement = element.find("movement").text
    conductorName = element.find("conductorName").text

    return programID, workID, composerName, workTitle, conductorName, movement    

if __name__ == "__main__":    
    conn = sqlite3.connect("program.sqlite3")
    xml = ET.parse("complete.xml")

    # PROGRAM PARSE
    xslt = ET.parse("program.xsl")
    transform = ET.XSLT(xslt)
    newdom = transform(xml)
    program = newdom.xpath("//program")    

    for index, element in enumerate(program):
        id, programID, orchestra, season, concertInfo = program_data_from_element(element)
        insert_program(conn, id, programID, orchestra, season, concertInfo)

    # WORK PARSE
    xslt = ET.parse("work.xsl")
    transform = ET.XSLT(xslt)
    newdom = transform(xml)
    work = newdom.xpath("//work")    

    for index, element in enumerate(work):
        programID, workID, composerName, workTitle, conductorName, movement = work_data_from_element(element)
        insert_work(conn, programID, workID, composerName, workTitle, conductorName, movement)

相关问题 更多 >

    热门问题