How can I efficiently parse this huge XML file with nested elements using lxml?



I tried parsing this huge XML document with <a href="https://stackoverflow.com/questions/7327924/how-to-efficiently-store-this-parsed-xml-document-in-mysql-database-using-python">XML minidom</a>. It worked fine on a sample file, but it choked the system when I ran it on the real file (about 400 MB).

I then tried adapting the code from <a href="https://codereview.stackexchange.com/questions/2449/parsing-huge-xml-file-with-lxml-etree-iterparse-in-python">codereview</a>, which processes the data as a stream instead of loading it all into memory at once. Because the elements in my XML file are nested, I am having trouble isolating the data sets. I have worked with simple XML files before, but not with a memory-intensive task like this one.

Is this the right approach? How do I associate the Inventory and Publisher IDs with each book? That is how I plan to relate the two tables in the end.

Any feedback is much appreciated.

book.xml:

<pre><code><BookDatabase>
    <BookHeader>
        <Name>BookData</Name>
        <BookUniverse>All</BookUniverse>
        <AsOfDate>2010-05-02</AsOfDate>
        <Version>1.1</Version>
    </BookHeader>
    <InventoryBody>
        <Inventory ID="12">
            <PublisherClass ID="34">
                <Publisher>
                    <PublisherDetails>
                        <Name>Microsoft Press</Name>
                        <Type>Tech</Type>
                        <ID>7462</ID>
                    </PublisherDetails>
                </Publisher>
            </PublisherClass>
            <BookList>
                <Listing>
                    <BookListSummary>
                        <Date>2009-01-30</Date>
                    </BookListSummary>
                    <Book>
                        <BookDetail ID="67">
                            <BookName>Code Complete 2</BookName>
                            <Author>Steve McConnell</Author>
                            <Pages>960</Pages>
                            <ISBN>0735619670</ISBN>
                        </BookDetail>
                        <BookDetail ID="78">
                            <BookName>Application Architecture Guide 2</BookName>
                            <Author>Microsoft Team</Author>
                            <Pages>496</Pages>
                            <ISBN>073562710X</ISBN>
                        </BookDetail>
                    </Book>
                </Listing>
            </BookList>
        </Inventory>
        <Inventory ID="64">
            <PublisherClass ID="154">
                <Publisher>
                    <PublisherDetails>
                        <Name>O'Reilly Media</Name>
                        <Type>Tech</Type>
                        <ID>7484</ID>
                    </PublisherDetails>
                </Publisher>
            </PublisherClass>
            <BookList>
                <Listing>
                    <BookListSummary>
                        <Date>2009-03-30</Date>
                    </BookListSummary>
                    <Book>
                        <BookDetail ID="98">
                            <BookName>Head First Design Patterns</BookName>
                            <Author>Kathy Sierra</Author>
                            <Pages>688</Pages>
                            <ISBN>0596007124</ISBN>
                        </BookDetail>
                    </Book>
                </Listing>
            </BookList>
        </Inventory>
    </InventoryBody>
</BookDatabase>
</code></pre>

Python code: ^{pr2}$

Python output:

<pre><code>$ python lxmletree_book.py
========> 0 <=======
========> 1 <=======
{'ID': '12', 'element': 'Inventory'}
[]
========> 2 <=======
{'ID': '34', 'element': 'PublisherClass'}
[]
========> 3 <=======
{'Name': <Element Name at 0x105140af0>, 'Type': <Element Type at 0x105140b40>, 'ID': <Element ID at 0x105140b90>, 'element': 'PublisherDetails'}
[]
========> 4 <=======
{'ID': None, 'element': 'PublisherDetails'}
[]
========> 5 <=======
{'ID': None, 'element': 'PublisherClass'}
[]
========> 6 <=======
{'ISBN': <Element ISBN at 0x105140eb0>, 'Name': <Element Name at 0x105140dc0>, 'Author': <Element Author at 0x105140e10>, 'ID': '67', 'element': 'BookDetail', 'Pages': <Element Pages at 0x105140e60>}
[]
========> 7 <=======
{'ID': None, 'element': 'BookDetail'}
[]
========> 8 <=======
{'ISBN': <Element ISBN at 0x1051460a0>, 'Name': <Element Name at 0x105140f50>, 'Author': <Element Author at 0x105140fa0>, 'ID': '78', 'element': 'BookDetail', 'Pages': <Element Pages at 0x105146050>}
[]
========> 9 <=======
{'ID': None, 'element': 'BookDetail'}
[]
========> 10 <=======
{'ID': None, 'element': 'Inventory'}
[]
========> 11 <=======
{'ID': '64', 'element': 'Inventory'}
[]
========> 12 <=======
{'ID': '154', 'element': 'PublisherClass'}
[]
========> 13 <=======
{'Name': <Element Name at 0x105146230>, 'Type': <Element Type at 0x105146280>, 'ID': <Element ID at 0x1051462d0>, 'element': 'PublisherDetails'}
[]
========> 14 <=======
{'ID': None, 'element': 'PublisherDetails'}
[]
========> 15 <=======
{'ID': None, 'element': 'PublisherClass'}
[]
========> 16 <=======
{'ISBN': <Element ISBN at 0x1051465f0>, 'Name': <Element Name at 0x105146500>, 'Author': <Element Author at 0x105146550>, 'ID': '98', 'element': 'BookDetail', 'Pages': <Element Pages at 0x1051465a0>}
[]
========> 17 <=======
{'ID': None, 'element': 'BookDetail'}
[]
========> 18 <=======
{'ID': None, 'element': 'Inventory'}
[]
========> 19 <=======
</code></pre>

Desired output (eventually stored in MySQL; Python lists for now):

<pre><code>Publishers
InventoryID  PublisherClassID  Name             Type  ID
12           34                Microsoft Press  Tech  7462
64           154               O'Reilly Media   Tech  7484

Books
PublisherID  BookDetailID  Name                              Author           Pages  ISBN
7462         67            Code Complete 2                   Steve McConnell  960    0735619670
7462         78            Application Architecture Guide 2  Microsoft Team   496    073562710X
7484         98            Head First Design Patterns        Kathy Sierra     688    0596007124
</code></pre>


1 answer

Anonymous, 1 day ago

You could try something like this:

<pre><code>import MySQLdb
from lxml import etree

import config


def fast_iter(context, func, args=(), kwargs=None):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    kwargs = kwargs or {}
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Free the processed element, then drop the already-processed
        # siblings that the parent still references.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def extract_paper_elements(element, cursor):
    # `element` is one complete Inventory subtree.
    pub = {}
    pub['InventoryID'] = element.attrib['ID']
    try:
        pub['PublisherClassID'] = element.xpath('PublisherClass/@ID')[0]
    except IndexError:
        pub['PublisherClassID'] = None
    for key in ('Name', 'Type', 'ID'):
        try:
            pub[key] = element.xpath(
                'PublisherClass/Publisher/PublisherDetails/{k}/text()'.format(k=key))[0]
        except IndexError:
            pub[key] = None
    sql = '''INSERT INTO Publishers (InventoryID, PublisherClassID, Name, Type, ID)
             VALUES (%s, %s, %s, %s, %s)
          '''
    args = [pub.get(key) for key in
            ('InventoryID', 'PublisherClassID', 'Name', 'Type', 'ID')]
    print(args)
    # cursor.execute(sql, args)

    # Every book under this Inventory reuses pub['ID'] as its PublisherID.
    for bookdetail in element.xpath('descendant::BookList/Listing/Book/BookDetail'):
        pub['BookDetailID'] = bookdetail.attrib['ID']
        for key in ('BookName', 'Author', 'Pages', 'ISBN'):
            try:
                pub[key] = bookdetail.xpath('{k}/text()'.format(k=key))[0]
            except IndexError:
                pub[key] = None
        sql = '''INSERT INTO Books (PublisherID, BookDetailID, Name, Author, Pages, ISBN)
                 VALUES (%s, %s, %s, %s, %s, %s)
              '''
        args = [pub.get(key) for key in
                ('ID', 'BookDetailID', 'BookName', 'Author', 'Pages', 'ISBN')]
        # cursor.execute(sql, args)
        print(args)


def main():
    # tag='Inventory' makes iterparse yield only completed Inventory elements.
    context = etree.iterparse("book.xml", events=("end",), tag='Inventory')
    connection = MySQLdb.connect(
        host=config.HOST, user=config.USER,
        passwd=config.PASS, db=config.MYDB)
    cursor = connection.cursor()
    fast_iter(context, extract_paper_elements, args=(cursor,))
    cursor.close()
    connection.commit()
    connection.close()


if __name__ == '__main__':
    main()
</code></pre>

<ol>
<li>Don't use <code>fast_iter2</code>. The <a href="http://www.ibm.com/developerworks/xml/library/x-hiperfparse/" rel="nofollow">original version</a> keeps the reusable utility separate from the specific processing function (<code>extract_paper_elements</code>). <code>fast_iter2</code> mixes the two together, which leaves you with no reusable code.</li>
<li>If you set the <code>tag</code> parameter in <code>etree.iterparse("book.xml", events=("end",), tag='Inventory')</code>, the processing function <code>extract_paper_elements</code> will only see <code>Inventory</code> elements.</li>
<li>Given an Inventory element, you can use the <code>xpath</code> method to dig down and collect the data you need.</li>
<li>The <code>args</code> and <code>kwargs</code> parameters were added to <code>fast_iter</code> so that <code>cursor</code> can be passed to <code>extract_paper_elements</code>.</li>
</ol>
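If you want to check the publisher-to-book association before touching MySQL, a minimal sketch along these lines may help. It collects both tables into plain Python lists, which is what the question asks for as a first step. The names `SAMPLE`, `DETAILS` and `collect_rows` are made up for this illustration, and the sample is a trimmed copy of the question's XML. It also differs from the answer's code in two ways, purely so it can run where lxml is not installed: it filters on `elem.tag` instead of passing `tag='Inventory'`, and it uses `find`/`findtext`/`iterfind` instead of `xpath`, since the standard library's `xml.etree.ElementTree` supports those as well.

```python
import io

try:
    from lxml import etree  # the library used in the answer
except ImportError:
    # Assumption for this sketch only: fall back to the standard library.
    import xml.etree.ElementTree as etree

# Trimmed copy of the question's book.xml (one Inventory, two books).
SAMPLE = b"""<BookDatabase><InventoryBody>
  <Inventory ID="12">
    <PublisherClass ID="34"><Publisher><PublisherDetails>
      <Name>Microsoft Press</Name><Type>Tech</Type><ID>7462</ID>
    </PublisherDetails></Publisher></PublisherClass>
    <BookList><Listing><Book>
      <BookDetail ID="67">
        <BookName>Code Complete 2</BookName><Author>Steve McConnell</Author>
        <Pages>960</Pages><ISBN>0735619670</ISBN>
      </BookDetail>
      <BookDetail ID="78">
        <BookName>Application Architecture Guide 2</BookName>
        <Author>Microsoft Team</Author>
        <Pages>496</Pages><ISBN>073562710X</ISBN>
      </BookDetail>
    </Book></Listing></BookList>
  </Inventory>
</InventoryBody></BookDatabase>"""

DETAILS = "PublisherClass/Publisher/PublisherDetails/"


def collect_rows(source):
    """Stream `source` and return (publishers, books) as lists of tuples."""
    publishers, books = [], []
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "Inventory":
            # Leave child elements intact until their Inventory has ended.
            continue
        publisher_class = elem.find("PublisherClass")
        publisher_id = elem.findtext(DETAILS + "ID")
        publishers.append((
            elem.get("ID"),
            publisher_class.get("ID") if publisher_class is not None else None,
            elem.findtext(DETAILS + "Name"),
            elem.findtext(DETAILS + "Type"),
            publisher_id,
        ))
        # Each book row carries the publisher ID of its enclosing Inventory.
        for detail in elem.iterfind("BookList/Listing/Book/BookDetail"):
            books.append((
                publisher_id,
                detail.get("ID"),
                detail.findtext("BookName"),
                detail.findtext("Author"),
                detail.findtext("Pages"),
                detail.findtext("ISBN"),
            ))
        elem.clear()  # release the processed subtree
        if hasattr(elem, "getprevious"):  # lxml-only cleanup of old siblings
            while elem.getprevious() is not None:
                del elem.getparent()[0]
    return publishers, books


publishers, books = collect_rows(io.BytesIO(SAMPLE))
print(publishers)
print(books)
```

For the real file you would pass the path (for example `"book.xml"`) instead of the `io.BytesIO` wrapper. Once the lists look right, the same tuples can be handed to the commented-out `cursor.execute` calls in the answer's code.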
