Splitting a large XML file into multiple files

from lxml import etree

context = etree.iterparse('Posts.xml', tag='row', events=('end', ))
index = 0
count = 0
full_text = b""
for event, elem in context:
    count += 1
    full_text += etree.tostring(elem)
    if count >= 1000000:
        count = 0
        index += 1
        filename = format(str(index) + ".xml")
        with open(filename, 'wb') as f:
            f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(b"<root>\n")
            f.write(full_text)
            f.write(b"</root>")
        full_text = b""
with open(format(str(index+1)+".xml"), 'wb') as f:
    f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
    f.write(b"<root>\n")
    f.write(full_text)
    f.write(b"</root>")

1 answer

User

#1 · Posted 2024-09-17 02:11:56

I can't take you all the way there, but here is how I would approach it.

I would start with:

import lxml.etree as le

# content is your whole file as bytes; parse it as XML rather than HTML,
# since the HTML parser can lowercase or rearrange XML tags
tree = le.fromstring(content)

Then I would count the nodes in tree, like this:

num_nodes = tree.xpath("count(//book)") #'book' in your case would be whatever the critical item is

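Once you have that number, decide how many files to split the nodes into. Say you have 12 nodes and decide to split them into 3 files: nodes 1-4 go in file 1, nodes 5-8 in file 2, and so on. Let's focus on file 2:

From tree, you need to select the nodes at the positions assigned to file 2. So, for this file: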

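low_pos = 5
hi_pos = 8
# Pass the bounds as XPath variables ($name); bare names inside the expression
# would be read as child element names, not as the Python variables.
# The parentheses make position() count across all matched <book> nodes.
items = tree.xpath('(//book)[position() >= $low_pos and position() <= $hi_pos]', low_pos=low_pos, hi_pos=hi_pos)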

This should select the relevant nodes along with all their tags, text, and so on.
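The position ranges for every file, not only file 2, can be worked out up front. Below is a minimal sketch of that arithmetic; `chunk_ranges` is a hypothetical helper name, not something from lxml, and it assumes you want the nodes spread as evenly as possible. The `int()` call is there because XPath's `count()` returns a float.

```python
def chunk_ranges(num_nodes, num_files):
    """Return 1-based, inclusive (low_pos, hi_pos) ranges, one per file."""
    size, extra = divmod(int(num_nodes), num_files)
    ranges = []
    low = 1
    for i in range(num_files):
        # The first `extra` files take one additional node each.
        hi = low + size + (1 if i < extra else 0) - 1
        if hi >= low:  # skip empty ranges when there are more files than nodes
            ranges.append((low, hi))
        low = hi + 1
    return ranges


print(chunk_ranges(12, 3))  # [(1, 4), (5, 8), (9, 12)]
```

Each `(low_pos, hi_pos)` pair can then be used as the bounds for one file's selection.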

Finally, you take each item and do whatever you need with it:

for item in items:
    print(le.tostring(item).decode('utf-8'))  # or write it to a file, etc.

Obviously it will take a fair amount of work to implement this for your case, but hopefully it is at least a start.
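One caveat: everything above loads the whole document into memory, which may not be practical for a very large file. A streaming variant, closer to the `iterparse` loop in the question, could look roughly like the sketch below. `split_stream` and the `chunk` prefix are hypothetical names, and the tag check is done by hand so the same code also runs on the standard library's `xml.etree.ElementTree` if lxml is missing. Each element is written as soon as it is parsed and then cleared, instead of being accumulated in one large bytes object.

```python
try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

HEADER = b'<?xml version="1.0" encoding="UTF-8"?>\n<root>\n'
FOOTER = b"</root>"


def split_stream(source, tag, per_file, prefix="chunk"):
    """Stream `source` and write every `per_file` <tag> elements to a new file."""
    filenames = []
    out = None
    count = 0
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        if out is None:  # open the next output file only when there is data
            filename = f"{prefix}{len(filenames) + 1}.xml"
            out = open(filename, "wb")
            out.write(HEADER)
            filenames.append(filename)
        out.write(etree.tostring(elem))
        elem.clear()  # drop the element's text, attributes and children
        count += 1
        if count == per_file:
            out.write(FOOTER)
            out.close()
            out, count = None, 0
    if out is not None:  # finish a last, partially filled file
        out.write(FOOTER)
        out.close()
    return filenames
```

Calling `split_stream('Posts.xml', 'row', 1000000)` would correspond to the numbers in the question. Because a file is only opened when a row arrives, no empty trailing file is written when the row count is an exact multiple of `per_file`.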
