Efficient Python string replacement in a 31GB file

Posted 2024-09-28 19:08:14


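I regularly need to check large XML files that contain no line breaks and often have XML syntax problems. The files are very large: the smallest is 10GB.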

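My preferred solution is to convert these files to one record per line so they are easier to process (that is, add a newline whenever a record's closing tag is encountered). Reading the file in chunks is one approach, but it strikes me as awkward.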
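For reference, the chunked approach could look roughly like the sketch below. It is only a sketch under assumptions: the function name `add_record_newlines`, the 64 MiB default chunk size, and the choice of `</record>` as the closing tag (taken from the sample record) are illustrative. It works on raw bytes, so badly encoded text elsewhere in the file does not stop it, and it holds back the tail of each buffer so that a closing tag split across two chunks is still found.

```python
def add_record_newlines(src, dst, close_tag=b"</record>", chunk_size=64 * 1024 * 1024):
    """Copy src to dst, writing a newline after every close_tag."""
    keep = len(close_tag) - 1  # longest partial tag that can sit at the end of a buffer
    carry = b""                # bytes held back from the previous iteration
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            buf = carry + chunk
            last = buf.rfind(close_tag)
            end_of_last = last + len(close_tag) if last != -1 else 0
            # Emit everything up to the last complete tag, and never hold back
            # more than `keep` bytes, so memory stays bounded by the chunk size.
            split_at = max(end_of_last, len(buf) - keep)
            head, carry = buf[:split_at], buf[split_at:]
            fout.write(head.replace(close_tag, close_tag + b"\n"))
        # Whatever is left cannot contain a complete tag; write it unchanged.
        fout.write(carry)
```

A call such as `add_record_newlines("in.xml", "out.xml")` (both paths hypothetical) makes a single sequential pass over the input, so peak memory is governed by `chunk_size` rather than by the file size.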

Another option is to split it into multiple files that I can process with conventional methods.

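Edit: a syntactically correct sample record is included below, though records come in different types. The errors usually involve encoding problems or invalid characters. An input file contains millions of records on a single line. The desired output is a file with millions of lines, one record per line.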

<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01052nam a2200301 i 4500</leader><controlfield tag="001">in1</controlfield><controlfield tag="005">20140804085243.0</controlfield><controlfield tag="008">760811s1975    enka     b    000 0 eng  </controlfield><datafield ind1=" " ind2=" " tag="010"><subfield code="a">76364258</subfield></datafield><datafield ind1=" " ind2=" " tag="020"><subfield code="a">0900492856 :</subfield><subfield code="c">£0.50</subfield></datafield><datafield ind1=" " ind2=" " tag="035"><subfield code="a">(OCoLC)02966998</subfield></datafield><datafield ind1=" " ind2=" " tag="040"><subfield code="a">DLC</subfield><subfield code="c">DLC</subfield><subfield code="d">MTH</subfield><subfield code="d">m.c</subfield><subfield code="d">UtOrBLW</subfield></datafield><datafield ind1="0" ind2="0" tag="050"><subfield code="a">U162</subfield><subfield code="b">.A3 no.116</subfield></datafield><datafield ind1="1" ind2=" " tag="100"><subfield code="a">Rosecrance, Richard N.</subfield></datafield><datafield ind1="1" ind2="0" tag="245"><subfield code="a">Strategic deterrence reconsidered /</subfield><subfield code="c">by Richard Rosecrance.</subfield></datafield><datafield ind1=" " ind2="1" tag="264"><subfield code="a">London :</subfield><subfield code="b">International Institute for Strategic Studies,</subfield><subfield code="c">1975.</subfield></datafield><datafield ind1=" " ind2=" " tag="300"><subfield code="a">3 unnumbered pages, 37 pages :</subfield><subfield code="b">illustrations ;</subfield><subfield code="c">25 cm.</subfield></datafield><datafield ind1=" " ind2=" " tag="336"><subfield code="a">text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield ind1=" " ind2=" " tag="337"><subfield code="a">unmediated</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield ind1=" " ind2=" " 
tag="338"><subfield code="a">volume</subfield><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield ind1="1" ind2=" " tag="490"><subfield code="a">Adelphi papers ;</subfield><subfield code="v">no. 116,</subfield><subfield code="x">0567-932X</subfield></datafield><datafield ind1=" " ind2=" " tag="500"><subfield code="a">Cover title.</subfield></datafield><datafield ind1=" " ind2=" " tag="504"><subfield code="a">Includes bibliographical references.</subfield></datafield><datafield ind1=" " ind2="0" tag="650"><subfield code="a">Deterrence (Strategy)</subfield></datafield><datafield ind1=" " ind2="0" tag="650"><subfield code="a">World politics</subfield><subfield code="y">1945-1989.</subfield></datafield><datafield ind1=" " ind2="0" tag="650"><subfield code="a">World politics</subfield><subfield code="y">1989-</subfield></datafield><datafield ind1="2" ind2=" " tag="710"><subfield code="a">International Institute for Strategic Studies.</subfield></datafield><datafield ind1=" " ind2="0" tag="830"><subfield code="a">Adelphi papers</subfield><subfield code="v">no. 116.</subfield></datafield><datafield ind1="f" ind2="f" tag="952"><subfield code="d">AC Frost Stacks - AFRST</subfield></datafield><datafield ind1=" " ind2=" " tag="998"><subfield code="a">AM</subfield><subfield code="b">000000123</subfield></datafield><datafield ind1="f" ind2="f" tag="999"><subfield code="i">5f1d0f2a-0297-444c-88e1-ef7ae9795107</subfield><subfield code="s">eaa917fe-9c6f-4e14-b87f-3dd72cab1384</subfield></datafield></record>
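Once a file has one record per line, each line could be checked on its own for the kinds of errors mentioned above. The following is a minimal sketch, assuming UTF-8 input and the standard library's `xml.etree.ElementTree`; the function name `find_bad_records` and the reason strings are made up for illustration. Anything before `<record>` on a line (such as the XML declaration and the opening `collection` tag on the first line) is skipped before parsing.

```python
import xml.etree.ElementTree as ET


def find_bad_records(lines, open_tag=b"<record>"):
    """Yield (line_number, reason) for each line whose record fails to decode or parse."""
    for lineno, line in enumerate(lines, start=1):
        start = line.find(open_tag)
        if start == -1:
            continue  # no record on this line, e.g. a trailing </collection>
        record = line[start:].rstrip(b"\r\n")
        try:
            # Decoding catches encoding problems; parsing catches malformed
            # markup and characters that XML 1.0 does not allow.
            ET.fromstring(record.decode("utf-8"))
        except UnicodeDecodeError as exc:
            yield lineno, f"encoding: {exc}"
        except ET.ParseError as exc:
            yield lineno, f"syntax: {exc}"
```

Because a file opened in binary mode iterates line by line, passing `open("out.xml", "rb")` (a hypothetical path) as `lines` keeps only one record in memory at a time.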

What is an efficient way to do this?

