<p>I am writing a program that checks whether a word appears in a batch of .docx files (roughly 2,500 of them).</p>
<p>Here is the relevant part of the code:</p>
<pre><code>for filename in directorylist:
    if filename.endswith(".docx"):
        i = Document(filename)
        print(filename)
        for destination in destinationlist:
            for paragraph in i.paragraphs:
                if destination in paragraph.text:
                    destinationcount[destination] = 1
                    break
                else:
                    destinationcount[destination] = 0
                    continue
        for destination in destinationcount:
            destinationcountnobool[destination] += destinationcount[destination]
    else:
        continue
</code></pre>
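<p>Now, I know what you're thinking: this is a messy pile of loops and generally bad programming, but it's a quick-and-dirty job, so please bear with me.</p>
<p>Here is the error I get:</p>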
<pre><code>Traceback (most recent call last):
File "ICrunchMeSomeFiles.py", line 27, in <module>
i = Document(filename)
File "C:\Users\User\Anaconda3\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\package.py", line 130, in open
Unmarshaller.unmarshal(pkg_reader, package, PartFactory)
File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\package.py", line 199, in unmarshal
pkg_reader, package, part_factory
File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\package.py", line 216, in _unmarshal_parts
partname, content_type, reltype, blob, package
File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\part.py", line 191, in __new__
return PartClass.load(partname, content_type, blob, package)
File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\part.py", line 231, in load
element = parse_xml(blob)
File "C:\Users\User\Anaconda3\lib\site-packages\docx\oxml\__init__.py", line 28, in parse_xml
root_element = etree.fromstring(xml, oxml_parser)
File "src\lxml\etree.pyx", line 3236, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1764, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: AttValue length too long, line 2, column 11011745
</code></pre>
<p>The program works on smaller samples, so I assume this is a memory problem. Any help is greatly appreciated.</p>
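<p>One way to test that assumption: the traceback ends in an <code>XMLSyntaxError</code> raised while parsing a single document part, so it may be worth checking whether specific files fail on their own rather than the batch size being the cause. The sketch below is a diagnostic pass only; <code>find_unreadable</code> and its <code>loader</code> parameter are hypothetical names introduced for illustration, not part of python-docx.</p>

```python
def find_unreadable(filenames, loader):
    """Try to load each file and return (filename, error) for every failure.

    `loader` is any callable that opens one file, for example
    docx.Document. It is passed in so the check does not depend on a
    particular library.
    """
    failures = []
    for filename in filenames:
        try:
            loader(filename)
        except Exception as exc:  # deliberately broad: this is only a diagnostic pass
            failures.append((filename, f"{type(exc).__name__}: {exc}"))
    return failures
```

<p>Assuming python-docx is installed, it could be called as <code>find_unreadable(docx_files, Document)</code>, where <code>docx_files</code> is the list of .docx names. If the same few files are reported on every run, the problem is likely in those files rather than in memory use.</p>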
<p>Edit: I should have done this sooner, but I have updated the post with the entire code snippet that causes the error.</p>
<pre><code>import csv
from docx import Document
from collections import Counter
import os

directorylist = os.listdir(os.getcwd())  # Set directory here
destinationcount = Counter()
destinationcountnobool = Counter()
destinationlist = ["test1", "test2", "test3", "test4", "test5"]
print(directorylist)
for filename in directorylist:
    if filename.endswith(".docx"):
        i = Document(filename)
        for destination in destinationlist:
            for paragraph in i.paragraphs:
                if destination in paragraph.text:
                    destinationcount[destination] = 1
                    break
                else:
                    destinationcount[destination] = 0
                    continue
        for destination in destinationcount:
            destinationcountnobool[destination] += destinationcount[destination]
    else:
        continue
for d in destinationcountnobool:
    print(d + " : " + str(destinationcountnobool[d]))
</code></pre>
<p><strong>Update: I have been looking into this for a while now... Python only seems to get through 118 files before hitting the same error.</strong></p>
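<p>Since the error is raised by the lxml parser that python-docx calls, one possible workaround is to skip python-docx for the search and read the main document part directly with the standard library. This is a minimal sketch under assumptions, not a drop-in replacement: <code>paragraph_texts</code> and <code>contains_word</code> are hypothetical helper names, only <code>word/document.xml</code> is read, and it has not been verified against the files that trigger the error. Unlike <code>Document.paragraphs</code>, it also picks up paragraphs nested inside tables.</p>

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(docx_path):
    """Yield the plain text of each paragraph in word/document.xml."""
    # A .docx file is a zip archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        with archive.open("word/document.xml") as part:
            root = ET.parse(part).getroot()
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every w:t element.
        yield "".join(t.text or "" for t in para.iter(W_NS + "t"))


def contains_word(docx_path, word):
    """Return True if any paragraph of the document contains `word`."""
    return any(word in text for text in paragraph_texts(docx_path))
```

<p>With this, the per-file check becomes <code>contains_word(filename, destination)</code>, and the result can be added to the counter as 0 or 1 in the same way as before.</p>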
<p><strong>Update: Solved! Well, sort of... I have posted my answer.</strong></p>