Problem extracting XML from a French Word document with Python: illegal characters are generated

import zipfile
import os
import tempfile
import shutil

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= zip.read("word/document.xml").decode("utf-8")
    return xmlString

def createNewDocx(originalDocx,xmlString,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

1 answer

User

#1 · Posted 2024-09-27 09:24:22

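The problem is that you are accidentally changing the encoding of word/document.xml in template2.docx. word/document.xml (from template.docx) is originally encoded as UTF-8, which is the default encoding for XML documents, and your code decodes it as such: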

xmlString = zip.read("word/document.xml").decode("utf-8")

However, when you write it back out for template2.docx, you change the encoding to CP-1252. According to the documentation for open():

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

You mentioned that calling locale.getpreferredencoding(False) gives you cp1252, so that is the encoding in which word/document.xml is being written.
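To check this on your own machine, here is a minimal sketch that reports which encoding open() picks in text mode when none is given. The helper name default_text_encoding is hypothetical, and the result depends on your platform, Python version and settings (for example, UTF-8 mode), so treat the output as diagnostic only.

```python
import locale
import os
import tempfile


def default_text_encoding():
    """Return the encoding open() uses in text mode when none is specified."""
    # Hypothetical helper: opens a throwaway file and inspects its encoding.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            return f.encoding
    finally:
        os.remove(path)


print("locale.getpreferredencoding(False):", locale.getpreferredencoding(False))
print("open() default encoding:", default_text_encoding())
```

If the second line prints cp1252 (or anything other than a UTF-8 name), any text file written without an explicit encoding argument will not be UTF-8.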

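Because you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the start of word/document.xml, Word (or any other XML reader) reads the file as UTF-8 rather than CP-1252, and that is what causes the illegal XML character error.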
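The mismatch can be reproduced without Word at all. The sketch below uses an arbitrary French sample word (an assumption, not text from your document) and a hypothetical helper, decodes_as_utf8, to show that bytes written as CP-1252 are not valid UTF-8 once an accented character is involved.

```python
text = "Numéro"  # sample text, chosen only for illustration

cp1252_bytes = text.encode("cp1252")  # "é" becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")     # "é" becomes the two bytes 0xC3 0xA9


def decodes_as_utf8(data):
    """Return True if the bytes form valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(cp1252_bytes, decodes_as_utf8(cp1252_bytes))
print(utf8_bytes, decodes_as_utf8(utf8_bytes))
```

Plain ASCII text is identical in both encodings, which is why the problem only shows up with documents containing accented characters, such as French ones.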

So when you open the file for writing, use the encoding argument of open() to specify UTF-8:

with open(os.path.join(tmpDir,"word/document.xml"),"w",encoding="utf-8") as f:
    f.write(xmlString)
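For context, here is a sketch of how the whole round trip might look with the encoding fixed. The snake_case names get_xml and create_new_docx are my renaming of the functions in the question, and the context managers and try/finally cleanup are additions of mine rather than part of the original code.

```python
import os
import shutil
import tempfile
import zipfile


def get_xml(docx_filename):
    """Read word/document.xml from a .docx file and decode it as UTF-8."""
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def create_new_docx(original_docx, xml_string, new_filename):
    """Copy original_docx to new_filename, replacing word/document.xml."""
    tmp_dir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(original_docx) as archive:
            archive.extractall(tmp_dir)
            filenames = archive.namelist()

        # The actual fix: write the XML explicitly as UTF-8 instead of
        # relying on the platform's default text encoding.
        xml_path = os.path.join(tmp_dir, "word", "document.xml")
        with open(xml_path, "w", encoding="utf-8") as f:
            f.write(xml_string)

        with zipfile.ZipFile(new_filename, "w", zipfile.ZIP_DEFLATED) as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)
    finally:
        # Remove the temporary directory even if something above fails.
        shutil.rmtree(tmp_dir)
```

Usage would be along the lines of create_new_docx("template.docx", get_xml("template.docx"), "template2.docx"), using the file names from the question.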
