off2txt: Extract Text from Office Files
Detailed description of the off2txt Python project
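Extract ASCII/Unicode text from Office files to separate files.
This is useful when a document contains two languages (such as English and Chinese) and you want to separate the languages for further processing and analysis.
The Office Open XML file formats are supported, i.e. docx, pptx, and xlsx.
Word and PowerPoint files are extracted to text files. Excel files are extracted to CSV files, preserving columns.
It can also be used to generate CSV files from Excel files without opening Excel.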
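The Office Open XML formats are ZIP archives of XML parts, which is why text can be read from them without the Office applications. The following is a minimal sketch of that idea for a .docx file, using only the Python standard library. It is not off2txt's actual implementation, and the function name docx_paragraphs is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        # The body of a Word document is stored in the word/document.xml part.
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over one or more w:t elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

The sketch ignores headers, footers, footnotes, and tables of contents stored in other parts of the archive; a real extractor would need to decide how to handle those.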
Examples
Extract ASCII and Unicode text from a Word document:
$ off2txt -s word.docx
The above generates two files: word-ascii.txt and word-unicode.txt.
Extract ASCII and Unicode text from an Excel document:
$ off2txt -s excel.xlsx
The above generates two files: excel-ascii.csv and excel-unicode.csv.
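The naming pattern in these examples can be sketched in Python as follows. This is an illustration derived from the file names shown above and the documented defaults of the -a, -u, and -e options, not the tool's own code; the helper name split_output_names is hypothetical.

```python
from pathlib import Path


def split_output_names(input_file, ascii_id="ascii", unicode_id="unicode", extension=None):
    """Return (ascii_name, unicode_name) for an input file (hypothetical helper)."""
    path = Path(input_file)
    if extension is None:
        # Documented defaults: csv for Excel, txt for Word and PowerPoint.
        extension = "csv" if path.suffix.lower() == ".xlsx" else "txt"
    return (
        f"{path.stem}-{ascii_id}.{extension}",
        f"{path.stem}-{unicode_id}.{extension}",
    )
```

For example, split_output_names("word.docx") returns ("word-ascii.txt", "word-unicode.txt"). The sketch returns bare file names only and does not model the -d or -o options.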
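Notes
If an extracted file would be empty, it is not created.
Excel is different because columns are preserved, so a CSV file may contain empty columns. Cells are streamed to the ASCII file if they contain only ASCII; otherwise they are streamed to the Unicode file.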
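The per-cell rule for Excel files can be illustrated with a short Python sketch. It assumes each cell has already been read as a string and that a cell routed to one file leaves an empty field in the other so that column positions line up; the function name split_row is hypothetical and this is not taken from off2txt's source.

```python
def split_row(row):
    """Split one row of cell strings into (ascii_row, unicode_row).

    Column positions are kept: a cell that goes to one output leaves an
    empty string at the same index in the other output.
    """
    ascii_row, unicode_row = [], []
    for cell in row:
        if cell.isascii():  # True for ASCII-only and for empty cells
            ascii_row.append(cell)
            unicode_row.append("")
        else:
            # Any non-ASCII character sends the whole cell to the Unicode side.
            ascii_row.append("")
            unicode_row.append(cell)
    return ascii_row, unicode_row
```

With the row ["id", "名前", "42"], the sketch returns (["id", "", "42"], ["", "名前", ""]), which shows how an empty column can appear in either CSV file.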
Usage
usage: off2txt [options] File [File ...]

off2txt: extract ASCII/Unicode text from Office files to separate files

positional arguments:
  File                  Files to extract from

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --debug               Turn on debug logging.
  --debug-log FILE      Save debug logging to FILE.
  -a EXTENSION, --ascii EXTENSION
                        Identifier to append to input file name to make ASCII
                        output file name when splitting Unicode and ASCII
                        text. Default ascii.
  -d DIRECTORY, --directory DIRECTORY
                        Save extracted text to DIRECTORY. Ignored if the -o
                        option is given.
  -e EXTENSION, --extension EXTENSION
                        Extension to use for extracted text files. Default
                        for Word and PowerPoint is txt. Default for Excel is
                        csv.
  -o FILE, --output FILE
                        Save extracted text to FILE. If not given, the output
                        file is named the same as the input file but with a
                        txt extension. The extension can be changed with the
                        -e option. Files are opened in append mode unless the
                        -X option is given.
  -s, --split           Split ASCII and Unicode text into two separate files.
                        Unicode files are named by adding -unicode before the
                        file extension. The Unicode identifer can be changed
                        with the -u option.
  -u EXTENSION, --unicode EXTENSION
                        Identifier to append to input file name to make
                        Unicode output file name when splitting Unicode and
                        ASCII text. Default unicode.
  -A, --suppress-file-access-errors
                        Do not print file/directory access errors.
  -X, --overwrite-output-files
                        Truncate output files before writing.
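When off2txt needs to be driven from a script, one option is to assemble the command line from the options listed above and run it as a subprocess. The sketch below assumes the off2txt executable is installed and on PATH; the helper names build_off2txt_command and run_off2txt are hypothetical, and only the -s, -d, and -X options from the usage text are covered.

```python
import subprocess


def build_off2txt_command(files, split=False, directory=None, overwrite=False):
    """Build an off2txt argument list from documented options (hypothetical helper)."""
    command = ["off2txt"]
    if split:
        command.append("-s")  # split ASCII and Unicode text into two files
    if directory is not None:
        command.extend(["-d", str(directory)])  # save extracted text to DIRECTORY
    if overwrite:
        command.append("-X")  # truncate output files instead of appending
    command.extend(str(f) for f in files)
    return command


def run_off2txt(files, **options):
    """Run off2txt; assumes the executable is installed and on PATH."""
    # check=True raises CalledProcessError if the process exits with a non-zero status.
    return subprocess.run(build_off2txt_command(files, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.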