How to correctly extract Arabic/Persian (RTL) text from a docx file


Posted 2024-10-02 14:28:27



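I am trying to extract a large amount of text from some docx files and store it in .txt files.

The languages I work with are Persian and Arabic (both are written right to left), so I am having trouble with python-docx. I cannot extract the text in its proper form; it all comes out jumbled in the .txt file.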

Extracted form: https://pasteboard.co/Id8jj7g.jpg

Original form: https://pasteboard.co/Id8jv1i.jpg

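import docx

doc = docx.Document('1.docx')

# Number of paragraphs found in the document
print(len(doc.paragraphs))

# Write one paragraph per line; the with-block closes and flushes the file
with open('data.txt', 'w', encoding='utf8') as text_file:
    for paragraph in doc.paragraphs:
        text_file.write(paragraph.text + '\n')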
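
One way to narrow this down is to check whether the extracted characters themselves are out of order, or whether only the display is affected. Unicode text is normally stored in logical (reading) order, and the viewer decides the visual direction. The sketch below uses only the standard library and lists the first few code points with their bidirectional class. The helper name `describe_chars` and the sample string are illustrative assumptions, not part of the original code.

```python
import unicodedata


def describe_chars(text, limit=10):
    """Return (code point, bidi class) pairs for the first `limit` characters."""
    return [
        ("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
        for ch in text[:limit]
    ]


# Hypothetical sample; in practice, pass a line read back from data.txt.
sample = "سلام دنیا"
for code_point, bidi_class in describe_chars(sample):
    # "AL" marks Arabic-script letters, "WS" marks whitespace.
    print(code_point, bidi_class)
```

If the code points appear in reading order, the file content is likely intact, and the problem may lie in how the editor or viewer renders right-to-left text.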


1 Answer

User

#1 · Posted 2024-10-02 14:28:27

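First, the proper form needs to be defined. If you are working on an NLP project, you need the sentences and every word in each sentence. I think the following code is helpful for extracting text from a docx file. (Python 2.7)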

# Install the library first: pip install docxpy
import codecs
import docxpy

# Input file: Input.docx
input_path = 'Input.docx'

# Extract the text from the file
text = docxpy.process(input_path)

# Save the extracted text to a UTF-8 text file
with codecs.open('Input.txt', 'w', 'utf-8') as output_txt:
    output_txt.write(text)

For details, read the docxpy documentation: docxpy website
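
For the NLP case mentioned above, the extracted text can then be broken into sentences and words. The following is a minimal sketch in Python 3, not a full tokenizer. It assumes that sentences end with '.', '!', '?', the Arabic question mark '؟', or a line break, and that words are separated by whitespace. The function names are hypothetical.

```python
import re

# Assumed sentence terminators: Latin punctuation, the Arabic question
# mark, and line breaks. Real corpora may need more rules than this.
SENTENCE_END = re.compile("[.!?؟\n]+")


def split_sentences(text):
    """Split text into stripped, non-empty sentences."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def sentences_and_words(text):
    """Return a list of sentences, each given as a list of words."""
    return [sentence.split() for sentence in split_sentences(text)]


# Hypothetical sample standing in for the text extracted from a docx file.
sample = "سلام دنیا. حال شما چطور است؟\nخوبم"
tokens = sentences_and_words(sample)
print(len(tokens))
```

Splitting on whitespace keeps the zero-width non-joiner (U+200C), which is common inside Persian words, as part of the word instead of breaking on it.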
