How do I read non-English text from a text file and print it with Python?

f = open('C:/python programs/hafez.txt')
wordDict = {}
for line in f:
    wordList = line.strip().split(' ')
    for word in wordList:
        if word not in wordDict:
            wordDict[word] = 1
        else:
            wordDict[word] = wordDict[word]+1
print((str(wordDict)))

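3 Answers

User

#1 · edited 2024-09-30 02:20:38

In general, you can save the txt file as UTF-8 and put # -*- coding: utf-8 -*- at the top of the py file. Note that this declaration only tells Python how the py file itself is encoded; to read UTF-8 from the txt file you still have to open it with that encoding.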
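A minimal sketch of that idea, assuming the text file really is saved as UTF-8. The sample string, the temporary file and the helper name read_text are made up for illustration; io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# The line above only describes how THIS source file is encoded.
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the file's contents decoded with the given encoding."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo: write a small UTF-8 file (hypothetical sample text), then read it back.
sample = u"حافظ شیرازی\n"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(sample)

text = read_text(demo_path)
os.remove(demo_path)
print(text)
```

Whether the printed text displays correctly still depends on the terminal being able to show those characters.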

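User

#2 · edited 2024-09-30 02:20:38

There are several ways to handle this, but the simplest is probably codecs.open(). (I'm assuming you are on Python 2.7, because of Counter and a few other tricks such as with.)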

import codecs
from collections import Counter
wordDict = Counter()

with codecs.open('C:/python programs/hafez.txt','r',encoding='cp720') as f:
    for line in f:
        wordDict.update(line.strip().split())

for word, count in wordDict.most_common(): 
    print word, count

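In Python 3 you need parentheses with print (it is a function in Python 3 but a statement in Python 2), and you don't need to import codecs, because the built-in open() supports different encodings.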
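For comparison, here is roughly what the same count could look like in Python 3. This is only a sketch: count_words is a name I made up, and the demo writes a throwaway UTF-8 file rather than touching the asker's hafez.txt, for which you would pass the real path and encoding instead.

```python
import os
import tempfile
from collections import Counter


def count_words(path, encoding):
    """Count whitespace-separated words in a text file."""
    counts = Counter()
    # The built-in open() takes the encoding directly in Python 3.
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            counts.update(line.split())
    return counts


# Demo on a temporary file with made-up sample text.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("گل بلبل گل\nبلبل گل\n")

word_counts = count_words(demo_path, "utf-8")
os.remove(demo_path)

for word, count in word_counts.most_common():
    print(word, count)
```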

If your encoding is not code page 720, replace that option with the name of the encoding your file actually uses.
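If you are not sure which encoding the file uses, one rough approach is to try a few candidates on the raw bytes and see which ones decode without raising. The sketch below is an assumption-laden illustration (decodable_encodings is a made-up helper and the candidate list is arbitrary). Decoding without an error does not prove a guess is right, since many single-byte code pages accept almost any byte, so you still have to look at the result.

```python
def decodable_encodings(raw_bytes, candidates):
    """Return the candidate encodings that decode raw_bytes without error."""
    usable = []
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes invalid in this encoding.
            # LookupError: Python does not know this encoding name.
            continue
        usable.append(name)
    return usable


# Stand-in for the real data; with a file you would use open(path, "rb").read().
raw = "سلام".encode("utf-8")
usable = decodable_encodings(raw, ["ascii", "utf-8", "cp1256", "cp720"])
print(usable)
```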

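This is a good opportunity to learn about encodings. While I agree with Joel that no programmer should pretend that we live in a US English / ASCII world, encoding issues become especially relevant when you regularly work with a non-Latin alphabet. (Besides, ASCII is not even enough for English: many English words are borrowings that keep their accents, among other problems.) Good starting points are Joel's article (The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)), Pragmatic Unicode (which covers the Unicode sandwich), and, to make it easier to produce said sandwich in Python 2, a module whose name did not survive in this copy of the answer. There is also a HOWTO in the Python documentation, which is easier to follow once you have read the other articles.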
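The "Unicode sandwich" mentioned above boils down to: decode bytes as soon as they come in, work only with text in the middle, and encode again only when writing out. A small Python 3 sketch of that shape follows; word_report and the choice of UTF-16 input and UTF-8 output are my own assumptions, picked only to make the two edges visible.

```python
from collections import Counter


def word_report(raw, in_encoding, out_encoding="utf-8"):
    """Decode bytes, count words on text, and encode the report as bytes."""
    text = raw.decode(in_encoding)                 # input edge: bytes -> str
    counts = Counter(text.split())                 # middle: text only
    lines = ["%s\t%d" % (word, n) for word, n in counts.most_common()]
    return "\n".join(lines).encode(out_encoding)   # output edge: str -> bytes


# Made-up input bytes, as if they had been read from a UTF-16 file.
raw_input_bytes = "گل بلبل گل".encode("utf-16")
report = word_report(raw_input_bytes, "utf-16")
print(report.decode("utf-8"))
```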

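If you decide to go fully Python 3, you can simply pick your exact version from the drop-down list at the top of the documentation page. The BDFL's summary of the differences between Python 2 and 3 also has some information about Unicode and how it's handled differently in Python 2 and 3.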

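User

#3 · edited 2024-09-30 02:20:38

Consider using Python's Counter, a dict subclass, to count the word occurrences.

As for the text, strings in Python 2.7 are not unicode by default. Read: http://docs.python.org/2/howto/unicode.html

You can use: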

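# Python 2: decode each byte-string key explicitly. UTF-8 is an assumption
# here; use the file's real encoding. A bare unicode(i) would decode as ASCII
# and fail on non-ASCII words.
for i, j in wordDict.iteritems():
    print unicode(i, 'utf-8'), j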
