Most frequent words in French

Posted 2024-09-23 22:21:32


I am using the Python nltk package to find the most frequent words in a French text, but it does not work as I expect. Here is my code:

#-*- coding: utf-8 -*-

#nltk: package for text analysis
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import nltk
import tokenize
import codecs
import unicodedata


#output French accents correctly
def convert_accents(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')



### MAIN ###

#openfile
text_temp=codecs.open('text.txt','r','utf-8').readlines()

#put content in a list
text=[]
for word in text_temp:
    word=word.strip().lower()
    if word!="":
        text.append(convert_accents(word))

#tokenize the list
text=nltk.tokenize.word_tokenize(str(text))

#use FreqDist to get the most frequents words
fdist = FreqDist()
for word in  text:
    fdist.inc( word )
print "BEFORE removing meaningless words"
print fdist.items()[:10]

#use stopwords to remove articles and other meaningless words
for sw in stopwords.words("french"):
     if fdist.has_key(sw):
          fdist.pop(sw)
print "AFTER removing meaningless words"
print fdist.items()[:10]

The output is as follows:

BEFORE removing meaningless words
[(',', 85), ('"', 64), ('de', 59), ('la', 47), ('a', 45), ('et', 40), ('qui', 39), ('que', 33), ('les', 30), ('je', 24)]
AFTER removing meaningless words
[(',', 85), ('"', 64), ('a', 45), ('les', 30), ('parce', 15), ('veut', 14), ('exigence', 12), ('aussi', 11), ('pense', 11), ('france', 10)]
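
Part of the punctuation in the counts above may come from the code itself rather than from the speech: `str(text)` turns the whole list into its printed representation, so the brackets, quotes and separating commas of the list become tokens too. A small illustration, assuming a made-up two-line sample (not the actual speech) and a plain regular expression as a stand-in for `word_tokenize`:

```python
import re

# Hypothetical sample lines standing in for the cleaned lines of text.txt.
lines = ["la france", "l'exigence de la france"]

def simple_tokens(s):
    # Stand-in tokenizer (an assumption, not nltk's word_tokenize):
    # runs of word characters, or any single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", s)

# What the original code does: tokenize the printed form of the list.
from_repr = simple_tokens(str(lines))

# Joining the lines first keeps only what is really in the text.
from_join = simple_tokens(" ".join(lines))

print(from_repr.count(","), from_join.count(","))
print("[" in from_repr, "[" in from_join)
```

The sample contains no comma at all, yet the first token list has one, along with brackets and quotes. Joining the lines with `" ".join(...)` before tokenizing avoids this.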

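My problem is that stopwords does not discard all the meaningless words. For example, ',' is not a word and should be removed, and 'les' is an article and should be removed as well.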

How can I fix this?

The text I am using can be found on this page: http://www.elysee.fr/la-presidence/discours-d-investiture-de-nicolas-sarkozy/
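
For reference, a minimal sketch of one possible direction, written for Python 3 and using only the standard library so that it runs without NLTK data. `collections.Counter` stands in for `FreqDist`, and the tiny `FRENCH_STOPWORDS` set is a hypothetical placeholder for a real list such as `stopwords.words("french")`. It illustrates two ideas: dropping tokens that contain no letters, and stripping accents from the stopwords the same way as from the text, since otherwise an accented stopword like 'à' never matches the unaccented token 'a'. Whether a word such as 'les' is present in a particular stopword list has to be checked against that list; here it is simply added by hand.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny stopword list; in practice this would be
# something like nltk.corpus.stopwords.words("french"), extended as needed.
FRENCH_STOPWORDS = {"de", "la", "le", "les", "l", "et", "qui", "que", "je", "à"}

def strip_accents(text):
    # Decompose accented characters and drop the combining marks,
    # returning a str (not bytes) so it stays comparable in Python 3.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def most_frequent_words(text, stopwords, n=10):
    # Normalize the stopwords exactly like the text.
    normalized_stopwords = {strip_accents(w.lower()) for w in stopwords}
    # Keep only runs of letters, which discards punctuation tokens.
    tokens = re.findall(r"[^\W\d_]+", strip_accents(text.lower()))
    counts = Counter(t for t in tokens if t not in normalized_stopwords)
    return counts.most_common(n)

sample = "La France, je pense à la France et à l'exigence. Les Français, l'exigence !"
print(most_frequent_words(sample, FRENCH_STOPWORDS, n=3))
```

With the real speech, the same function could be called on the contents of `text.txt` read as a single string, passing the NLTK list plus any extra words that turn out to be missing from it.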


0 answers

No answers yet.
