Stemming Indonesian words with Sastrawi - Q&A - Python中文网


Posted 2024-07-02 12:08:20


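I have a CSV dataset whose values look like this: [image]

I want to preprocess this data. The data is text, so I will use text mining, but I am not sure how to go about it. I tried stemming the data, and the result is a word count over all the news items together. I got the code from a friend as a reference, and I want to change it to improve the result: I want a word count for each news item, not one count for all the news combined. Please help me modify the code.

Here is the code: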

import os
import pandas as pd
from pandas import DataFrame, read_csv

data = r'D:/SKRIPSI/sample_200_data.csv'
df = pd.read_csv(data)

print "DF", type(df['content']), "\n", df['content']
isiberita = df['content'].tolist()
print "DF list isiberita ", isiberita, type(isiberita)
df.head()
---------------------------------------------------------
import nltk
import string
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.corpus import stopwords
from collections import Counter

path = 'D:/SKRIPSI/sample_200_data.csv'
token_dict = {}
factory = StemmerFactory()
stemmer = factory.create_stemmer()

content_stemmed = map(lambda x: stemmer.stem(x), isiberita)
content_no_punc = map(lambda x: x.lower().translate(None, string.punctuation), content_stemmed)
content_final = []
for news in content_no_punc:
    word_token = nltk.word_tokenize(news) # get word token for every news (split news into each separate words)
    word_token = [word for word in word_token if not word in nltk.corpus.stopwords.words('indonesian') and not word[0].isdigit()] # remove indonesian stop words and number
    content_final.append(" ".join(word_token))

counter = Counter() # counter initiate
[counter.update(news.split()) for news in content_final] # we split every news to get counter of each words
print(counter.most_common(100))

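The result of this code is:

I hope someone can help me change the code so that I get the word count of each news item (content), not the total word count across all the news. Thank you.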


1 answer

User

#1 · Posted 2024-07-02 12:08:20

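If I understand correctly, your problem is not directly related to PySastrawi.

The problem is that counter.update() is called on one shared Counter while looping over the news data, so the final result is the cumulative word count across all news items. To count the words of each news item separately, create a separate Counter instance for each item, like this (it prints the 100 most common words of each news item):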

for news in content_final:
    counter = Counter(news.split()) # counter initiate
    print(counter.most_common(100))
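
To make the difference concrete, here is a minimal, self-contained sketch using only the standard library. The strings in `content_final` below are made-up stand-ins for the preprocessed news items, not the asker's data, and the names `cumulative` and `per_news` are only for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the preprocessed news items (not the real dataset).
content_final = [
    "harga beras naik harga cabai naik",
    "tim nasional menang tim lawan kalah",
]

# One shared Counter accumulates counts across every news item.
cumulative = Counter()
for news in content_final:
    cumulative.update(news.split())

# A fresh Counter per news item keeps the counts separate.
per_news = [Counter(news.split()) for news in content_final]

print(cumulative.most_common(3))
for index, counts in enumerate(per_news):
    print(index, counts.most_common(3))
```

The shared counter mixes words from both items, while each entry of `per_news` only knows about its own news item.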

A complete demo example is available live at https://eval.in/664688
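
Note that the code in the question is written for Python 2 (`print` statements and `translate(None, string.punctuation)`). As a rough sketch only, the same preprocessing could be wrapped in a Python 3 function that returns one Counter per news item. The function name `count_words_per_news`, the sample sentences, and the tiny stop word set are hypothetical. Tokenizing here uses a plain `split()` rather than `nltk.word_tokenize`, and `stem` defaults to a no-op so the example runs without Sastrawi installed.

```python
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (the Python 3 counterpart
# of the Python 2 call `translate(None, string.punctuation)`).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def count_words_per_news(texts, stem=lambda text: text, stop_words=frozenset()):
    """Return one Counter per news item instead of a single cumulative one."""
    results = []
    for text in texts:
        cleaned = stem(text).lower().translate(PUNCT_TABLE)
        tokens = [
            word
            for word in cleaned.split()  # whitespace split, not nltk.word_tokenize
            if word not in stop_words and not word[0].isdigit()
        ]
        results.append(Counter(tokens))
    return results


# Hypothetical sample data and stop words, for illustration only.
sample_news = [
    "Harga beras naik, harga cabai naik di 2016.",
    "Tim itu menang dan menang lagi!",
]
sample_stop_words = {"di", "dan", "itu"}

for counts in count_words_per_news(sample_news, stop_words=sample_stop_words):
    print(counts.most_common(100))
```

With the question's setup you would presumably pass `stem=stemmer.stem` and the NLTK Indonesian stop word list as `stop_words`, then call it on `isiberita`.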
