Word frequency excluding punctuation

,          88144
.          49109
dan        37283
di         33701
yang       29353
-LRB-      19843
-RRB-      19736
''         15906
``         15232
dengan     15231
pada       15021
dari       14900
tahun      13079
sebagai    9038
ini        8371
untuk      8297
dalam      8266
adalah     7950
menjadi    7414
oleh       5974

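3 answers

User

#1 · edited 2024-09-26 22:50:57

Try the following code:

It assumes you are not interested in punctuation-only tokens. A token that contains both punctuation and word characters is still counted.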

dataset['token'][dataset['token'].str.contains(r'\w+')].value_counts()[:20]
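One caveat with the `\w+` filter: bracket placeholders such as `-LRB-` and `-RRB-`, which appear in the counts in the question, contain letters and therefore still pass it. Below is a minimal sketch that also drops them. It assumes the tokens live in a `token` column as above; the sample data and the `BRACKET_TOKENS` set are made up for illustration and assume the Penn Treebank naming convention.

```python
import pandas as pd

# Made-up sample tokens for illustration; the real data is in dataset['token'].
dataset = pd.DataFrame({
    "token": ["dan", ",", "di", ".", "dan", "-LRB-", "yang",
              "-RRB-", "``", "dan", "di", "''"]
})

# Assumption: bracket placeholders follow the Penn Treebank convention.
BRACKET_TOKENS = {"-LRB-", "-RRB-", "-LSB-", "-RSB-", "-LCB-", "-RCB-"}

tokens = dataset["token"]
has_word_char = tokens.str.contains(r"\w", regex=True)  # letter, digit, or underscore
is_bracket = tokens.isin(BRACKET_TOKENS)

# Keep tokens with a word character that are not bracket placeholders.
top_words = tokens[has_word_char & ~is_bracket].value_counts().head(20)
print(top_words)
```

On this sample only `dan`, `di`, and `yang` remain, with `dan` the most frequent.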

User

#2 · edited 2024-09-26 22:50:57

Pure Python solution:

Remove everything from the text except letters and spaces
Feed all the words to a Counter
Ask it for the 20 most common words

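from collections import Counter

def get_most_common(text, n=20):
    # Keep letters and whitespace only; punctuation and digits are dropped.
    # isspace() also keeps newlines, so words on adjacent lines are not glued together.
    processed_text = "".join(t for t in text.lower() if t.isalpha() or t.isspace())
    # Count whitespace-separated words and return the n most frequent.
    counter = Counter(processed_text.split())
    return counter.most_common(n)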
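If the data is already tokenized, one token per row as in the question, the same Counter approach can work on the token list directly instead of on raw text. A minimal sketch follows; `count_word_tokens` and the sample list are hypothetical and only for illustration.

```python
from collections import Counter

def count_word_tokens(tokens, n=20):
    """Return the n most common tokens that consist only of letters."""
    # str.isalpha() is False for ",", ".", "``", "''" and also for "-LRB-",
    # because the hyphens are not letters.
    words = (tok.lower() for tok in tokens if tok.isalpha())
    return Counter(words).most_common(n)

# Made-up sample tokens for illustration.
sample = ["Dan", ",", "di", ".", "dan", "-LRB-", "yang", "-RRB-", "dan", "di"]
result = count_word_tokens(sample, n=2)
print(result)  # [('dan', 3), ('di', 2)]
```

Note that this filter also drops tokens containing digits, such as years; use `tok.isalnum()` instead if those should be counted.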

User

#3 · edited 2024-09-26 22:50:57

You need to preprocess the words before counting. For example:

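import re

import pandas as pd

# Raw strings avoid invalid-escape warnings in the patterns.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def clean_text(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # every other symbol is dropped
    return text

ft = pd.read_json(path_of_file)  # path_of_file: path to your JSON file
# Apply per value on the text column; DataFrame.apply would pass whole columns.
# 'token' is assumed here, following the column name used in the first answer.
ft['token'] = ft['token'].apply(clean_text)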
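Cleaning alone does not produce the counts: punctuation-only tokens turn into empty strings that still have to be filtered out before counting. The sketch below shows that last step under stated assumptions: the sample Series is made up, and `clean_text` here is a simplified stand-in for the cleaning function above.

```python
import re

import pandas as pd

BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")

def clean_text(text):
    # Simplified cleaning step: lowercase, drop unwanted symbols, trim spaces.
    return BAD_SYMBOLS_RE.sub("", text.lower()).strip()

# Made-up sample tokens for illustration.
tokens = pd.Series(["Dan", ",", "di", ".", "dan", "``", "yang", "dan"], name="token")

cleaned = tokens.apply(clean_text)
cleaned = cleaned[cleaned != ""]           # punctuation-only tokens are now empty
counts = cleaned.value_counts().head(20)   # 20 most frequent remaining words
print(counts)
```

With this kind of cleaning, a token like `-LRB-` becomes `lrb` rather than disappearing, so bracket placeholders would still need separate handling.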

Good luck.
