How do I count words in strings read from files?

import os
import collections

path = 'a\\path\\'
listing = os.listdir(path)
unwanted_chars = ".,-_/()*"
vocab = set()  # unique words across all files

for file in listing:
    # print('Current file : ', file)
    pos_review = open(path + file, "r", encoding='utf8')
    words = pos_review.read().split()
    # print(type(words))
    vocab.update(words)
    pos_review.close()

print(vocab)
pos_dict = dict.fromkeys(vocab, 0)
print(pos_dict)

3 Answers

User
#1 · Edited 2024-05-20 14:17:19

This works as well:

import glob

import pandas as pd

files = glob.glob('test*.txt')
txts = []
for f in files:
    with open(f, 'r') as t:
        txts.append(t.read())  # whole file as one string

texts = ' '.join(txts)
df = pd.DataFrame({'words': texts.split()})
out = df.words.value_counts().to_dict()  # word -> count

User
#2 · Edited 2024-05-20 14:17:19

Hope this helps:

import os

path = 'a\\path\\'
listing = os.listdir(path)

whole = []  # every word from every file, duplicates included
for file in listing:
    # print('Current file : ', file)
    with open(path + file, "r", encoding='utf8') as pos_review:
        whole.extend(pos_review.read().split())

d = {}  # word -> count
for item in whole:
    if item in d:
        d[item] += 1  # update count
    else:
        d[item] = 1  # first occurrence

print(d)
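The explicit membership test in the counting loop can also be avoided. A minimal sketch, assuming the words have already been collected into a list; the `whole` list here is made-up sample data rather than the contents of real files:

```python
from collections import defaultdict

# Hypothetical sample standing in for the words read from the files
whole = ["a", "quick", "fox", "a", "fox", "a"]

counts = defaultdict(int)  # missing keys start at 0
for item in whole:
    counts[item] += 1

print(dict(counts))
```

This is only a variation on the same loop; the result is the same mapping of word to count.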

User
#3 · Edited 2024-05-20 14:17:19

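Using collections.Counter:

Counter is a dict subclass for counting hashable objects; pass it an iterable, such as a list of words, and it tallies each item.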
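As a quick illustration of how Counter tallies items, here is a small sketch on an in-memory list (the sample words are made up for the example):

```python
from collections import Counter

sample = "a quick fox a quick boy".split()

counts = Counter(sample)        # count each word in the iterable
counts.update(["fox", "ran"])   # update() adds to the existing counts

print(counts["quick"])    # 2
print(counts["fox"])      # 2
print(counts["missing"])  # 0, a missing key does not raise KeyError
```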

Data

Given 3 files named t1.txt, t2.txt & t3.txt
Each file contains the following 3 lines of text:

file1 txt A quick brown fox.
file2 txt a quick boy ran.
file3 txt fox ran away.

Code:

Get the files:

Use pathlib to find the files:

from pathlib import Path

files = list(Path('e:/PythonProjects/stack_overflow/t-files').glob('t*.txt'))
print(files)

# Output
[WindowsPath('e:/PythonProjects/stack_overflow/t-files/t1.txt'),
 WindowsPath('e:/PythonProjects/stack_overflow/t-files/t2.txt'),
 WindowsPath('e:/PythonProjects/stack_overflow/t-files/t3.txt')]

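Collect the word counts:

Create a separate function, clean_string, to clean each line of text
str.lower converts the text to lowercase
str.translate, str.maketrans & string.punctuation provide a highly optimized way to remove punctuation
From "Best way to strip punctuation from a string"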

from collections import Counter
import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    value = value.split()
    return value

words = Counter()
for file in files:
    with file.open('r') as f:
        lines = f.readlines()
        for line in lines:
            line = clean_string(line)
            words.update(line)

print(words)

# Output
Counter({'file1': 3,
         'txt': 9,
         'a': 6,
         'quick': 6,
         'brown': 3,
         'fox': 6,
         'file2': 3,
         'boy': 3,
         'ran': 6,
         'file3': 3,
         'away': 3})
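
To reproduce these counts without pre-existing files, the following sketch writes the three sample files to a temporary directory and then runs the same counting loop. The temporary directory and the `SAMPLE` constant are assumptions made for the example; the original answer reads from a fixed folder instead.

```python
import string
import tempfile
from collections import Counter
from pathlib import Path

# The three lines each sample file contains
SAMPLE = (
    "file1 txt A quick brown fox.\n"
    "file2 txt a quick boy ran.\n"
    "file3 txt fox ran away.\n"
)

def clean_string(value: str) -> list:
    """Lowercase, strip punctuation, and split into words."""
    table = str.maketrans('', '', string.punctuation)
    return value.lower().translate(table).split()

words = Counter()
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Write three identical sample files: t1.txt, t2.txt, t3.txt
    for i in range(1, 4):
        (folder / f"t{i}.txt").write_text(SAMPLE, encoding="utf8")

    for file in sorted(folder.glob("t*.txt")):
        with file.open("r", encoding="utf8") as f:
            for line in f:
                words.update(clean_string(line))

print(words["txt"], words["fox"], words["away"])  # 9 6 3
```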

List of `words`:

list_words = list(words.keys())
print(list_words)

>>> ['file1', 'txt', 'a', 'quick', 'brown', 'fox', 'file2', 'boy', 'ran', 'file3', 'away']
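
Since the question also builds a dictionary keyed by the vocabulary, it may help to note that a Counter can be queried or converted directly. A sketch, assuming a Counter holding the counts shown in the output above (rebuilt by hand here so the example runs on its own):

```python
from collections import Counter

# Counts copied from the output above
words = Counter({'file1': 3, 'txt': 9, 'a': 6, 'quick': 6, 'brown': 3,
                 'fox': 6, 'file2': 3, 'boy': 3, 'ran': 6, 'file3': 3,
                 'away': 3})

print(words.most_common(1))       # [('txt', 9)]
vocab = sorted(words)             # alphabetical list of unique words
plain = dict(words)               # ordinary dict with the same counts
print(len(vocab), plain['ran'])   # 11 6
```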
