Set of words in a text file

import string

filecontent = []
word_set = {}
with open("small.txt") as myFile:
    for line in myFile:
        line = line.rstrip()
        line = line.replace("\t", "")
        for character in line:
            if character in string.digits or character in string.punctuation:
                line = line.replace(character, "")
        if line != "":
            filecontent.append(line)
lowerCase = [x.lower() for x in filecontent]
word_set = {word for line in lowerCase for word in line.split()}

3 answers

User

#1 · edited 2024-05-03 04:26:21

You can do it like this:

>>> from string import punctuation
>>> def solve(s):
...     for line in s.splitlines():
...         for word in line.split():
...             word = word.strip(punctuation)
...             if word.translate(None, punctuation).isalpha():
...                 yield word
...
>>> s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**

*These# eBooks@ Were Prepared By Thousands of Volunteers!'''
>>> set(solve(s))
set(['and', 'Both', 'Since', 'These', 'Readable', 'Computers', 'Humans', 'Prepared', 'of', 'Were', 'Volunteers', 'Thousands', 'By', 'eBooks'])

The session above is Python 2. If you are using Python 3, you need to replace the str.translate part with:

[code snippet not preserved in the source]
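One possible Python 3 adaptation, which may differ from what the answerer originally posted: in Python 3, `str.translate` takes a single mapping argument, so the two-argument call `word.translate(None, punctuation)` raises a `TypeError`. A minimal sketch, assuming a deletion table built with `str.maketrans` (the name `_DELETE_PUNCT` is made up for this example):

```python
from string import punctuation

# Assumption: a table that maps every punctuation character to None (delete it).
_DELETE_PUNCT = str.maketrans("", "", punctuation)


def solve(s):
    """Yield each word of s that is purely alphabetic once punctuation is removed."""
    for line in s.splitlines():
        for word in line.split():
            # Trim punctuation from both ends, e.g. '*eBooks$' becomes 'eBooks'
            word = word.strip(punctuation)
            # Drop inner punctuation only for the alphabetic check
            if word.translate(_DELETE_PUNCT).isalpha():
                yield word


s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!'''

words = set(solve(s))
print(sorted(words))
```

As in the Python 2 session, `1971` is skipped because it is not alphabetic, and the original capitalization of each word is kept.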

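User

#2 · edited 2024-05-03 04:26:21

Here is a solution using the re (regex) module. It also gives you a word count, but if you don't want that, you can use just the keys or switch to a set.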

text = """*eBooks$ Readable By Both Humans and By Computers, Since 1971**

*These# eBooks@ Were Prepared By Thousands of Volunteers!"""

import re
from collections import Counter

words = Counter()
regex = re.compile(r"[a-zA-Z]+")  # runs of ASCII letters only

# Count each word case-insensitively
for match in regex.findall(text):
    words[match.lower()] += 1

print(words)

Or, if you have a file:

[code snippet not preserved in the source]
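A rough sketch of how the same counting could be applied to a file, reading it line by line. This is not the answerer's code: the helper name `count_words` is hypothetical, and the demo writes the sample text to a temporary `small.txt` (the file name used in the question) only so that the example is self-contained.

```python
import os
import re
import tempfile
from collections import Counter

WORD_RE = re.compile(r"[a-zA-Z]+")  # runs of ASCII letters only


def count_words(path):
    """Count lowercase alphabetic words in the file at path, one line at a time."""
    words = Counter()
    with open(path) as f:
        for line in f:
            words.update(match.lower() for match in WORD_RE.findall(line))
    return words


# Demo: write the sample text to a temporary file, then count its words.
sample = """*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small.txt")
    with open(path, "w") as f:
        f.write(sample)
    counts = count_words(path)

print(counts)
```

With the `[a-zA-Z]+` pattern, digits never match, so `1971` is not counted; a pattern such as `\w+` would include it. If only the distinct words are needed, `set(counts)` gives the keys as a set.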

This gives:

Counter({'by': 3, 'ebooks': 2, 'and': 1, 'both': 1, 'since': 1, 'these': 1, 'readable': 1, 'computers': 1, 'humans': 1, '1971': 1, 'prepared': 1, 'of': 1, 'were': 1, 'volunteers': 1, 'thousands': 1})

User

#3 · edited 2024-05-03 04:26:21

If I were you, I would have used re.findall:

import re
s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!'''
set(re.findall('[a-zA-Z]+',s))

Output:

[output not preserved in the source]
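Since a set has no defined order, the printed result can vary between runs and Python versions. A small sketch, assuming you want a reproducible listing, sorts the words before printing (the variable name `unique_words` is chosen for this example only):

```python
import re

s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!'''

# Distinct runs of ASCII letters; capitalization is kept as in the text
unique_words = set(re.findall(r"[a-zA-Z]+", s))

# Sort case-insensitively so the listing is stable across runs
for word in sorted(unique_words, key=str.lower):
    print(word)
```

To treat `By` and `by` as the same word, as the question's code does, lowercase each match before building the set.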
