Extracting difficult English words from text for vocabulary building with Python or JavaScript

Posted 2024-09-25 00:22:57


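I would like to pick out hard words from English texts on Project Gutenberg for vocabulary building, using Python or JavaScript. I am not interested in simple words, only in unusual vocabulary such as "regal", "apocryphal", and so on.

When I split the text, how can I make sure I get only the unusual vocabulary and not the simple words?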


Tags: text, javascript, words, vocabulary, apocrypha
3 answers

As @Hoog suggested, here is the pseudocode:

simple_words = [...]
difficult_words = [word for word in english_vocabulary if word not in simple_words]
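
The pseudocode above can be made concrete roughly as follows. This is only a minimal sketch: both word lists are tiny hypothetical stand-ins for real data, and the helper name `difficult_words` is made up for illustration. A set is used for `simple_words` so that each membership test stays cheap.

```python
# Hypothetical stand-in data; a real run would load full word lists.
simple_words = {"the", "king", "book", "old", "story"}
english_vocabulary = ["the", "regal", "king", "apocryphal", "story", "Regal"]


def difficult_words(vocabulary, simple):
    """Return words not in `simple`, lowercased and de-duplicated,
    in the order they were first seen."""
    seen = set()
    result = []
    for word in vocabulary:
        key = word.lower()
        if key not in simple and key not in seen:
            seen.add(key)
            result.append(key)
    return result


print(difficult_words(english_vocabulary, simple_words))
# ['regal', 'apocryphal']
```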

I define an "uncommon word" as a word that does not appear among the 10,000 most common English words.

The 10K cutoff is an arbitrary boundary, but as the GitHub repo states:

According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.

import requests

english_most_common_10k = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears.txt'

# Download the list of the 10,000 most common words from a TXT file in a GitHub repo
response = requests.get(english_most_common_10k, timeout=10)
response.raise_for_status()  # fail early on a bad HTTP status
data = response.text

# One word per line; drop blank lines and surrounding whitespace
set_of_common_words = {line.strip() for line in data.splitlines() if line.strip()}

# Once we have the set of common words, we can just check membership.
# A set lookup is O(1) in the average case,
# but you could also use, for example, a search tree with O(log(n)) lookups.
while True:
    word = input().strip().lower()  # the word list is expected to be lowercase
    if word in set_of_common_words:
        print(f'The word "{word}" is common')
    else:
        print(f'The word "{word}" is difficult')
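
To get closer to the original question (splitting a whole text rather than checking one word at a time), the same set lookup can be applied to every token. The following is a rough sketch, not a definitive solution: the function name `rare_words`, the `min_length` parameter, the regular expression, and the sample data are all assumptions made for illustration. In practice the second argument would be the full set of common words rather than the small stand-in used here.

```python
import re
from collections import Counter


def rare_words(text, common_words, min_length=4):
    """Count the words in `text` that are absent from `common_words`.

    Words shorter than `min_length` are skipped (an arbitrary cutoff).
    """
    # Lowercase, then keep runs of letters with an optional apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(
        token
        for token in tokens
        if len(token) >= min_length and token not in common_words
    )


# Hypothetical sample data standing in for a Gutenberg text and the 10K list.
sample_text = "The regal procession passed. An apocryphal tale, the tale of a regal king."
sample_common = {"the", "passed", "tale", "king", "of", "an", "a"}

for word, count in rare_words(sample_text, sample_common).most_common():
    print(word, count)
```

Note that a plain frequency list will also flag proper nouns and inflected forms that are missing from the list, so the output usually needs some manual review or a lemmatization step.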

You can also use pop() to pull the list of the most difficult words out of an English dictionary.
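
The answer does not say what the dictionary looks like, so the following is only one possible reading: it assumes a Python `dict` that maps each word to a definition, and uses `dict.pop()` to move every entry that is not a common word into a separate mapping. The names `english_dictionary`, `common_words`, and `difficult_entries`, and all of the data, are hypothetical.

```python
# Hypothetical miniature dictionary: word -> definition.
english_dictionary = {
    "the": "definite article",
    "regal": "of or fit for a monarch",
    "king": "a male monarch",
    "apocryphal": "of doubtful authenticity",
}
common_words = {"the", "king"}

difficult_entries = {}
# Iterate over a copy of the keys, because the dict is mutated in the loop.
for word in list(english_dictionary):
    if word not in common_words:
        # pop() removes the entry and returns its value in one step.
        difficult_entries[word] = english_dictionary.pop(word)

print(sorted(difficult_entries))   # ['apocryphal', 'regal']
print(sorted(english_dictionary))  # ['king', 'the']
```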
