Frequency of keywords in a list

3 answers

User

Answer 1 · edited 2024-09-27 07:36:16

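I agree with @bereal that you should use Counter for this. I know you said you don't want "imports, dict or zips", so feel free to disregard this answer. Still, one of Python's major advantages is its standard library: whenever you have list available, you also have dict, collections.Counter and re.

From your code I get the impression that you want to write in the same style you would use in C or Java. I suggest trying to be a little more Pythonic. Code written that way may look unfamiliar at first and takes time to get used to, but you will learn far more.

It would help to know what you are trying to achieve. Are you learning Python? Are you solving this specific problem? Why can't you use imports, dict or zips?

So, for what it's worth, here is a suggestion that uses only built-in functionality (no third-party packages), tested with Python 2: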

#!/usr/bin/python

import re           # String matching
import collections  # collections.Counter basically solves your problem


def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Common punctuation (. , ! ?) is ignored.

    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()


def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only the letters a-z and ' are kept;
    every other character is treated as a separator.

    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())


# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count every word in sourcefile_words, whether or not it is a keyword
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dict subclass)
count_this = wordcount_all["this"] # returns 2
count_a = wordcount_all["a"] # returns 3

# Only look for words in the keywords-set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)

count_and = wordcount_keywords["and"] # returns 2
# The keywords that occur at least once: 'a', 'and', 'of', 'the'
# (a list in arbitrary order on Python 2, a view object on Python 3)
all_counted_keywords = wordcount_keywords.keys()
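
As a small follow-up sketch (not part of the original answer): a Counter returns 0 for a word it has never seen, so reporting a count for every keyword, including ones that never occur in the text, needs no special casing. The helper name `keyword_counts` and the sample sentence are hypothetical; the tokenization mirrors `loadwords_re` above.

```python
import collections
import re


def keyword_counts(text, keywords):
    """Return {keyword: count} for every keyword, including zero counts."""
    # Same tokenization as loadwords_re: lowercase, keep only a-z and '
    words = re.sub(r"[^a-z']", " ", text.lower()).split()
    counts = collections.Counter(words)
    # Indexing a Counter with a missing key yields 0 instead of raising KeyError
    return dict((kw, counts[kw]) for kw in keywords)


result = keyword_counts("A cat and a dog. The cat!", ["a", "the", "i"])
print(result)
```

Here "a" is counted twice, "the" once, and "i" gets a count of zero because it never appears.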

User

Answer 2 · edited 2024-09-27 07:36:16

You can try this.

I'm using a list of words as an example.

word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)

RESULT: {'test': 1, 'world': 1, 'hello': 2}

Since you have put a restriction on dicts, I used two lists to do the same task. I don't know how efficient it is, but it serves the purpose.

[code block not preserved in the source]
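
A two-list version along the lines described above might look like the following sketch. It is an illustration under assumptions, not the answerer's original code: the names `words_seen` and `counts` are hypothetical, and it avoids dict, zip and imports to respect the asker's restrictions.

```python
word_list = ['hello', 'world', 'test', 'hello']

words_seen = []  # distinct words, in order of first appearance
counts = []      # counts[i] is how many times words_seen[i] occurred

for word in word_list:
    if word in words_seen:
        # Linear search for the word's position, then bump the matching count
        counts[words_seen.index(word)] += 1
    else:
        words_seen.append(word)
        counts.append(1)

# Walk both lists by index instead of using zip
for i in range(len(words_seen)):
    print('%s: %d' % (words_seen[i], counts[i]))
```

Every lookup scans `words_seen` from the start, so the running time grows with the number of words multiplied by the number of distinct words. That is fine for short lists and noticeably slower than a dict for long ones.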

You can change it however you like, or refactor it as you see fit.

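User

Answer 3 · edited 2024-09-27 07:36:16

Here is a solution with no imports. It uses nested linear searches, which are acceptable for a small number of lookups over a small input, but become unwieldy and slow as the input grows.

The input here is still fairly large, and it is handled in reasonable time. I suspect the slowdown would start to show if your keywords file were larger (mine has only 3 words).

We take an input file, iterate over its lines, strip the punctuation, split on whitespace, and flatten all the words into a single list. The list contains duplicates, so to collapse them we sort it, which brings identical words together, and then iterate over it to build a new list of [word, count] pairs. While the same word keeps appearing we increment the count of the last entry; when a new word shows up we start a new entry.

Now that you have your word frequency list, you can search it for each required keyword and retrieve its count.

The input text file is here, and a keywords file can be cobbled together from a few words in that file, one per line.

This is Python 3 code; comments indicate, where applicable, how to modify it for Python 2.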

# use string.punctuation if you are somehow allowed 
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # Python 2 alternative (requires `import string`):
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list add to the count as long as the 
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0   

# open keywords file and search for each word in the 
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        word = line.strip()
        # skip blank lines; a raw line always contains '\n', so test the stripped word
        if word:
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))

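If you wanted to, you could modify findword to use a binary search, but it still won't come anywhere near a dict. collections.Counter is the right solution when there are no restrictions: it is faster and takes less code.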
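
As a sketch of the binary-search variant mentioned above (an illustration, not part of the original answer): it assumes the list of [word, count] pairs is sorted by word, which holds for `counts` because it is built from the sorted `words` list. The name `findword_binary` and the sample data are hypothetical.

```python
def findword_binary(clist, word):
    """Binary search over [word, count] pairs that are sorted by word."""
    lo = 0
    hi = len(clist) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if clist[mid][0] == word:
            return clist[mid][1]
        if clist[mid][0] < word:
            lo = mid + 1   # target sorts after the middle entry
        else:
            hi = mid - 1   # target sorts before the middle entry
    return 0               # same "not found" result as findword


# Small sorted sample in the same shape as the counts list
counts = [['be', 2], ['not', 1], ['or', 1], ['to', 2]]
print(findword_binary(counts, 'to'))
print(findword_binary(counts, 'hamlet'))
```

Each lookup halves the remaining range, so it needs on the order of log2(n) comparisons instead of n, while a dict lookup remains roughly constant time on average.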
