Count of a specific word differs between the lowercased text and the lowercase + title-case counts

Posted 2024-10-03 19:30:54


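Clearly I'm missing something simple. My guess is that text1 contains other casings of "whale". More important than the answer itself: apart from searching case-insensitively for "whale" in both text1 and text1L, how can I debug this efficiently?

Thanks, it's early days for me with NLTK.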

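import nltk
from nltk.book import *

# Lowercase every token, then count "whale"
text1L = [w.lower() for w in text1]
print(text1L.count('whale'))
# prints 1226

# Count only the lowercase and title-case forms in the original text
print(text1.count('Whale') + text1.count('whale'))
# prints 1188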

3 answers

You can use the code below to check whether text1 also contains "WHALE":

>>> res = [j for j in (w for w in text1
               if all(i in w.lower() for i in 'whale')
               and len(w) == 5) if j not in ('Whale', 'whale')]
>>> len(res)  # 38 = 1226 - 1188
38
>>> 
>>> res
['WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE', 'WHALE']
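Note that the letter-membership test above only checks that a five-character token contains each of the letters w, h, a, l, e, so it would also accept an anagram such as "wheal". A minimal sketch of a stricter check that compares the lowercased token directly; the `tokens` list is a small made-up stand-in for `text1`, not real corpus data:

```python
# Hypothetical stand-in for text1; any iterable of token strings works.
tokens = ["The", "WHALE", "whale", "Whale", "wheal", "WHALE", "sea"]

# Casings that were already counted in the question.
already_counted = {"whale", "Whale"}

# Keep tokens equal to 'whale' when case is ignored, minus the known casings.
res = [w for w in tokens if w.lower() == "whale" and w not in already_counted]

print(res)       # ['WHALE', 'WHALE']
print(len(res))  # 2
```

With the real `text1` in place of `tokens`, the length of `res` should match the gap between the two counts.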

So now you have:

>>> [w.lower() for w in text1].count('whale')
1226
>>>
>>> text1.count('Whale') + text1.count('whale') + text1.count('WHALE')
1226

There are also some all-uppercase "WHALE"s.

whale 906
Whale 282
WHALE 38

So:

print(text1.count('Whale') + text1.count('whale') + text1.count('WHALE'))
>> 1226

To figure this out, I generated every case variation of the word "whale" and printed the ones with a non-zero count.

Generating the variations:

def get_all_variations(word):
    if len(word) == 1:
        #a single character has two variations. e.g. a -> [a, A]
        return [word, word.upper()]
    else:
        #otherwise, call recursively using the left and the right half, and merge results.
        word_mid_point = len(word) // 2
        left_vars = get_all_variations(word[:word_mid_point])
        right_vars = get_all_variations(word[word_mid_point:])
        variations = []
        for left_var in left_vars:
            for right_var in right_vars:
                variations.append(left_var + right_var)
        return variations

Then:

whale_variations = get_all_variations("whale")
for whale_variation in whale_variations:
    # Count exact (case-sensitive) matches of this variation in the text
    count = text1.count(whale_variation)
    if count > 0:
        print(whale_variation, count)


As a side note, the full set of variations looks rather neat:

'whale,whalE,whaLe,whaLE,whAle,whAlE,whALe,whALE,wHale,wHalE,wHaLe,wHaLE,wHAle,wHAlE,wHALe,wHALE,Whale,WhalE,WhaLe,WhaLE,WhAle,WhAlE,WhALe,WhALE,WHale,WHalE,WHaLe,WHaLE,WHAle,WHAlE,WHALe,WHALE'
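The list above has 32 entries because each of the five letters can independently be lower or upper case, giving 2^5 combinations. A rough sketch of that arithmetic, assuming plain ASCII letters; `count_case_variations` is a hypothetical helper name, not part of NLTK:

```python
def count_case_variations(word):
    """Number of distinct casings, assuming each cased letter has two forms."""
    n = 1
    for c in word:
        # Digits and punctuation have no separate upper case: one choice only.
        if c.lower() != c.upper():
            n *= 2
    return n

print(count_case_variations("whale"))      # 32
print(count_case_variations("leviathan"))  # 512
```

The count doubles with every additional letter, so this brute-force approach is best kept to short words.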

Iterating over an nltk.Text object yields strings, one per token. If you want to apply the same operation to every string, using map() may be a good idea.

>>> from nltk.book import * 
>>> text1_lowered = list(map(str.lower, text1))
>>> text1_lowered.count('whale')
1226
>>> text1_lowered.count('Whale')
0
>>> text1.count('Whale') + text1.count('whale')
1188
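If only the count is needed, the lowercased list does not have to be built at all. A small sketch using a generator expression; `tokens` is a made-up stand-in for `text1`:

```python
# Hypothetical stand-in for text1.
tokens = ["whale", "Whale", "WHALE", "sea", "ship"]

# Count case-insensitive matches in a single pass, without a second list.
n_whales = sum(1 for w in tokens if w.lower() == "whale")

print(n_whales)  # 3
```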

To solve the mystery of where the other "whales" come from, and how we arrive at 1226:

>>> from collections import Counter
>>> Counter([word for word in text1 if word.lower() == 'whale'])
Counter({'whale': 906, 'Whale': 282, 'WHALE': 38})
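The same idea can be generalized into a debugging aid that does not require guessing the word in advance. The sketch below groups every token by its lowercased form and records which casings occur; `casing_breakdown` is a hypothetical helper, and `tokens` is a made-up stand-in for `text1`:

```python
from collections import Counter, defaultdict

def casing_breakdown(tokens):
    """Map each lowercased word to a Counter of the casings it appears in."""
    breakdown = defaultdict(Counter)
    for tok in tokens:
        breakdown[tok.lower()][tok] += 1
    return breakdown

# Hypothetical stand-in for text1.
tokens = ["whale", "Whale", "WHALE", "whale", "Sea", "sea", "ship"]
breakdown = casing_breakdown(tokens)

print(dict(breakdown["whale"]))  # {'whale': 2, 'Whale': 1, 'WHALE': 1}

# Words that occur in more than one casing are the ones whose
# case-sensitive and case-insensitive counts can disagree.
mixed = {word: c for word, c in breakdown.items() if len(c) > 1}
print(sorted(mixed))  # ['sea', 'whale']
```

This scans the text once, whereas calling `count()` for each candidate casing scans it once per casing.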

Regarding @axiom's idea of generating every possible case combination of "whale", see String manipulation in Python (All upper and lower case derivatives of a word)

>>> from itertools import product

>>> cRaZySpe3K = lambda s: [''.join(x) for x in product(*[{c.upper(), c} for c in s.lower()])]

>>> cRaZySpe3K('whale')
['WHALe', 'WHALE', 'WHAle', 'WHAlE', 'WHaLe', 'WHaLE', 'WHale', 'WHalE', 'WhALe', 'WhALE', 'WhAle', 'WhAlE', 'WhaLe', 'WhaLE', 'Whale', 'WhalE', 'wHALe', 'wHALE', 'wHAle', 'wHAlE', 'wHaLe', 'wHaLE', 'wHale', 'wHalE', 'whALe', 'whALE', 'whAle', 'whAlE', 'whaLe', 'whaLE', 'whale', 'whalE']

>>> {whale:text1.count(whale) for whale in cRaZySpe3K('whale')}
{'WHALe': 0, 'WHALE': 38, 'WHAle': 0, 'WHAlE': 0, 'WHaLe': 0, 'WHaLE': 0, 'WHale': 0, 'WHalE': 0, 'WhALe': 0, 'WhALE': 0, 'WhAle': 0, 'WhAlE': 0, 'WhaLe': 0, 'WhaLE': 0, 'Whale': 282, 'WhalE': 0, 'wHALe': 0, 'wHALE': 0, 'wHAle': 0, 'wHAlE': 0, 'wHaLe': 0, 'wHaLE': 0, 'wHale': 0, 'wHalE': 0, 'whALe': 0, 'whALE': 0, 'whAle': 0, 'whAlE': 0, 'whaLe': 0, 'whaLE': 0, 'whale': 906, 'whalE': 0}
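Because the lambda above builds each position from a set, the order of the variations may differ from run to run, and most of the resulting counts are zero. A possible variant, offered as a sketch only, sorts the choices for a stable order and keeps just the non-zero counts; `case_variations` is a hypothetical name and `tokens` is a made-up stand-in for `text1`:

```python
from itertools import product

def case_variations(word):
    """All upper/lower case variations of word, in a deterministic order."""
    # sorted() fixes the order at each position; characters without a
    # distinct upper case (digits, punctuation) contribute a single choice.
    choices = [sorted({c.lower(), c.upper()}) for c in word]
    return ["".join(p) for p in product(*choices)]

# Hypothetical stand-in for text1.
tokens = ["whale", "Whale", "WHALE", "whale", "sea"]

counts = {v: tokens.count(v) for v in case_variations("whale")}
nonzero = {v: n for v, n in counts.items() if n > 0}

print(len(counts))  # 32
print(nonzero)      # {'WHALE': 1, 'Whale': 1, 'whale': 2}
```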
