How can I efficiently count string aliases?

Posted 2024-09-30 06:29:55


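I'm working on a personal project that counts how often names are mentioned in a text. I know I can do this with collections.Counter(), but I don't know how to account for aliases efficiently.

For example, one of the names I want to count is "Tim", but I also want to count his nicknames, such as "Timmy" and "Timster".

I have some strings such as "Oh Tim is going to the party?", "Yeah, my boy Timmy, wouldn't miss it, he loves to party!", and "Whoa, the Timster himself is going? Count me in!"

I want all of them counted under a single key, such as "Tim". I know I could simply count each variant separately and add the results up, but I feel there must be a better way.

I'd like my code to look more like this: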

names = {
    'Tim': {'Tim', 'Timmy', 'Timster'},
    # ... other names here.
}
# add any occurrence of Tim names to Tim and other occurrences of other names to their main name.

rather than something like

total_tim = Counter(tim) + Counter(timmy) + Counter(timster)  # etc.

for every single name. Does anyone know how I could do this?


3 answers

Using a regex will help solve this.

import re

your_dict = {"Tim": ["Tim", "Timmy", "Timster"]}
s = "Oh Tim is going to the party? Yeah, my boy Timmy, wouldn't miss it, he loves to party! Whoa, the Timster himself is going? Count me in!"
for name, aliases in your_dict.items():
    # Put longer aliases first so "Timmy" is not matched as just "Tim".
    pattern = "|".join(sorted(map(re.escape, aliases), key=len, reverse=True))
    print(name, "count =", len(re.findall(pattern, s)))

To ignore case, just pass the re.IGNORECASE flag to re.findall.
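
A minimal sketch of the case-insensitive variant, for illustration only. The sample text and the helper name count_aliases are assumptions, not part of the answer above. It also wraps the alternation in \b word boundaries, which the original pattern does not do, so that a word like "time" is not counted as "Tim".

```python
import re

# Hypothetical alias table and sample text, mirroring the answer above.
aliases = {"Tim": ["Tim", "Timmy", "Timster"]}
text = "Oh tim is going? Yeah, my boy TIMMY! Whoa, the Timster himself, no time to waste."

def count_aliases(alias_map, s):
    """Return {base name: number of case-insensitive alias matches in s}."""
    counts = {}
    for name, variants in alias_map.items():
        # Longest variants first; \b keeps "time" from counting as "Tim".
        alternation = "|".join(sorted(map(re.escape, variants), key=len, reverse=True))
        pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
        counts[name] = len(pattern.findall(s))
    return counts

print(count_aliases(aliases, text))  # {'Tim': 3}
```

Compiling the pattern once per name keeps the flag and the boundaries in one place; whether word boundaries are wanted depends on how strictly the aliases should match.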

from collections import Counter

TEXT = '''
    Blah Tim blah blah Timmy blah Timster blah Tim
    Blah Bill blah blah William blah Billy blah Bill Bill
'''
words = TEXT.split()

# Base names and their aliases.
ALIASES = dict(
    Tim = {'Tim', 'Timmy', 'Timster'},
    Bill = {'Bill', 'William', 'Billy'},
)

# Given any name, find its base name.
BASE_NAMES = {a : nm for nm, aliases in ALIASES.items() for a in aliases}

# All names.
ALL_NAMES = set(nm for aliases in ALIASES.values() for nm in aliases)

# Count up all names.
detailed_tallies = Counter(w for w in words if w in ALL_NAMES)

# Then build the summary counts from those details.
summary_tallies = Counter()
for nm, n in detailed_tallies.items():
    summary_tallies[BASE_NAMES[nm]] += n

print(detailed_tallies)
print(summary_tallies)

# Counter({'Bill': 3, 'Tim': 2, 'Timmy': 1, 'Timster': 1, 'William': 1, 'Billy': 1})
# Counter({'Bill': 5, 'Tim': 4})
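
One caveat with str.split(): it leaves punctuation attached, so in the question's sentences "Timmy," would not match the alias "Timmy". The sketch below is one possible workaround, not part of the answer above: it tokenizes with re.findall(r"\w+", ...) instead. The names text, aliases, and base_names are illustrative.

```python
import re
from collections import Counter

# Sample text taken from the question; the alias table is illustrative.
text = ("Oh Tim is going to the party? Yeah, my boy Timmy, wouldn't miss it, "
        "he loves to party! Whoa, the Timster himself is going? Count me in!")
aliases = {"Tim": {"Tim", "Timmy", "Timster"}}

# Map every alias to its base name.
base_names = {alias: base for base, variants in aliases.items() for alias in variants}

# \w+ pulls out runs of word characters, so "Timmy," yields "Timmy".
words = re.findall(r"\w+", text)
summary = Counter(base_names[w] for w in words if w in base_names)

print(summary)  # Counter({'Tim': 3})
```

Counting base names directly skips the intermediate per-alias tally; keep the two-step version above if the detailed counts are also needed.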

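Here is a very simple solution using a regex.

The advantage of this solution is that you don't have to name the variants explicitly. If you know how the person's name begins, you should be fine.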

from collections import Counter
import re

TEXT = '''
    Blah Tim blah blah Timmy blah Timster blah Tim
    Blah Bill blah blah William blah Billy blah Bill Bill
'''

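# Raw strings avoid invalid-escape warnings; the first group is the whole name.
tim_search = r'(Tim([a-z]*)?)'
bill_search = r'((B|W)ill([a-z]*)?)'

def name_counter(regex_string):
    # findall returns one tuple of groups per match; keep only the first group.
    return Counter(i for i, *j in re.findall(regex_string, TEXT))

print(name_counter(tim_search))
# Counter({'Tim': 2, 'Timmy': 1, 'Timster': 1})

print(name_counter(bill_search))
# Counter({'Bill': 3, 'William': 1, 'Billy': 1})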
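
The counts above are still per variant, while the question asks for one total per person. A rough sketch of how the prefix patterns might be folded into summary counts follows. The PATTERNS table and the summary_counts helper are hypothetical names introduced here, and the patterns add \b word boundaries that the answer's patterns do not have.

```python
import re
from collections import Counter

TEXT = '''
    Blah Tim blah blah Timmy blah Timster blah Tim
    Blah Bill blah blah William blah Billy blah Bill Bill
'''

# Hypothetical table: base name -> pattern covering its variants.
PATTERNS = {
    'Tim': r'\bTim[a-z]*\b',
    'Bill': r'\b[BW]ill[a-z]*\b',
}

def summary_counts(patterns, text):
    # One total per base name: the number of matches of its pattern.
    return Counter({name: len(re.findall(p, text)) for name, p in patterns.items()})

print(summary_counts(PATTERNS, TEXT))  # Counter({'Bill': 5, 'Tim': 4})
```

Prefix patterns like these can over-match: "Time" or "Willow" would also be counted, so an explicit alias list is safer when the text contains such words.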
