Can anyone explain what the remove_punct_dict statement does? And what is the output of the last line?

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

1 answer

User

#1 · Posted on 2024-10-04 05:20:35

remove_punct_dict is a dict whose keys are the Unicode code points of every character in string.punctuation. The value for each key is None.

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

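It breaks down simply: for each character punct in string.punctuation, the generator yields the pair (ord(punct), None), and dict() builds a dictionary from those pairs. ord is a Python built-in function that returns the Unicode code point of a character.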
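To make the structure of the dict concrete, here is a minimal sketch that uses only the standard library. The sample string is a made-up illustration and is not from the question; the comparison with str.maketrans is included only to show an equivalent way of building the same table.

```python
import string

# Same construction as in the question: map each punctuation code point to None.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

# Keys are integers (code points); every value is None.
print(ord("!"))                     # 33
print(remove_punct_dict[ord("!")])  # None

# str.translate deletes every character whose code point maps to None.
sample = "Hello, World!"  # hypothetical sample string
cleaned = sample.translate(remove_punct_dict)
print(cleaned)  # Hello World

# str.maketrans with a third argument builds an equivalent deletion table.
equivalent = str.maketrans("", "", string.punctuation)
print(equivalent == remove_punct_dict)  # True
```

Because the keys are code points rather than characters, the dict can be passed directly to str.translate, which is exactly how LemNormalize uses it.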

Now let's look at the last function:

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

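This function first converts the given text to lowercase, then calls translate with the dict we created earlier to delete the punctuation from the text.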

For example, Hello World! becomes hello world!, and then hello world.

It then tokenizes the text into words, so instead of Hello World we now have hello and world.
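The steps so far can be traced one at a time. This is a rough sketch, not the code from the question: str.split is used as a stand-in for nltk.word_tokenize so that the snippet runs without NLTK data, and the two only behave alike on simple whitespace-separated text such as this example.

```python
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

text = "Hello World!"

# Step 1: lowercase the text.
lowered = text.lower()
print(lowered)   # hello world!

# Step 2: delete punctuation via the translation table.
stripped = lowered.translate(remove_punct_dict)
print(stripped)  # hello world

# Step 3: split into tokens (assumption: str.split stands in for nltk.word_tokenize).
tokens = stripped.split()
print(tokens)    # ['hello', 'world']
```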

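The last step lemmatizes each token, that is, reduces it to its base dictionary form. Note that the code uses nltk's WordNetLemmatizer, which is a lemmatizer and not a Porter stemmer. hello and world are already in their base form, so they stay unchanged. The final output for my example is therefore very simple: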

hello and world

For example:

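import string
import nltk

# Note: word_tokenize and WordNetLemmatizer need NLTK's tokenizer and WordNet
# data, which must be fetched once with nltk.download (resource names depend
# on the installed NLTK version).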

text = "Hello World! My name is bob and i own a dog, a cat and a chicken."
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

print(LemNormalize(text)) # ['hello', 'world', 'my', 'name', 'is', 'bob', 'and', 'i', 'own', 'a', 'dog', 'a', 'cat', 'and', 'a', 'chicken']
