Consequences of abusing nltk's word_tokenize(sent)

Posted 2024-05-20 03:14:48


I'm trying to break a paragraph up into words. I have the lovely nltk.tokenize.word_tokenize(sent) at hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."

Does anyone know what happens if you use it on a whole paragraph instead, say one of up to 5 sentences? I've tried it myself on a few short paragraphs and it seems to work, but that's hardly conclusive proof.
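
A rough sketch of the kind of test I mean (assuming NLTK and its punkt sentence model are installed), comparing the whole paragraph at once against tokenizing sentence by sentence:

from nltk.tokenize import sent_tokenize, word_tokenize

para = "I met Mr. Smith today. He said hi. We talked for a while."
print(word_tokenize(para))                                         # whole paragraph at once
print([w for s in sent_tokenize(para) for w in word_tokenize(s)])  # one sentence at a time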


2 Answers

nltk.tokenize.word_tokenize(text) is simply a thin wrapper function that calls the tokenize method of an instance of the TreebankWordTokenizer class, which apparently uses simple regexes to parse a sentence.
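
A small sketch of that relationship, using the names exposed by nltk.tokenize (newer NLTK releases additionally run a sentence splitter inside word_tokenize, but the delegation is the same idea):

>>> from nltk.tokenize import TreebankWordTokenizer, word_tokenize
>>> TreebankWordTokenizer().tokenize("Hello, world.")
>>> word_tokenize("Hello, world.")   # effectively the same call in the NLTK version this answer describes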

The documentation for that class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods apart from those at the end of a string are assumed to be part of the word they are attached to (e.g. for abbreviations, etc), and are not separately tokenized.

The underlying tokenize method itself is very simple:

def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub(r'\. *(\n|$)', ' . ', text)

    return text.split()
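
A quick standalone illustration of what those substitutions do to one sample sentence (the contraction step is skipped, and the regexes are applied by hand rather than through the class):

import re

text = "We paid 2,500 dollars, cash."
text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)   # separate most punctuation
text = re.sub(r"(,\s)", r' \1', text)                # commas followed by a space
text = re.sub(r"('\s)", r' \1', text)                # single quotes followed by a space
text = re.sub(r'\. *(\n|$)', ' . ', text)            # periods before end of string
print(text.split())
# -> ['We', 'paid', '2,500', 'dollars', ',', 'cash', '.']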

Basically, if the period falls at the end of the string, the method does normally tokenize it as a separate token:

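A minimal illustration of that case (the trailing period comes out as its own token):

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']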

Any period that falls inside the string, however, is tokenized as part of the word it is attached to, on the assumption that it is an abbreviation:

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']

As long as that behaviour is acceptable to you, you should be fine.

Try this approach:

>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
# Add a space around each punctuation character
>>> for ch in sent:
...     if ch in punct:
...             sent = sent.replace(ch, " "+ch+" ")
# Remove the double spaces introduced by the previous step
>>> sent = " ".join(sent.split())

Then, most probably, the following is what you need to count the word frequencies =)

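For example, with collections.Counter (just one possible way to do the counting):

>>> from collections import Counter
>>> word_freq = Counter(sent.split())
>>> word_freq.most_common(5)   # the five most frequent tokens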
