Preprocessing a corpus stored in a DataFrame with NLTK

I'm learning NLP and trying to understand how to preprocess a corpus stored in a pandas DataFrame. Suppose I have this:

import pandas as pd

doc1 = """"Whitey on the Moon" is a 1970 spoken word poem by Gil Scott-Heron. It was released as the ninth track on Scott-Heron's debut album Small Talk at 125th and Lenox. It tells of medical debt and poverty experienced during the Apollo Moon landings. The poem critiques the resources spent on the space program while Black Americans were experiencing marginalization. "Whitey on the Moon" was prominently featured in the 2018 biographical film about Neil Armstrong, First Man."""
doc2 = """St Anselm's Church is a Roman Catholic church which is part of the Personal Ordinariate of Our Lady of Walsingham in Pembury, Kent, England. It was originally founded in the 1960s as a chapel-of-ease before becoming its own quasi-parish within the personal ordinariate in 2011, following a conversion of a large number of disaffected Anglicans in Royal Tunbridge Wells."""
doc3 = """Nymphargus grandisonae (common name: giant glass frog, red-spotted glassfrog) is a species of frog in the family Centrolenidae. It is found in Andes of Colombia and Ecuador. Its natural habitats are tropical moist montane forests (cloud forests); larvae develop in streams and still-water pools. Its habitat is threatened by habitat loss, introduced fish, and agricultural pollution, but it is still a common species not considered threatened by the IUCN."""

df = pd.DataFrame({'text': [doc1, doc2, doc3]})

This results in:

+---+---------------------------------------------------+
|   |                                              text |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+

Now I load what I need and tokenize the text:

import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

df['tokenized_text'] = df['text'].apply(word_tokenize)
df

This gives the following output:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, (, common, name, :, ... |
+---+---------------------------------------------------+---------------------------------------------------+

Now, the problem appears when I try to remove the stopwords:

df['tokenized_text'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not  in [stop_words] + list(string.punctuation)])

It looks like nothing happened:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, common, name, giant,... |
+---+---------------------------------------------------+---------------------------------------------------+

Can someone help me understand what is going on and what I should do?

After that, I would like to apply lemmatization, but in its current state it doesn't work:

lemmatizer = WordNetLemmatizer
df['tokenized_text'] = df['tokenized_text'].apply(lemmatizer.lemmatize)

This yields:

TypeError: lemmatize() missing 1 required positional argument: 'word'

Thanks!


1 Answer
First issue

With stop_words = set(stopwords.words('english')) ... if word not in [stop_words] you created a list containing a single element, namely the whole set of stopwords. No word is ever equal to that entire set, so no stopwords get removed. It has to be:
stop_words = stopwords.words('english')
df['tokenized_text'].apply(lambda words: [word for word in words if word not in stop_words + list(string.punctuation)])
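
As a minimal sketch of the same fix (assuming the df from the question), you can also keep stop_words as a set and merge it with the punctuation characters, so every membership test stays a fast set lookup:

import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
to_drop = stop_words | set(string.punctuation)   # one combined set of tokens to discard
df['tokenized_text'] = df['tokenized_text'].apply(
    lambda words: [word for word in words if word not in to_drop])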

Second issue

With lemmatizer = WordNetLemmatizer you assigned the class itself, but you need to create an instance of that class: lemmatizer = WordNetLemmatizer()
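
A quick sketch of the difference (the example word is only for illustration):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()          # an instance, so lemmatize() is bound to it
print(lemmatizer.lemmatize('frogs'))      # -> 'frog'
# WordNetLemmatizer.lemmatize('frogs')    # calling through the class is what raises the TypeError above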

Third issue

You can't lemmatize an entire list in one call; you have to lemmatize word by word: df['tokenized_text'].apply(lambda words: [lemmatizer.lemmatize(word) for word in words])
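
Putting the three fixes together, a minimal end-to-end sketch (assuming the df and column names from the question; the download calls only need to run once):

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

to_drop = set(stopwords.words('english')) | set(string.punctuation)
lemmatizer = WordNetLemmatizer()   # an instance of the class

df['tokenized_text'] = (
    df['text']
    .apply(word_tokenize)                                       # tokenize each document
    .apply(lambda ws: [w for w in ws if w not in to_drop])      # drop stopwords and punctuation
    .apply(lambda ws: [lemmatizer.lemmatize(w) for w in ws])    # lemmatize word by word
)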
