Preprocessing a corpus stored in a DataFrame with NLTK

I'm learning NLP and trying to understand how to preprocess a corpus stored in a pandas DataFrame. Suppose I have this:

import pandas as pd

doc1 = """"Whitey on the Moon" is a 1970 spoken word poem by Gil Scott-Heron. It was released as the ninth track on Scott-Heron's debut album Small Talk at 125th and Lenox. It tells of medical debt and poverty experienced during the Apollo Moon landings. The poem critiques the resources spent on the space program while Black Americans were experiencing marginalization. "Whitey on the Moon" was prominently featured in the 2018 biographical film about Neil Armstrong, First Man."""
doc2 = """St Anselm's Church is a Roman Catholic church which is part of the Personal Ordinariate of Our Lady of Walsingham in Pembury, Kent, England. It was originally founded in the 1960s as a chapel-of-ease before becoming its own quasi-parish within the personal ordinariate in 2011, following a conversion of a large number of disaffected Anglicans in Royal Tunbridge Wells."""
doc3 = """Nymphargus grandisonae (common name: giant glass frog, red-spotted glassfrog) is a species of frog in the family Centrolenidae. It is found in Andes of Colombia and Ecuador. Its natural habitats are tropical moist montane forests (cloud forests); larvae develop in streams and still-water pools. Its habitat is threatened by habitat loss, introduced fish, and agricultural pollution, but it is still a common species not considered threatened by the IUCN."""

df = pd.DataFrame({'text': [doc1, doc2, doc3]})

This results in:

+---+---------------------------------------------------+
|   |                                              text |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+

Now I load what I need and tokenize the text:

import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

df['tokenized_text'] = df['text'].apply(word_tokenize)
df

This gives the following output:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, (, common, name, :, ... |
+---+---------------------------------------------------+---------------------------------------------------+

Now, the problem appears when I try to remove the stopwords:

df['tokenized_text'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not  in [stop_words] + list(string.punctuation)])

It looks like nothing happened:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, common, name, giant,... |
+---+---------------------------------------------------+---------------------------------------------------+

Can someone help me understand what is going on and what I should do?

After that, I would like to apply lemmatization, but in its current state it doesn't work:

lemmatizer = WordNetLemmatizer
df['tokenized_text'] = df['tokenized_text'].apply(lemmatizer.lemmatize)

This yields:

TypeError: lemmatize() missing 1 required positional argument: 'word'

Thanks!


1 Answer
First issue

With stop_words = set(stopwords.words('english')) ... if word not in [stop_words] you created a list containing a single element, namely the whole set of stopwords. No word is ever equal to that entire set, so no stopwords get removed. It has to be:
stop_words = stopwords.words('english')
df['tokenized_text'].apply(lambda words: [word for word in words if word not in stop_words + list(string.punctuation)])
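
As a minimal sketch of the same fix (assuming the df from the question), you can also keep stop_words as a set and merge it with the punctuation characters, so every membership test stays a fast set lookup:

import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
to_drop = stop_words | set(string.punctuation)   # one combined set of tokens to discard
df['tokenized_text'] = df['tokenized_text'].apply(
    lambda words: [word for word in words if word not in to_drop])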

Second issue

With lemmatizer = WordNetLemmatizer you assigned the class itself, but you need to create an instance of that class: lemmatizer = WordNetLemmatizer()
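
A quick sketch of the difference (the example word is only for illustration):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()          # an instance, so lemmatize() is bound to it
print(lemmatizer.lemmatize('frogs'))      # -> 'frog'
# WordNetLemmatizer.lemmatize('frogs')    # calling through the class is what raises the TypeError above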

Third issue

You can't lemmatize an entire list in one call; you have to lemmatize word by word: df['tokenized_text'].apply(lambda words: [lemmatizer.lemmatize(word) for word in words])
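
Putting the three fixes together, a minimal end-to-end sketch (assuming the df and column names from the question; the download calls only need to run once):

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

to_drop = set(stopwords.words('english')) | set(string.punctuation)
lemmatizer = WordNetLemmatizer()   # an instance of the class

df['tokenized_text'] = (
    df['text']
    .apply(word_tokenize)                                       # tokenize each document
    .apply(lambda ws: [w for w in ws if w not in to_drop])      # drop stopwords and punctuation
    .apply(lambda ws: [lemmatizer.lemmatize(w) for w in ws])    # lemmatize word by word
)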
