Using RegexpTokenizer in a pandas DataFrame

import re
import nltk
import pandas as pd
from nltk import RegexpTokenizer

#tokenization of data and suppression of None (NA)
df['all_cols'].dropna(inplace=True)
tokenizer = RegexpTokenizer("[\w']+")
df['all_cols'] = df['all_cols'].apply(tokenizer)

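2 Answers

User

Answer 1 · edited 2024-10-17 16:35:44

First, to remove the missing values you need to use DataFrame.dropna with the column name specified, and then apply the tokenizer.tokenize method. Your solution does not actually remove the missing values: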

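import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer

# Sample data with one missing value
df = pd.DataFrame({'all_cols': ['who is your hero and why',
                                'what do you do to relax',
                                "can't stop to eat", np.nan]})
print(df)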
                   all_cols
0  who is your hero and why
1   what do you do to relax
2         can't stop to eat
3                       NaN

# The question's approach: dropna on the selected Series does not remove rows from df
df['all_cols'].dropna(inplace=True)
print(df)
                   all_cols
0  who is your hero and why
1   what do you do to relax
2         can't stop to eat
3                       NaN

# Correct approach: drop the rows of df where all_cols is missing
df.dropna(subset=['all_cols'], inplace=True)
print(df)
                   all_cols
0  who is your hero and why
1   what do you do to relax
2         can't stop to eat

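# Raw string so that \w is not treated as an invalid string escape
tokenizer = RegexpTokenizer(r"[\w']+")
# Pass the bound tokenize method, not the tokenizer instance itself
df['all_cols'] = df['all_cols'].apply(tokenizer.tokenize)
print(df)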
                          all_cols
0  [who, is, your, hero, and, why]
1   [what, do, you, do, to, relax]
2           [can't, stop, to, eat]
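
To see what the pattern itself matches, and why the missing value has to go before tokenizing, here is a minimal sketch that uses only the standard library. It assumes that applying the same pattern with re.findall gives the same tokens as RegexpTokenizer in its default mode; the helper name tokenize is made up for this illustration and is not part of NLTK or pandas.

```python
import re

# Same pattern the answer passes to RegexpTokenizer: runs of word
# characters or apostrophes. The raw string keeps \w intact.
PATTERN = re.compile(r"[\w']+")


def tokenize(text):
    """Hypothetical helper: return every match of PATTERN in text."""
    return PATTERN.findall(text)


# The apostrophe is inside the character class, so "can't" stays whole.
print(tokenize("can't stop to eat"))  # ["can't", 'stop', 'to', 'eat']

# A missing value is a float (NaN), not a string, so the regex rejects it.
try:
    tokenize(float("nan"))
except TypeError as exc:
    print("cannot tokenize a missing value:", exc)
```

The TypeError in the second call is the reason the rows with missing values are dropped before the tokenizer is applied.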

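User

Answer 2 · edited 2024-10-17 16:35:44

Note that calling RegexpTokenizer only creates an instance of the class with a given set of arguments (it invokes its __init__ method). To actually tokenize the DataFrame column with the specified pattern, you have to call its tokenize method: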

tokenizer = RegexpTokenizer(r"[\w']+")
df['all_cols'] = df['all_cols'].map(tokenizer.tokenize)

                          all_cols
0  [who, is, your, hero, and, why]
1   [what, do, you, do, to, relax]
...
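
The difference between the instance and its method can be sketched without NLTK. The class PatternTokenizer below is a hypothetical stand-in written for this illustration, not NLTK's implementation; it only assumes a tokenizer that stores a compiled pattern in __init__ and exposes a tokenize method.

```python
import re


class PatternTokenizer:
    """Hypothetical stand-in for a tokenizer class (not NLTK's code)."""

    def __init__(self, pattern):
        # Constructing the object only stores the compiled pattern.
        self._regexp = re.compile(pattern)

    def tokenize(self, text):
        # The actual work happens here, one string at a time.
        return self._regexp.findall(text)


tokenizer = PatternTokenizer(r"[\w']+")
texts = ["who is your hero and why", "what do you do to relax"]

# The instance defines no __call__, so it cannot be used as a function.
print(callable(tokenizer))           # False
# The bound method can, which is why it is what gets passed to map/apply.
print(callable(tokenizer.tokenize))  # True

tokens = list(map(tokenizer.tokenize, texts))
print(tokens[0])  # ['who', 'is', 'your', 'hero', 'and', 'why']
```

Passing tokenizer.tokenize hands over a function that takes one string and returns its tokens, which is the shape of argument that Series.map and Series.apply expect.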
