Python error when performing text preprocessing

def cleaningDocs(doc,stem): # 'S' for Stemming, 'L' for Lemmatization
    """This function cleans each doc string by doing the following:
    i) Removing punctuation and other non alphabetical characters
    ii) Convert to Lower case and split string into words (tokenization)
    ii) Removes stop words (most frequent words)
    iii) Doing Stemming and Lemmatization
    """

    # Removing punctuations and other non alphabetic characters
    import re
    alphabets_only=re.sub(r'[^a-zA-Z]'," ",doc)

    # Converting to lower case and splitting the words(tokenization)
    words_lower=alphabets_only.lower().split()

    # Removing stop words (Words like 'a', 'an', 'is','the' which doesn't contribute anything
    from nltk.corpus import stopwords
    useful_words = [w for w in words_lower if not w in set(stopwords.words("english"))]

    # Doing Stemming or Lemmatization (Normalising the text)
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    if (stem=='S'): # Choosing between Stemming ('S') and Lemmatization ('L')
        stemmer=PorterStemmer()
        final_words=[stemmer.stem(x) for x in useful_words]
    else:
        lemma=WordNetLemmatizer()
        final_words=[lemma.lemmatize(x) for x in useful_words]

    return(str(" ".join(final_words)))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-56-61345bb4d581> in <module>()
      1 doc=[]
      2 for x in docs:
----> 3     doc.append(cleaningDocs(x,"L"))
      4 

<ipython-input-42-6e1c58274c3d> in cleaningDocs(doc, stem)
     13     # Removing punctuations and other non alphabetic characters
     14     import re
---> 15     alphabets_only=re.sub(r'[^a-zA-Z]'," ",doc)
     16 
     17     # Converting to lower case and splitting the words(tokenization)

/Users/mtripathi/anaconda/lib/python2.7/re.pyc in sub(pattern, repl, string, count, flags)
    153     a callable, it's passed the match object and must return
    154     a replacement string to be used."""
--> 155     return _compile(pattern, flags).sub(repl, string, count)
    156 
    157 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or buffer

for x in docs:
    print(type(x))

<type 'str'>
<type 'str'>
<type 'str'>
<type 'float'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>

1 Answer

User

#1 · Posted 2024-10-01 15:45:09

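There are two main ideas here: Boolean indexing and function application.

Boolean indexing lets you select a subset of a Series using an array of True/False values, and function application applies a single function to every item in the Series.

First, apply isinstance to determine which elements are floats, then index the Series with that result to get those elements.

After that, simply applying str converts everything to a string.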

import pandas as pd

test = pd.Series(["Hey", "I'm", 1.0, "or", 2.0, "floats"])
# Find floats 
floats = test[test.apply(lambda x: isinstance(x, float))]
# Make all strings
test_as_strings = test.apply(str)
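
To connect this back to the question, here is a minimal sketch of the same two ideas applied before a cleaning step. It is only an illustration under stated assumptions: `docs` below is a hypothetical stand-in for the asker's data, the single float entry is assumed to be a missing value (NaN), which is a common reason for a float to appear in a text column, and `strip_non_letters` is a hypothetical helper that repeats only the `re.sub` step from `cleaningDocs`. It is written for Python 3, whereas the traceback comes from Python 2.7.

```python
import re

import pandas as pd

# Hypothetical stand-in for the question's `docs`.
# Assumption: the lone float entry is a missing value (NaN).
docs = pd.Series(
    ["Hello, World!", float("nan"), "Text pre-processing 101"],
    dtype=object,
)

# Boolean mask: True where the element is a real string.
is_text = docs.apply(lambda x: isinstance(x, str))

# Option 1: Boolean indexing keeps only the string entries.
text_only = docs[is_text]

# Option 2: function application converts every entry with str.
# Note that a NaN entry becomes the literal text "nan".
as_strings = docs.apply(str)


def strip_non_letters(doc):
    """Hypothetical helper: the re.sub step from cleaningDocs only."""
    return re.sub(r"[^a-zA-Z]", " ", doc)


# With only strings left, re.sub no longer raises a TypeError.
cleaned = text_only.apply(strip_non_letters)
```

Which option fits depends on the data: converting with str keeps the Series the same length but turns a missing value into the word "nan", while filtering with the mask drops the entry entirely.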
