Tokenization, including contraction expansion, lemmatization, and stemming.
tokenizer-xm: Python project description
Introduction
This package is a collection of several packages I have found useful for text preprocessing, including gensim and nltk. I put them together to create a more comprehensive and convenient pipeline.
Installation
pip install tokenizer_xm
Usage
Processing a single text string
from tokenizer_xm import text_tokenizer_xm, contractions  # import path assumed from the package name

example_text = ("This is an amazing product! I've been using it for almost a year "
                "now and it's clearly better than any other products I've used.")

print("Original text:")
print(example_text)
print("---")

print("Simple Preprocessed:")
print("---")
tk = text_tokenizer_xm(text=example_text, lemma_flag=False, stem_flag=False, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Pre-processing with regular contractions (e.g. I've -> I have):")
# In this package, I included a dictionary of regular contractions for your convenience
tk = text_tokenizer_xm(text=example_text, lemma_flag=False, stem_flag=False,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Pre-processing with lemmatization:")
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=False,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Pre-processing with lemmatization and stemming:")
# This package uses the SnowballStemmer from nltk.stem. I will try to make it customizable later
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Adding stop words")
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=["this", 'be', "an", 'it'])
print(tk.txt_pre_pros())
print("---")
Original text:
This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used.
---
Simple Preprocessed:
---
['this', 'is', 'an', 'amazing', 'product', 've', 'been', 'using', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'clearly', 'better', 'than', 'any', 'other', 'products', 've', 'used']
---
Pre-processing with regular contractions (e.g. I've -> I have):
['this', 'is', 'an', 'amazing', 'product', 'have', 'been', 'using', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'has', 'it', 'is', 'clearly', 'better', 'than', 'any', 'other', 'products', 'have', 'used']
---
Pre-processing with lemmatization:
['this', 'be', 'an', 'amaze', 'product', 'have', 'be', 'use', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clearly', 'better', 'than', 'any', 'other', 'product', 'have', 'use']
---
Pre-processing with lemmatization and stemming:
['this', 'be', 'an', 'amaz', 'product', 'have', 'be', 'use', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clear', 'better', 'than', 'ani', 'other', 'product', 'have', 'use']
---
Adding stop words
['amaz', 'product', 'have', 'use', 'for', 'almost', 'year', 'now', 'and', 'have', 'clear', 'better', 'than', 'ani', 'other', 'product', 'have', 'use']
---
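The outputs above suggest the overall order of operations: lowercase, expand contractions, strip punctuation while tokenizing, then optionally lemmatize/stem and drop stop words. As a rough illustration of the contraction-expansion and tokenization steps only (not the package's actual implementation; `contractions_demo` and `toy_preprocess` are hypothetical names):

```python
import re

# Toy sketch of the preprocessing order the outputs above suggest:
# lowercase -> expand contractions -> tokenize -> remove stop words.
contractions_demo = {"i've": "i have", "it's": "it is"}

def toy_preprocess(text, stopwords=()):
    text = text.lower()
    # expand contractions before tokenizing, so "i've" becomes two tokens
    for contraction, expanded in contractions_demo.items():
        text = text.replace(contraction, expanded)
    # keep alphabetic runs only, which also strips punctuation
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in stopwords]

print(toy_preprocess("I've been using it. It's great!", stopwords=["it"]))
# ['i', 'have', 'been', 'using', 'is', 'great']
```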
Processing a list of texts
text_list = ['I am ready', "This is great", "I love it"]
tk = text_tokenizer_xm(text=text_list, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=[])
# Use the .txt_pre_pros_all method instead when the input is a corpus
print(tk.txt_pre_pros_all())
print("---")
0 [be, readi]
1 [this, be, great]
2 [love, it]
dtype: object
---
The order of stop-word removal and lemmatization/stemming
The current algorithm performs lemmatization and stemming before stop-word removal. Therefore:

- Be careful when defining the list of stop words. For example, with stem_flag set to True, including the term "product" will also remove the term "production"; with lemma_flag set to True, it will also remove "products".
- When lemma_flag is set to True, words like "is" and "are" are lemmatized to "be". If "be" is not in the stop-word list, they will be kept. If you decide to perform lemmatization, it is recommended that you process your stop-word list accordingly.
"""Example"""text="products, production, is"stop_words=['product','is']tk=text_tokenizer_xm(text=text,lemma_flag=False,stem_flag=False, \ contractions=contractions,stopwords=stop_words)# Use the .txt_pre_pros_all method instead when the input is a corpusprint(tk.txt_pre_pros())
['products', 'production']
tk = text_tokenizer_xm(text=text, lemma_flag=True, stem_flag=False,
                       contractions=contractions, stopwords=stop_words)
print(tk.txt_pre_pros())
['production', 'be']
tk = text_tokenizer_xm(text=text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=stop_words)
print(tk.txt_pre_pros())
['be']
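To see why the order of operations matters, here is a toy comparison of stemming-then-filtering (the package's order) against filtering-then-stemming. `toy_stem` is a crude hypothetical stand-in for the real SnowballStemmer, for illustration only:

```python
def toy_stem(word):
    # crude suffix stripping, for demonstration only (not a real stemmer)
    for suffix in ("ions", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

tokens = ["products", "production", "is"]
stop_words = {"product", "is"}

# stem first, then remove stop words: both "products" and "production"
# collapse to "product" and are filtered out along with "is"
stem_then_filter = [t for t in map(toy_stem, tokens) if t not in stop_words]

# remove stop words first, then stem: only the exact token "is" is dropped
filter_then_stem = [toy_stem(t) for t in tokens if t not in stop_words]

print(stem_then_filter)   # []
print(filter_then_stem)   # ['product', 'product']
```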