Tokenization, including contraction expansion, lemmatization, and stemming.

Detailed description of the Python project tokenizer-xm


Introduction

This package is a collection of several packages I have found useful for text preprocessing, including gensim and nltk. I put them together to create a more comprehensive and convenient pipeline.

Installation

pip install tokenizer_xm

Usage

Processing a single text string

# The original setup snippet was garbled in extraction; the import path below
# is reconstructed from the package name, and example_text is reproduced from
# the "Original text" output shown further down.
from tokenizer_xm import text_tokenizer_xm, contractions

example_text = "This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used."
print("Original text:")
print(example_text)
print("---")
print("Simple Preprocessed:")
print("---")
tk = text_tokenizer_xm(text=example_text, lemma_flag=False, stem_flag=False, stopwords=[])
print(tk.txt_pre_pros())
print("---")
print("Pre-processing with regular contractions (e.g. I've -> I have):")
# In this package, I included a dictionary of regular contractions for your convenience
tk = text_tokenizer_xm(text=example_text, lemma_flag=False, stem_flag=False,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")
print("Pre-processing with lemmatization:")
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=False,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")
print("Pre-processing with lemmatization and stemming:")
# This package uses the SnowballStemmer from nltk.stem. I will try to make it customizable later
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")
print("Adding stop words")
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=["this", 'be', "an", 'it'])
print(tk.txt_pre_pros())
print("---")
Original text:
This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used.
---
Simple Preprocessed:
---
['this', 'is', 'an', 'amazing', 'product', 've', 'been', 'using', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'clearly', 'better', 'than', 'any', 'other', 'products', 've', 'used']
---
Pre-processing with regular contractions (e.g. I've -> I have):
['this', 'is', 'an', 'amazing', 'product', 'have', 'been', 'using', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'has', 'it', 'is', 'clearly', 'better', 'than', 'any', 'other', 'products', 'have', 'used']
---
Pre-processing with lemmatization:
['this', 'be', 'an', 'amaze', 'product', 'have', 'be', 'use', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clearly', 'better', 'than', 'any', 'other', 'product', 'have', 'use']
---
Pre-processing with lemmatization and stemming:
['this', 'be', 'an', 'amaz', 'product', 'have', 'be', 'use', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clear', 'better', 'than', 'ani', 'other', 'product', 'have', 'use']
---
Adding stop words
['amaz', 'product', 'have', 'use', 'for', 'almost', 'year', 'now', 'and', 'have', 'clear', 'better', 'than', 'ani', 'other', 'product', 'have', 'use']
---
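The contraction step above maps forms like "I've" to "I have" before tokenization. A minimal sketch of how such a dictionary-driven expansion can work is shown below; `contractions_demo` and `expand_contractions` are illustrative names and only a toy subset of entries, not the actual dictionary or implementation shipped with tokenizer_xm.

```python
import re

# Hypothetical, tiny contractions mapping for illustration only; the
# dictionary bundled with tokenizer_xm is larger.
contractions_demo = {
    "i've": "i have",
    "it's": "it is",
    "don't": "do not",
}

def expand_contractions(text, mapping):
    """Replace each contraction found in `mapping` with its expanded form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

print(expand_contractions("I've tried it and it's great", contractions_demo))
# -> i have tried it and it is great
```

Expanding before tokenization is what turns the lone "ve" token in the simple-preprocessed output into a proper "have".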

Processing a list of texts

text_list = ['I am ready', "This is great", "I love it"]
# Use the .txt_pre_pros_all method instead when the input is a corpus
tk = text_tokenizer_xm(text=text_list, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros_all())
print("---")
0          [be, readi]
1    [this, be, great]
2           [love, it]
dtype: object
---
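The `dtype: object` line indicates the corpus result is a pandas Series of token lists. A common follow-up step, sketched here with the token lists reproduced from the output above, is to join each list back into a whitespace-separated string for a downstream vectorizer:

```python
# Token lists as produced by .txt_pre_pros_all (values copied from the
# output above); a pandas Series behaves the same for this purpose.
processed = [["be", "readi"], ["this", "be", "great"], ["love", "it"]]

# Join each token list into one string, the input form most vectorizers
# (e.g. scikit-learn's CountVectorizer) expect.
docs = [" ".join(tokens) for tokens in processed]
print(docs)
# -> ['be readi', 'this be great', 'love it']
```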

The order of stop-word removal and lemmatization/stemming

The current algorithm performs lemmatization and stemming before stop-word removal. Therefore:

  1. You need to be careful when defining the stop-word list. For example, if stem_flag is set to True, including the term "product" will also remove the term "production"; if lemma_flag is set to True, it will also remove "productions".

  2. When lemma_flag is set to True, words like "is" and "are" are lemmatized to "be". If "be" is not in the stop-word list, they will remain. If you decide to perform lemmatization, it is recommended that you adjust your stop-word list accordingly.

"""Example"""
text = "products, production, is"
stop_words = ['product', 'is']
tk = text_tokenizer_xm(text=text, lemma_flag=False, stem_flag=False,
                       contractions=contractions, stopwords=stop_words)
print(tk.txt_pre_pros())
['products', 'production']
tk = text_tokenizer_xm(text=text, lemma_flag=True, stem_flag=False,
                       contractions=contractions, stopwords=stop_words)
print(tk.txt_pre_pros())
['production', 'be']
tk = text_tokenizer_xm(text=text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=stop_words)
print(tk.txt_pre_pros())
['be']
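The effect of running stemming before stop-word removal can be sketched without the package at all. `crude_stem` below is a hypothetical toy stemmer standing in for nltk's SnowballStemmer, which tokenizer_xm actually uses; the point is only the ordering:

```python
def crude_stem(word):
    # Toy stand-in for a real stemmer: strip a few suffixes from
    # sufficiently long words.
    for suffix in ("ions", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

tokens = ["products", "production", "is"]
stop_words = {"product", "is"}

# tokenizer_xm's order: stem first, then remove stop words. Both
# "products" and "production" collapse to "product" and are dropped.
stemmed = [crude_stem(t) for t in tokens]
print([t for t in stemmed if t not in stop_words])
# -> []

# Reverse order: remove stop words first, then stem. The surface forms
# "products" and "production" survive the stop-word filter.
kept = [t for t in tokens if t not in stop_words]
print([crude_stem(t) for t in kept])
# -> ['product', 'product']
```

This is why a stop-word entry like "product" reaches further than its literal spelling when stem_flag or lemma_flag is on.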
