Tokenization, including contraction expansion, lemmatization, and stemming.
tokenizer-xm: Python project description
Introduction
This package is a collection of several packages I have found useful for text preprocessing, including gensim and nltk. I put them together to create a more comprehensive and convenient pipeline.
Installation
pip install tokenizer_xm
Usage
Processing a single text string
from tokenizer_xm import text_tokenizer_xm, contractions  # import path assumed from the package name

example_text = ("This is an amazing product! I've been using it for almost a year "
                "now and it's clearly better than any other products I've used.")

print("Original text:")
print(example_text)
print("---")

print("Simple Preprocessed:")
print("---")
tk = text_tokenizer_xm(text=example_text, lemma_flag=False, stem_flag=False, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Pre-processing with regular contractions (e.g. I've -> I have):")
# In this package, I included a dictionary of regular contractions for your convenience
tk = text_tokenizer_xm(text=example_text, lemma_flag=False, stem_flag=False,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Pre-processing with lemmatization:")
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=False,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Pre-processing with lemmatization and stemming:")
# This package uses the SnowballStemmer from nltk.stem. I will try to make it customizable later
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=[])
print(tk.txt_pre_pros())
print("---")

print("Adding stop words")
tk = text_tokenizer_xm(text=example_text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=["this", 'be', "an", 'it'])
print(tk.txt_pre_pros())
print("---")
Original text:
This is an amazing product! I've been using it for almost a year now and it's clearly better than any other products I've used.
---
Simple Preprocessed:
---
['this', 'is', 'an', 'amazing', 'product', 've', 'been', 'using', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'clearly', 'better', 'than', 'any', 'other', 'products', 've', 'used']
---
Pre-processing with regular contractions (e.g. I've -> I have):
['this', 'is', 'an', 'amazing', 'product', 'have', 'been', 'using', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'has', 'it', 'is', 'clearly', 'better', 'than', 'any', 'other', 'products', 'have', 'used']
---
Pre-processing with lemmatization:
['this', 'be', 'an', 'amaze', 'product', 'have', 'be', 'use', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clearly', 'better', 'than', 'any', 'other', 'product', 'have', 'use']
---
Pre-processing with lemmatization and stemming:
['this', 'be', 'an', 'amaz', 'product', 'have', 'be', 'use', 'it', 'for', 'almost', 'year', 'now', 'and', 'it', 'have', 'it', 'be', 'clear', 'better', 'than', 'ani', 'other', 'product', 'have', 'use']
---
Adding stop words
['amaz', 'product', 'have', 'use', 'for', 'almost', 'year', 'now', 'and', 'have', 'clear', 'better', 'than', 'ani', 'other', 'product', 'have', 'use']
---
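The outputs above suggest the overall order of operations: lowercase, expand contractions, strip punctuation while tokenizing, then optionally lemmatize/stem and drop stop words. As a rough illustration of the contraction-expansion and tokenization steps only (not the package's actual implementation; `contractions_demo` and `toy_preprocess` are hypothetical names):

```python
import re

# Toy sketch of the preprocessing order the outputs above suggest:
# lowercase -> expand contractions -> tokenize -> remove stop words.
contractions_demo = {"i've": "i have", "it's": "it is"}

def toy_preprocess(text, stopwords=()):
    text = text.lower()
    # expand contractions before tokenizing, so "i've" becomes two tokens
    for contraction, expanded in contractions_demo.items():
        text = text.replace(contraction, expanded)
    # keep alphabetic runs only, which also strips punctuation
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in stopwords]

print(toy_preprocess("I've been using it. It's great!", stopwords=["it"]))
# ['i', 'have', 'been', 'using', 'is', 'great']
```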
Processing a list of texts
text_list = ['I am ready', "This is great", "I love it"]
tk = text_tokenizer_xm(text=text_list, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=[])
# Use the .txt_pre_pros_all method instead when the input is a corpus
print(tk.txt_pre_pros_all())
print("---")
0 [be, readi]
1 [this, be, great]
2 [love, it]
dtype: object
---
The order of stop-word removal and lemmatization/stemming
The current algorithm performs lemmatization and stemming before stop-word removal. Therefore:

- Be careful when defining the list of stop words. For example, with stem_flag set to True, including the term "product" will also remove the term "production"; with lemma_flag set to True, it will also remove "products".
- When lemma_flag is set to True, words like "is" and "are" are lemmatized to "be". If "be" is not in the stop-word list, they will be kept. If you decide to perform lemmatization, it is recommended that you process your stop-word list accordingly.
"""Example"""text="products, production, is"stop_words=['product','is']tk=text_tokenizer_xm(text=text,lemma_flag=False,stem_flag=False, \ contractions=contractions,stopwords=stop_words)# Use the .txt_pre_pros_all method instead when the input is a corpusprint(tk.txt_pre_pros())
['products', 'production']
tk = text_tokenizer_xm(text=text, lemma_flag=True, stem_flag=False,
                       contractions=contractions, stopwords=stop_words)
print(tk.txt_pre_pros())
['production', 'be']
tk = text_tokenizer_xm(text=text, lemma_flag=True, stem_flag=True,
                       contractions=contractions, stopwords=stop_words)
print(tk.txt_pre_pros())
['be']
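To see why the order of operations matters, here is a toy comparison of stemming-then-filtering (the package's order) against filtering-then-stemming. `toy_stem` is a crude hypothetical stand-in for the real SnowballStemmer, for illustration only:

```python
def toy_stem(word):
    # crude suffix stripping, for demonstration only (not a real stemmer)
    for suffix in ("ions", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

tokens = ["products", "production", "is"]
stop_words = {"product", "is"}

# stem first, then remove stop words: both "products" and "production"
# collapse to "product" and are filtered out along with "is"
stem_then_filter = [t for t in map(toy_stem, tokens) if t not in stop_words]

# remove stop words first, then stem: only the exact token "is" is dropped
filter_then_stem = [toy_stem(t) for t in tokens if t not in stop_words]

print(stem_then_filter)   # []
print(filter_then_stem)   # ['product', 'product']
```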