Python textgo package - PyPI

Let's go and play with text!

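Project description for textgo

TextGo

TextGo is a Python package that helps you process text data conveniently and efficiently. It is an NLP tool that provides APIs for text preprocessing, representation, similarity calculation, text search, and classification. It supports both English and Chinese.

Highlights

Supports both English and Chinese text preprocessing
Provides various text representation algorithms, including BOW, TF-IDF, LDA, LSA, PCA, Word2Vec/GloVe/FastText, and BERT
Supports fast text search based on Faiss
Supports various text classification algorithms, including FastText, TextCNN, TextRNN, TextRCNN_Att, Bert, and XLNet
Easy to use: each task takes only a few lines of code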

Installation

Install and update using pip:
pip install textgo

Note: tested successfully on Python 3.
Tip: the fasttext package needs to be installed manually, as follows:

git clone https://github.com/facebookresearch/fastText.git
cd fastText
make
pip install .
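
After installing, it can help to confirm that both packages are importable before running the examples. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the module names `textgo` and `fasttext` are taken from the install commands above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["textgo", "fasttext"])
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means both packages were found on the current Python path.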

Getting Started

1. Text preprocessing

Clean text

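from textgo import Preprocess
# Chinese
# Note: the input strings are reconstructed from the printed output below
tp1 = Preprocess(lang='zh')
texts1 = ["<text>自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。</text>", "文本预处理~其实很简单！"]
ptexts1 = tp1.clean(texts1)
print(ptexts1)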

Output: ['自然语言处理是计算机科学领域与人工智能领域中的一个重要方向', '文本预处理其实很简单']

# English
tp2 = Preprocess(lang='en')
texts2 = ["<text>Natural Language Processing, usually shortened as NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language</text>"]
ptexts2 = tp2.clean(texts2)
print(ptexts2)

Output: ['natural language processing usually shortened as nlp is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language']
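
To make the cleaning step concrete, here is a minimal sketch of what such a step might do: strip markup tags, lowercase, drop punctuation, and collapse whitespace. It is not TextGo's implementation; the function name `clean_text` and both regular expressions are assumptions chosen to reproduce the outputs shown above.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")       # markup such as <text> ... </text>
NON_WORD_RE = re.compile(r"[^\w\s]")  # punctuation and symbols (Unicode-aware)


def clean_text(text):
    """Strip tags, lowercase, drop punctuation, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub("", text.lower())
    return " ".join(text.split())


print(clean_text("<text>Natural Language Processing, usually shortened as NLP.</text>"))
print(clean_text("文本预处理~其实很简单！"))
```

Because `\w` is Unicode-aware in Python 3, the same pattern keeps Chinese characters while removing full-width punctuation such as `！`.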

Tokenize and drop stopwords

# Chinese
tokens1 = tp1.tokenize(ptexts1)
print(tokens1)

Output: [['自然语言', '处理', '计算机科学', '领域', '人工智能', '领域', '中', '重要', '方向'], ['文本', '预处理', '其实', '很', '简单']]

# English
tokens2 = tp2.tokenize(ptexts2)
print(tokens2)

Output: [['natural', 'language', 'processing', 'usually', 'shortened', 'nlp', 'branch', 'artificial', 'intelligence', 'deals', 'interaction', 'computers', 'humans', 'using', 'natural', 'language']]
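
For English, tokenizing and dropping stopwords can be sketched as a whitespace split followed by a set lookup. The snippet below is illustrative only: the `STOPWORDS` set is a tiny hypothetical list picked to match the example above, not the stopword list TextGo ships with.

```python
# Hypothetical stopword list, just large enough for the example sentence.
STOPWORDS = {"as", "is", "a", "of", "that", "with", "the", "between", "and"}


def tokenize(text, stopwords=STOPWORDS):
    """Split cleaned text on whitespace and drop stopwords."""
    return [tok for tok in text.split() if tok not in stopwords]


cleaned = ("natural language processing usually shortened as nlp is a branch of "
           "artificial intelligence that deals with the interaction between "
           "computers and humans using the natural language")
print(tokenize(cleaned))
```

Chinese text has no spaces between words, so it would need a word segmenter in place of `split()`.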

Preprocess (Clean + Tokenize + Remove stopwords + Join words)

# Chinese
ptexts1 = tp1.preprocess(texts1)
print(ptexts1)

Output: ['自然语言 处理 计算机科学 领域 人工智能 领域 中 重要 方向', '文本 预处理 其实 很 简单']

# English
ptexts2 = tp2.preprocess(texts2)
print(ptexts2)

Output: ['natural language processing usually shortened nlp branch artificial intelligence deals interaction computers humans using natural language']

2. Text representation

from textgo import Embeddings
ptexts = ['自然语言 处理 计算机科学 领域 人工智能 领域 中 重要 方向', '文本 预处理 其实 很 简单']
emb = Embeddings()
# BOW
bow_emb = emb.bow(ptexts)

# TF-IDF
tfidf_emb = emb.tfidf(ptexts)

# LDA
lda_emb = emb.lda(ptexts, dim=2)

# LSA
lsa_emb = emb.lsa(ptexts, dim=2)

# PCA
pca_emb = emb.pca(ptexts, dim=2)

# Word2Vec
w2v_emb = emb.word2vec(ptexts, method='word2vec', model_path='model/word2vec.bin')

# GloVe
glove_emb = emb.word2vec(ptexts, method='glove', model_path='model/glove.bin')

# FastText
ft_emb = emb.word2vec(ptexts, method='fasttext', model_path='model/fasttext.bin')

# BERT
bert_emb = emb.bert(ptexts, model_path='model/bert-base-chinese')
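
As a rough illustration of the two simplest representations, the sketch below builds bag-of-words counts and textbook TF-IDF weights (raw count times log(N/df), no smoothing) with the standard library only. The function names `bow` and `tfidf` here are standalone helpers, not the `Embeddings` methods, and TextGo's own weighting may differ.

```python
import math
from collections import Counter


def bow(texts):
    """Return (vocabulary, count vectors) for space-separated texts."""
    docs = [Counter(text.split()) for text in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = [[doc[tok] for tok in vocab] for doc in docs]
    return vocab, vectors


def tfidf(texts):
    """Return (vocabulary, TF-IDF vectors) using idf = log(N / df)."""
    vocab, counts = bow(texts)
    n_docs = len(texts)
    # Document frequency: in how many texts each vocabulary term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    vectors = [[row[j] * idf[j] for j in range(len(vocab))] for row in counts]
    return vocab, vectors


vocab, vectors = bow(["自然语言 处理 领域 领域", "文本 预处理 领域"])
print(vocab)
print(vectors)
```

A term that appears in every text gets an IDF of zero under this unsmoothed formula, which is why libraries often add smoothing.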

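Tip: for methods such as Word2Vec and BERT, you can load the model first and then get the embeddings, which avoids loading the model repeatedly. Take BERT as an example: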

emb.load_model(method="bert", model_path='model/bert-base-chinese')
bert_emb1 = emb.bert(ptexts1)
bert_emb2 = emb.bert(ptexts2)
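
The load-once idea can be sketched independently of TextGo with a cached loader. Everything below is a stand-in: `load_model` and `embed` are hypothetical names, the "model" is a plain dict, and the "embedding" is a dummy value that only keeps the sketch runnable.

```python
from functools import lru_cache

load_calls = []  # records every real load, to show that caching works


@lru_cache(maxsize=None)
def load_model(model_path):
    """Stand-in loader: pretend to read a large model from disk."""
    load_calls.append(model_path)
    return {"path": model_path}  # placeholder for a real model object


def embed(texts, model_path):
    """Return dummy per-text values after fetching the (cached) model."""
    load_model(model_path)  # served from the cache after the first call
    return [float(len(text)) for text in texts]


embed(["first batch"], "model/bert-base-chinese")
embed(["second batch"], "model/bert-base-chinese")
print(len(load_calls))
```

The second call reuses the object loaded by the first, so the expensive load happens only once per model path.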

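3. Similarity calculation

TextGo can compute the similarity between texts from any of the representations above. For example, BERT sentence embeddings can be used to compute the cosine similarity between two lists of sentences, pair by pair.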

from textgo import TextSim
texts1 = ["她的笑渐渐变少了。","最近天气晴朗适合出去玩！"]
texts2 = ["她变得越来越不开心了。","近来总是风雨交加没法外出！"]

ts = TextSim(lang='zh', method='bert', model_path='model/bert-base-chinese')
sim = ts.similarity(texts1, texts2, mutual=False)
print(sim)

Output: [0.9143135, 0.7350756]

In addition, setting mutual=True computes the similarity between every sentence in one dataset and every sentence in the other.

sim = ts.similarity(texts1, texts2, mutual=True)
print(sim)

Output: array([[0.9143138 , 0.772496 ], [0.704296 , 0.73507595]], dtype=float32)
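
The difference between the two modes can be shown on plain vectors. This is a minimal sketch, assuming the sentences have already been embedded; the helper names `cosine` and `similarity` are illustrative and only mirror the shape of the outputs above (a flat list for pairwise, a matrix for mutual).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(vecs1, vecs2, mutual=False):
    """Pairwise scores by default; a full len(vecs1) x len(vecs2) matrix if mutual."""
    if mutual:
        return [[cosine(u, v) for v in vecs2] for u in vecs1]
    return [cosine(u, v) for u, v in zip(vecs1, vecs2)]


a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [1.0, 1.0]]
print(similarity(a, b))
print(similarity(a, b, mutual=True))
```

The diagonal of the mutual matrix holds the same values as the pairwise list.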

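4. Text search

TextGo also supports searching a large text database for query texts, based on cosine similarity or Euclidean distance. Two implementations are provided: a plain one suited to small datasets and an optimized one, based on Faiss, suited to large datasets.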

from textgo import TextSim
# query texts
texts1 = ["A soccer game with multiple males playing."]
# database
texts2 = ["Some men are playing a sport.", "A man is driving down a lonely road.", "A happy woman in a fairy costume holds an umbrella."]
ts = TextSim(lang='en', method='word2vec', model_path='model/word2vec.bin')

Normal search

res = ts.get_similar_res(texts1, texts2, metric='cosine', threshold=0.5, topn=2)
print(res)

Output: [[(0, 'Some men are playing a sport.', 0.828474), (1, 'A man is driving down a lonely road.', 0.60927737)]]

Fast search

ts.build_index(texts2, metric='cosine')
res = ts.search(texts1, threshold=0.5, topn=2)
print(res)

Output: [[(0, 'Some men are playing a sport.', 0.828474), (1, 'A man is driving down a lonely road.', 0.60927737)]]
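
For intuition, the plain search can be approximated by scoring every database entry, filtering by threshold, and keeping the top results. The sketch below works on pre-computed vectors; the function name `search`, the use of `>=` for the threshold, and treating the first tuple element as the database index are all assumptions. A Faiss index would replace the inner loop on large datasets.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vecs, db_vecs, db_texts, threshold=0.5, topn=2):
    """For each query, return up to `topn` (index, text, score) hits above threshold."""
    results = []
    for query in query_vecs:
        scored = [(i, db_texts[i], cosine(query, vec)) for i, vec in enumerate(db_vecs)]
        hits = [hit for hit in scored if hit[2] >= threshold]
        hits.sort(key=lambda hit: hit[2], reverse=True)  # best match first
        results.append(hits[:topn])
    return results


db_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
db_texts = ["doc a", "doc b", "doc c"]
print(search([[1.0, 0.0]], db_vecs, db_texts))
```

This brute-force version costs one similarity computation per query and database entry, which is why an index is preferable once the database grows.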

5. Text classification

Train a text classifier with just a few lines of code. Supported models: FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, XLNet.

from textgo import Classifier

# Prepare data
X_train = [text1, text2, ..., textn]
y_train = [label1, label2, ..., labeln]

# load config
config_path = "./config.ini"  # Include all model parameters
model_name = "Bert" # Supported models: FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, XLNet
args = load_config(config_path, model_name) 
args['model_name'] = model_name 
args['save_path'] = "output/%s"%model_name

# train 
clf = Classifier(args) 
clf.train(X_train, y_train, evaluate_test=False) # If evaluate_test=True, then it will split 10% for test dataset and evaluate on test dataset. 

# predict
predclass = clf.predict(X_train)
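
The layout of config.ini is not shown above. As a hedged illustration of how per-model parameters could be read from an INI file, the sketch below uses the standard-library `configparser`. The helper `load_section`, the section layout, and every key and value are hypothetical and are not TextGo's actual configuration format.

```python
import configparser

# Hypothetical INI content: shared defaults plus one section per model.
SAMPLE_CONFIG = """
[DEFAULT]
batch_size = 32

[Bert]
learning_rate = 2e-5
max_len = 128
"""


def load_section(config_text, model_name):
    """Return the parameters for `model_name`, with [DEFAULT] values merged in."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return dict(parser[model_name])


args = load_section(SAMPLE_CONFIG, "Bert")
args["model_name"] = "Bert"
print(args)
```

Note that `configparser` returns every value as a string, so numeric parameters would need explicit conversion before use.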

Resources

1. Pre-trained word embeddings

Chinese
Word vectors trained on various corpora: https://github.com/Embedding/Chinese-Word-Vectors
Tencent AI Lab

English
GloVe: https://nlp.stanford.edu/projects/glove/
fastText: https://fasttext.cc/docs/en/english-vectors.html
word2vec: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

2. Pre-trained models

https://huggingface.co/models

License

TextGo is MIT-licensed.

Welcome to join the QQ group: 979659372

textgo 1.4
