finbert-embedding Python package - PyPI

Embeddings from Financial BERT

Detailed project description of finbert-embedding

FinBERT Embedding

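Token-level and sentence-level embeddings from the FinBERT model (financial domain).

BERT, published by Google, is conceptually simple and empirically powerful; it obtained state-of-the-art results on eleven natural language processing tasks.

The goal of this project is to obtain word or sentence embeddings from the pre-trained FinBERT model (University of Amsterdam). FinBERT is a BERT language model further trained on financial news articles to adapt it to the financial domain. It achieved state-of-the-art results on FiQA sentiment scoring and the Financial PhraseBank dataset. Paper here.

You can use the FinBERT embeddings directly to build models such as classifiers for various financial entities.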

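Features

Provides an abstraction that hides the details of running inference with the pre-trained FinBERT model.
Requires only two lines of code to get sentence-level or token-level encodings for a text sentence.
Handles OOV (out-of-vocabulary) words internally.
Downloads and installs the pre-trained FinBERT model on first initialization (usage is covered in the next section).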
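The OOV handling mentioned above can be pictured as averaging the vectors of a word's subword pieces. The sketch below is an illustration only, not the package's internal code: the function name `merge_subword_vectors`, the example pieces, and the toy 4-dimensional vectors are all assumptions made for the demo.

```python
import numpy as np

def merge_subword_vectors(pieces, vectors):
    """Average WordPiece vectors back into one vector per whole word.

    pieces  : list of WordPiece strings; continuation pieces start with '##'
    vectors : array of shape (len(pieces), dim), one row per piece
    """
    words, groups = [], []
    for piece, vec in zip(pieces, vectors):
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # extend the current word
            groups[-1].append(vec)
        else:
            words.append(piece)      # start a new word
            groups.append([vec])
    merged = np.stack([np.mean(group, axis=0) for group in groups])
    return words, merged

# Toy 4-dimensional vectors stand in for 768-dimensional FinBERT output.
pieces = ["em", "##bed", "##ding", "quality"]
vectors = np.arange(16, dtype=float).reshape(4, 4)

words, merged = merge_subword_vectors(pieces, vectors)
print(words)         # ['embedding', 'quality']
print(merged.shape)  # (2, 4)
```

Averaging keeps one vector per original word, so the output length matches the word count rather than the subword count.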

Installation

(Creating a conda env is recommended, to isolate the install and avoid dependency conflicts.)

pip install finbert-embedding==0.1.4

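Note: if you get an error while installing this package (a common error with TensorFlow):

Installing collected packages: wrapt, tensorflow
Found existing installation: wrapt 1.10.11
ERROR: Cannot uninstall 'wrapt'. It is a distutils installed project...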

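In that case, reinstall wrapt while ignoring the existing installation, then install the package again:

pip install wrapt --upgrade --ignore-installed
pip install finbert-embedding==0.1.4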

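Usage 1

The word embeddings are returned as a list with one 768-dimensional embedding per word.
The sentence embedding is a single 768-dimensional embedding, the average over all tokens.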

from finbert_embedding.embedding import FinbertEmbedding

text = "Another PSU bank, Punjab National Bank which also reported numbers managed to see a slight improvement in asset quality."

# Class Initialization (You can set default 'model_path=None' as your finetuned BERT model path while Initialization)
finbert = FinbertEmbedding()

word_embeddings = finbert.word_vector(text)
sentence_embedding = finbert.sentence_vector(text)

print("Text Tokens: ", finbert.tokens)
# Text Tokens:  ['another', 'psu', 'bank', ',', 'punjab', 'national', 'bank', 'which', 'also', 'reported', 'numbers', 'managed', 'to', 'see', 'a', 'slight', 'improvement', 'in', 'asset', 'quality', '.']

print('Shape of Word Embeddings: %d x %d' % (len(word_embeddings), len(word_embeddings[0])))
# Shape of Word Embeddings: 21 x 768

print("Shape of Sentence Embedding = ", len(sentence_embedding))
# Shape of Sentence Embedding =  768
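The sentence vector described above is the mean of the token vectors. A minimal NumPy sketch of that averaging step follows; it uses random stand-in vectors rather than real FinBERT output, so the values themselves carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the word vectors: 21 tokens, 768 dimensions each (random data).
word_embeddings = [rng.normal(size=768) for _ in range(21)]

# Average over the token axis to get a single 768-dimensional vector.
sentence_embedding = np.mean(np.stack(word_embeddings), axis=0)

print(sentence_embedding.shape)  # (768,)
```

Because every token contributes equally, long sentences with many function words can dilute the signal from the few informative tokens.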

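Usage 2

A decent representation for a downstream task is not necessarily meaningful in terms of cosine distance, because cosine distance treats the space as linear and weights all dimensions equally. If you use cosine distance anyway, rely on the ranking it produces, not on the absolute value.

That is, do not use:
if cosine(A, B) > 0.9, then A and B are similar

Consider this instead:
if cosine(A, B) > cosine(A, C), then A is more similar to B than to C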

from finbert_embedding.embedding import FinbertEmbedding
from scipy.spatial.distance import cosine

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."

finbert = FinbertEmbedding()
word_embeddings = finbert.word_vector(text)

# Token indices: 5 = "bank" (vault), 9 = "bank" (robber), 18 = "bank" (river)
diff_bank = 1 - cosine(word_embeddings[9], word_embeddings[18])
same_bank = 1 - cosine(word_embeddings[9], word_embeddings[5])

print('Vector similarity for similar bank meanings (bank vault & bank robber):  %.2f' % same_bank)
print('Vector similarity for different bank meanings (bank robber & river bank):  %.2f' % diff_bank)
# Vector similarity for similar bank meanings (bank vault & bank robber):  0.92
# Vector similarity for different bank meanings (bank robber & river bank):  0.64
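The advice to compare cosine values against each other instead of against a fixed threshold can be sketched as a ranking function. The helper names `cosine_similarity` and `rank_by_similarity` and the 3-dimensional toy vectors are assumptions for illustration; real FinBERT vectors would have 768 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

query = np.array([1.0, 0.0, 0.0])
candidates = {
    "B": np.array([0.9, 0.1, 0.0]),
    "C": np.array([0.1, 0.9, 0.0]),
    "D": np.array([-1.0, 0.0, 0.0]),
}

ranking = rank_by_similarity(query, candidates)
print(ranking)  # ['B', 'C', 'D']
```

Only the order of the scores is used, so the result does not depend on choosing a cutoff such as 0.9.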

Warning

According to BERT author Jacob Devlin: "I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally)."

However, the [CLS] token does become meaningful once the model has been fine-tuned, in which case the last hidden layer of this token is used as the "sentence vector" for downstream sequence classification tasks. This package encodes sentences in a similar manner.

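To Do (Next Version)

Extend it to provide word embeddings for a paragraph or document (currently it takes a single sentence as input). Until then, split the paragraph or document into sentences with Spacy or NLTK before using finbert_embedding.
Add batch processing.
Add more ways of handling OOV words (currently, the average of all tokens of an OOV word is used).
Ingest and extend it to more pre-trained financial models.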
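Since the package takes one sentence at a time, a paragraph has to be split first and each sentence encoded separately. The sketch below shows that loop under stated assumptions: `split_sentences` is a naive regex splitter standing in for Spacy or NLTK, and `fake_encoder` is a hypothetical placeholder so the example runs without downloading the model.

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]

def embed_paragraph(paragraph, encode_sentence):
    """Apply a one-sentence encoder to each sentence of a paragraph."""
    return [(s, encode_sentence(s)) for s in split_sentences(paragraph)]

# Hypothetical encoder used only for this demo. With the package installed,
# FinbertEmbedding().sentence_vector could be passed in its place.
def fake_encoder(sentence):
    return [float(len(sentence))]

paragraph = "Profit rose in the third quarter. Asset quality improved! Will it last?"
results = embed_paragraph(paragraph, fake_encoder)
for sentence, vector in results:
    print(sentence, vector)
```

The regex splitter will break on abbreviations such as "Inc." or "U.S.", which is one reason a dedicated sentence tokenizer is the better choice for real financial text.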

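Future Goal

Create a generic downstream framework, built on various FinBERT language models, for any labelled financial text classification task, such as sentiment classification, financial news classification, or financial document classification.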
