Fill in the blanks with BERT
Detailed description of the fitbert Python project
FitBert
FitBert ((F)ill (i)n (t)he blanks, (BERT)) is a library for using BERT to fill in the blank(s) in a piece of text from a list of options. Here is the expected use case for FitBert:
- a service (a statistical model, or something simpler) suggests replacements/corrections for a segment of text
- that service is specialized to a domain, and isn't good at general English, e.g. grammar
- that service passes along the segment of text, with the words to replace identified, and a list of suggestions
- FitBert crushes all but the best suggestion :muscle:
Installation
This software is currently unlicensed. If you don't work at Qordoba, you can't legally use it. However, we are working on releasing it, probably under Apache 2.0, and it will be available as a PyPI package.
Usage
A Jupyter notebook with a short introduction is available here.
FitBert will automatically use the GPU if `torch.cuda.is_available()`. This is a good thing, because CPU inference times are really bad compared to GPU inference times.
Using as a library / in a server
```python
from fitbert import FitBert

# in theory you can pass a model_name and tokenizer, but currently only
# bert-large-uncased and BertTokenizer are available
# this takes a while and loads a whole big BERT into memory
fb = FitBert()

masked_string = "Why Bert, you're looking ***mask*** today!"
options = ['buff', 'handsome', 'strong']

ranked_options = fb.rank(masked_string, options=options)
# >>> ['handsome', 'strong', 'buff']

# or
filled_in = fb.fitb(masked_string, options=options)
# >>> "Why Bert, you're looking handsome today!"
```
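Conceptually, `rank` fills the blank with each candidate and orders the candidates by how probable BERT finds the resulting sentence. A toy sketch of that loop, with a stand-in scorer so it runs without a model (`rank_options` and `toy_score` are illustrative names, not part of the library, which scores with BERT's masked-LM probabilities):

```python
from typing import Callable, List

MASK = "***mask***"

def rank_options(masked: str, options: List[str],
                 score: Callable[[str], float]) -> List[str]:
    """Fill the mask with each option and sort the options by score, best first."""
    filled = {opt: masked.replace(MASK, opt) for opt in options}
    return sorted(options, key=lambda opt: score(filled[opt]), reverse=True)

# stand-in scorer that prefers shorter sentences, just to make this runnable;
# FitBert would use BERT's probability of the filled-in sentence instead
toy_score = lambda sentence: -float(len(sentence))

print(rank_options("Why Bert, you're looking ***mask*** today!",
                   ['buff', 'handsome', 'strong'], toy_score))
# -> ['buff', 'strong', 'handsome']
```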
We commonly find ourselves knowing which verb to suggest, but not which conjugation:
```python
from fitbert import FitBert

fb = FitBert()

masked_string = "Why Bert, you're ***mask*** handsome today!"
options = ['looks']

filled_in = fb.fitb(masked_string, options=options)
# >>> "Why Bert, you're looking handsome today!"

# under the hood, we notice there is only one suggestion and act as if
# fitb was called with delemmatize=True:
filled_in = fb.fitb(masked_string, options=options, delemmatize=True)
```
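The idea behind delemmatization is to expand each suggested lemma into its inflected forms before ranking. A toy illustration with a hand-written form table (the table and the `delemmatize_options` helper are assumptions for illustration; FitBert derives the forms itself):

```python
from typing import Dict, List

# assumption for illustration only: a tiny inflection table
FORMS: Dict[str, List[str]] = {
    "looks": ["look", "looks", "looking", "looked"],
}

def delemmatize_options(options: List[str]) -> List[str]:
    """Expand each option into its known inflected forms, falling back to itself."""
    expanded: List[str] = []
    for opt in options:
        expanded.extend(FORMS.get(opt, [opt]))
    return expanded

print(delemmatize_options(["looks"]))
# -> ['look', 'looks', 'looking', 'looked']
```

The expanded list would then be ranked against the masked sentence exactly as in the `rank` example, which is how "looks" can end up filled in as "looking".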
If you are already using `pytorch_pretrained_bert.BertForMaskedLM` and have an instance of `BertForMaskedLM` instantiated, you can pass it in to reuse it:
```python
BLM = pytorch_pretrained_bert.BertForMaskedLM.from_pretrained(model_name)
fb = FitBert(model=BLM)
```
You can also have FitBert mask the string for you:
```python
from fitbert import FitBert

fb = FitBert()

unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = (17, 22)
masked_string, masked = fb.mask(unmasked_string, span_to_mask)
# >>> "Why Bert, you're ***mask*** handsome today!", 'looks'

# you can set options = [masked] or use any List[str]
options = [masked]

filled_in = fb.fitb(masked_string, options=options)
# >>> "Why Bert, you're looking handsome today!"
```
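The span above is ordinary Python slicing: characters 17 through 21 of the string are the word "looks". A hypothetical re-implementation of the masking step (not the library's actual code; the `***mask***` token is taken from the examples above) makes the semantics concrete:

```python
from typing import Tuple

MASK_TOKEN = "***mask***"  # assumption: the token FitBert uses, per the examples above

def mask_span(text: str, span: Tuple[int, int]) -> Tuple[str, str]:
    """Replace text[start:stop] with the mask token, returning (masked, removed)."""
    start, stop = span
    removed = text[start:stop]
    masked = text[:start] + MASK_TOKEN + text[stop:]
    return masked, removed

masked, removed = mask_span("Why Bert, you're looks handsome today!", (17, 22))
# masked  -> "Why Bert, you're ***mask*** handsome today!"
# removed -> 'looks'
```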
And there is a convenience method for doing this:
```python
unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = (17, 22)
filled_in = fb.mask_fitb(unmasked_string, span_to_mask)
# >>> "Why Bert, you're looking handsome today!"
```
Clients
If you will be sending strings to a FitBert server, you need to either mask the strings yourself or identify the span that should be masked:
```python
from fitbert import FitBert

s = "This might be justified as a means of signalling the connection between drunken driving and fatal accidents."

better_string, span_to_change = MyRuleBasedNLPModel.remove_overly_fancy_language(s)
assert better_string == "This might be justified to signalling the connection between drunken driving and fatal accidents.", \
    "Notice 'as a means of' became 'to', but we didn't re-conjugate signalling, or fix the spelling mistake"
assert span_to_change == (27, 37), \
    "This span is the start and stop of the characters for the substring 'signalling'."

masked_string, replaced_substring = FitBert.mask(better_string, span_to_change)
assert masked_string == "This might be justified to ***mask*** the connection between drunken driving and fatal accidents."
assert replaced_substring == "signalling"

FitBertServer.fitb(masked_string, options=[replaced_substring])
```
The advantage of doing it this way when masking yourself is that you don't have to know if the mask token used internally ever changes. Also, you don't need to create an instance of FitBert, so you don't pay the cost of downloading the pretrained BERT model.
However, you could also write a `CallFitBertServer` function that takes an unmasked string and a span, something like:

```python
FitBertServer.mask_fitb(better_string, span_to_change)
```

and then not need FitBert in the client at all.
Development
Run the tests with `python -m pytest`, or with `python -m pytest -m "not slow"` to skip the roughly 20 seconds it takes to load the pretrained BERT.
Acknowledgements
I am trying to get in touch with NodoBird, the artist behind the wonderful portrait of Bert shown above.