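Solving TOEIC blank problems with a PyTorch pretrained BERT model.
Detailed description of the toeicbert Python project
TOEIC-BERT
76% correct rate with ONLY the pre-trained BERT model on TOEIC!!
This project tackles TOEIC (Test of English for International Communication) problem solving using the pytorch-pretrained-BERT model. I chose huggingface's pytorch-pretrained-BERT because it makes pre-training and fine-tuning easier. I solved only the fill-in-the-blank problems, not every TOEIC problem type. There are two types of blank problems: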
- Selecting the correct grammar type.
Q) The teacher had me _________ scales several times a day.
1. play (Answer)
2. to play
3. played
4. playing
- Selecting the correct vocabulary type.
Q) The wet weather _________ her from going shopping.
1. interrupted
2. obstructed
3. impeded
4. discouraged (Answer)
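Why BERT?
A pretrained BERT model carries contextual information, so it can tell, at least to some degree, which completed sentence is more contextually or grammatically plausible. I was inspired by the grammar checker described in a blog post: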
Can We Use BERT as a Language Model to Assign a Score to a Sentence?
BERT uses a bidirectional encoder to encapsulate a sentence from left to right and from right to left. Thus, it learns two representations of each word-one from left to right and one from right to left-and then concatenates them for many downstream tasks.
Evaluation
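I evaluated only the **pretrained BERT model (not fine-tuned)** to check for grammatical or lexical errors. In the scoring notation, `X` is a question sentence and `n` indexes the answer options `{a, b, c, d}`. The subset `C` is the set of tokens of an answer candidate: for `warranty`, `C` is `['warrant', '##y']`. `V` is the whole vocabulary.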
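To illustrate how a candidate such as `warranty` becomes `['warrant', '##y']`, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, introduced only for this example; the real BERT tokenizer uses the full vocabulary file shipped with each model.

```python
def wordpiece_tokenize(word, vocab):
    """Split one word into WordPiece tokens by greedy longest match.

    Pieces that do not start the word carry the '##' prefix.
    Returns ['[UNK]'] if some part of the word cannot be matched.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not BERT's real vocab).
toy_vocab = {"warrant", "##y", "is", "being", "formed", "play"}

print(wordpiece_tokenize("warranty", toy_vocab))  # ['warrant', '##y']
```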
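A candidate can consist of more than one token. I handled this by averaging the values over the candidate's tokens. For example, `is being formed` is tokenized as `['is', 'being', 'formed']`.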
Then we take the argmax over `L_n(T_n)`.
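One way to write the scoring rule so that it agrees with these definitions is sketched here. The symbol `s_X(t)` (the model's output value for vocabulary entry `t` at the masked position of `X`) and the subscripted `C_n` are notation assumed for this sketch. Whether the values are raw outputs or normalized probabilities is also an assumption.

```latex
L_n(T_n) = \frac{1}{|C_n|} \sum_{t \in C_n} s_X(t),
\qquad C_n \subseteq V,
\qquad \hat{n} = \operatorname*{arg\,max}_{n \in \{a, b, c, d\}} L_n(T_n)
```

Here `T_n` is the text of option `n` and `C_n` is its token set, so a single-token option is scored by that token alone.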
    predictions = model(question_tensors, segment_tensors)
    # predictions : [batch_size, sequence_length, vocab_size]
    predictions_candidates = predictions[0, masked_index, candidate_ids].mean()
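The averaging and argmax steps can be tried without loading a model. The sketch below uses plain Python lists in place of tensors: `mask_logits` stands for the vocabulary-sized row `predictions[0, masked_index]`, and the token ids are made-up toy values. The helper names `candidate_score` and `pick_answer` are hypothetical and not part of the toeicbert package.

```python
def candidate_score(mask_logits, token_ids):
    """Mean of the model outputs at the masked position over a candidate's tokens."""
    return sum(mask_logits[i] for i in token_ids) / len(token_ids)


def pick_answer(mask_logits, candidates):
    """Score every option and return (best_label, all_scores)."""
    scores = {
        label: candidate_score(mask_logits, ids)
        for label, ids in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (assumed): a 6-entry "vocabulary" row and four options,
# two of which are split into more than one token.
mask_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 1.0]
candidates = {"1": [4], "2": [3, 4], "3": [1], "4": [2, 5]}

best, scores = pick_answer(mask_logits, candidates)
print(best)    # 1
print(scores)  # {'1': 3.0, '2': 1.75, '3': 2.0, '4': 0.0}
```

Averaging keeps multi-token options on the same scale as single-token ones. Summing instead would favor or penalize an option purely for having more tokens.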
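Evaluation results.
Strong results using only the pretrained BERT model:
- `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
- `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
- `bert-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters
- `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
Evaluated on 7067 problems in total. Make sure to call `model.eval()` before evaluating.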
| | bert-base-uncased | bert-base-cased | bert-large-uncased | bert-large-cased |
|---|---|---|---|---|
| Correct Num | 5192 | 5398 | 5321 | 5148 |
| Percent | 73.46% | 76.38% | 75.29% | 72.84% |
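The percentages follow directly from the correct counts and the 7067-problem total. The short check below reproduces them. It truncates to two decimals rather than rounding, because that is what matches the table values (rounding would give 73.47 and 72.85 for the first and last columns). The helper name `truncated_percent` is hypothetical.

```python
TOTAL = 7067  # number of evaluated problems, from the text

correct = {
    "bert-base-uncased": 5192,
    "bert-base-cased": 5398,
    "bert-large-uncased": 5321,
    "bert-large-cased": 5148,
}


def truncated_percent(num_correct, total):
    """Percentage truncated (not rounded) to two decimal places."""
    hundredths = (num_correct * 10000) // total  # integer math avoids float drift
    return hundredths / 100


percents = {name: truncated_percent(n, TOTAL) for name, n in correct.items()}
for name, pct in percents.items():
    print(f"{name}: {pct:.2f}%")
```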
Quick start with the Python pip package.
Install with pip
$ pip install toeicbert
Run & options
$ python toeicbert -m bert-base-uncased -f test.json
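- `-m, --model`: BERT model name from huggingface's pytorch-pretrained-BERT: `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`.
- `-f, --file`: JSON file to evaluate; see `test.json` for the JSON format. The keys `question`, `1`, `2`, `3`, `4` are required; `answer` is optional.
The `_` in `question` will be replaced with `[MASK]`.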
    {
      "1": {
        "question": "The teacher had me _ scales several times a day.",
        "answer": "play",
        "1": "play",
        "2": "to play",
        "3": "played",
        "4": "playing"
      },
      "2": {
        "question": "The teacher had me _ scales several times a day.",
        "1": "play",
        "2": "to play",
        "3": "played",
        "4": "playing"
      }
    }
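As a rough sketch of how such a file could be prepared for the model, the code below parses the JSON, checks the required keys, and swaps the blank for `[MASK]`. The function `load_problems`, the output field names, and the choice to replace only the first `_` are assumptions made for this example; the package's actual loading code may differ.

```python
import json

REQUIRED_KEYS = ("question", "1", "2", "3", "4")


def load_problems(text):
    """Parse problems from a JSON string and mask the blank in each question."""
    problems = json.loads(text)
    prepared = {}
    for pid, item in problems.items():
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"problem {pid} is missing keys: {missing}")
        prepared[pid] = {
            # Assumption: one blank per question, so only the first '_' is replaced.
            "masked": item["question"].replace("_", "[MASK]", 1),
            "choices": [item[key] for key in ("1", "2", "3", "4")],
            "answer": item.get("answer"),  # optional; None when absent
        }
    return prepared


sample = """
{"1": {"question": "The teacher had me _ scales several times a day.",
       "answer": "play",
       "1": "play", "2": "to play", "3": "played", "4": "playing"}}
"""

prepared = load_problems(sample)
print(prepared["1"]["masked"])
# The teacher had me [MASK] scales several times a day.
```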
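Author
- Tae Hwan Jung (Jeff Jung) @Graykode, Kyung Hee University (Undergraduate).
- Author email: nlkey2022@gmail.com
Thanks to Hwan Suk Gang (Kyung Hee University) for collecting the dataset (7114 items).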