BERT multi-class text classification in Google Colab

Posted 2024-06-26 01:34:09


I am working with a set of social media comments (including YouTube links) as input features and Myers-Briggs personality profiles as target labels:

    type    posts
0   INFJ    'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1   ENTP    'I'm finding the lack of me in these posts ver...
2   INTP    'Good one _____ https://www.youtube.com/wat...
3   INTJ    'Dear INTP, I enjoyed our conversation the o...
4   ENTJ    'You're fired.|||That's another silly misconce...

However, from what I have found, BERT expects the DataFrame to be in this format:

^{pr2}$

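The output must be a prediction for each comment, split into four columns, one per personality trait; for example, "Mind" = 1 is the label for an extrovert. In other words, a type such as INFJ is split into "Mind", "Energy", "Nature" and "Tactics", like this: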

    type    post                                                Mind    Energy  Nature  Tactics
0   INFJ    'url-web                                            0       1       0       1
1   INFJ    url-web                                             0       1       0       1
2   INFJ    enfp and intj moments url-web sportscenter n...     0       1       0       1
3   INFJ    What has been the most life-changing experienc...   0       1       0       1
4   INFJ    url-web url-web On repeat for most of today.        0       1       0       1
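
For illustration, the reshaping described above could be sketched as follows. This is a minimal sketch under assumptions, not a definitive implementation: the letter-to-label mapping is inferred from the example rows (INFJ gives 0, 1, 0, 1, and "Mind" = 1 means extrovert), posts are assumed to be separated by "|||" as in the first table, and the names encode_type, explode_posts, TRAITS and URL_PATTERN are hypothetical helpers, not part of any library.

```python
import re

# Assumed mapping, inferred from the example rows (INFJ -> 0, 1, 0, 1):
# each trait is 1 when the type has the listed letter at that position.
TRAITS = (("Mind", "E"), ("Energy", "N"), ("Nature", "T"), ("Tactics", "J"))

# Simple URL matcher used to replace links with the 'url-web' placeholder.
URL_PATTERN = re.compile(r"https?://\S+")


def encode_type(mbti_type):
    """Turn a four-letter type such as 'INFJ' into four 0/1 labels."""
    letters = mbti_type.strip().upper()
    if len(letters) != 4:
        raise ValueError(f"expected a four-letter type, got {mbti_type!r}")
    return {name: int(letter == positive)
            for (name, positive), letter in zip(TRAITS, letters)}


def explode_posts(mbti_type, posts):
    """Split a '|||'-separated string into one labelled row per post."""
    labels = encode_type(mbti_type)
    rows = []
    for post in posts.split("|||"):
        # Replace links after splitting so a URL never swallows the separator.
        text = URL_PATTERN.sub("url-web", post).strip()
        if text:
            rows.append({"type": mbti_type, "post": text, **labels})
    return rows
```

With pandas, the rows collected for every user could then be passed to pd.DataFrame(rows) to get a table shaped like the one above.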

I installed pytorch-pretrained-bert with:

!pip install pytorch-pretrained-bert

I imported the model and tried to tokenize the "posts" column with:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenized_train = tokenizer.tokenize(train)

But I get the following error:

TypeError: ord() expected a character, but string of length 5 found

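I tried this based on the pytorch-pretrained-BERT GitHub repo and a YouTube video.

I am a data science intern with no deep learning experience at all. I just want the simplest possible way to try out a BERT model for predicting multi-class classification output, so that I can compare the results with the simpler text classification models we are currently working on. I am working in Google Colab, and the resulting output should be a .csv file.

I know this is a complex model and that all the documentation and examples around it are complex (fine-tuning layers and so on), but any help aimed at a beginner with minimal software engineering experience would be much appreciated.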


1 Answer

User

#1 · Posted 2024-06-26 01:34:09

I suggest you start with a simple BERT classification task, for example by following this excellent tutorial: https://mccormickml.com/2019/07/22/BERT-fine-tuning/

You can then move on to multi-label classification with: https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d

Only then would I recommend attempting your task on your own dataset.
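
As for the TypeError in the question, a likely cause (an assumption, since the full traceback is not shown) is that tokenizer.tokenize expects a single string but was given the whole train DataFrame. The tokenizer walks its input one character at a time, and iterating over a DataFrame yields column names such as "posts" (five characters) rather than characters. A minimal sketch of tokenizing one post at a time instead; tokenize_posts is a hypothetical helper, and str.split stands in for the BERT tokenizer here so the snippet runs without downloading a model:

```python
def tokenize_posts(posts, tokenize):
    """Apply a string-level tokenizer to every post, one string at a time."""
    return [tokenize(str(post)) for post in posts]


# Stand-in tokenizer for illustration only; in Colab you would pass
# tokenizer.tokenize from BertTokenizer instead of str.split.
sample_posts = ["You're fired.", "Dear INTP, I enjoyed our conversation"]
tokenized = tokenize_posts(sample_posts, str.split)
```

In the notebook from the question this would be called as tokenize_posts(train["posts"], tokenizer.tokenize), assuming train is the DataFrame shown there.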
