How do I proceed after annotating text data for ML?

{"annotatable":{"parts":["s1p1"]}, "anncomplete":true, "sources":[], "metas":{}, "entities":[{"classId":"e_1","part":"s1p1","offsets": [{"start":11,"text":"This is the text"}],"coordinates":[],"confidence": {"state":"pre-added","who":["user:1"],"prob":1},"fields":{"f_4": {"value":"3","confidence":{"state":"pre-added","who": ["user:1"],"prob":1}}},"normalizations":{}},"normalizations":{}}], "relations":[]}

1 Answer

User

#1 · Posted 2024-09-18 14:35:58

So, suppose you have a JSON file in which each label is indexed by the corresponding line of the original txt file:

{
  "0": "politics",
  "1": "sports",
  "2": "weather"
}

And a txt file containing the raw texts with the corresponding indices:

0 The American government has launched ... today.
1 FC Barcelona has won ... the country.
2 The forecast looks ... okay.

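Then, first of all, before you go on to featurize the text and build a machine learning model, you do need to join the examples with their labels. If your examples (as in mine) are aligned by index, ID, or any other identifying information, you can do the following: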

import json

with open('labels.json') as json_file:
    labels = json.load(json_file)
    # This results in a Python dictionary where you can look up a label given an index.
    # Note that JSON object keys are loaded as strings, e.g. labels['0'].

with open('raw.txt') as txt_file:
    raw_texts = txt_file.readlines()
    # This results in a list where you can retrieve the raw text by index like this: raw_texts[index].

Now you can match the raw texts with their labels. For ease of use, you may want to put them in a DataFrame (assuming they are currently in the same order):

import pandas as pd

data = pd.DataFrame(
    {'label': list(labels.values()),
     'text': raw_texts
    })

#    label      text
# 0  politics   Sentence_1
# 1  sports     Sentence_2
# 2  weather    Sentence_3
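
If the two files are not guaranteed to be in the same order, the texts can instead be joined to their labels by the explicit index at the start of each line. The following is only a minimal sketch under assumptions: the helper name join_texts_with_labels is hypothetical, and it assumes each line of the txt file starts with its index followed by a single space, as in the example above.

```python
import json


def join_texts_with_labels(label_json, raw_lines):
    """Pair each raw text with its label using the index prefix of the line."""
    labels = json.loads(label_json)  # JSON object keys are always strings
    rows = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # "0 The American ..." -> index "0", text "The American ..."
        index, _, text = line.partition(" ")
        if index in labels:  # ignore lines that have no label
            rows.append({"index": index, "label": labels[index], "text": text})
    return rows


# Inlined sample data taken from the example above (normally read from the files).
label_json = '{"0": "politics", "1": "sports", "2": "weather"}'
raw_lines = [
    "0 The American government has launched ... today.\n",
    "1 FC Barcelona has won ... the country.\n",
    "2 The forecast looks ... okay.\n",
]

rows = join_texts_with_labels(label_json, raw_lines)
for row in rows:
    print(row["index"], row["label"], row["text"])
```

The resulting list of dictionaries can be passed directly to pd.DataFrame(rows) if you prefer to keep working with a DataFrame.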

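Now you can use any of a number of machine learning libraries, but the one I would definitely recommend for beginners is scikit-learn. Its tutorial does a good job of explaining how to turn raw text strings into features that machine learning can use: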

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files
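
As a rough illustration of that feature-extraction step, here is a small sketch that turns the three example sentences into a TF-IDF matrix. It uses TfidfVectorizer, which combines the counting and TF-IDF weighting steps the tutorial walks through separately; the variable names are my own and the default settings are an assumption, not a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The three example sentences from above, without their index prefix.
texts = [
    "The American government has launched ... today.",
    "FC Barcelona has won ... the country.",
    "The forecast looks ... okay.",
]

# Learn the vocabulary and convert each text into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One row per text, one column per distinct token in the vocabulary.
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```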

And then, how to train a classifier on those features:

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier
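
To give an idea of what training looks like, the sketch below chains a TF-IDF vectorizer and a multinomial naive Bayes classifier (the model type used in the linked tutorial) in a scikit-learn Pipeline. The three training sentences are the examples from above and the sentence being classified is made up for illustration; a real model would need far more labelled data than this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training set: the three example texts and their labels.
train_texts = [
    "The American government has launched ... today.",
    "FC Barcelona has won ... the country.",
    "The forecast looks ... okay.",
]
train_labels = ["politics", "sports", "weather"]

# Vectorize the text and fit the classifier in one step.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)

# Classify a new, hypothetical sentence.
predicted = clf.predict(["The government announced a new policy."])
print(predicted[0])
```

Keeping the vectorizer and the classifier in one Pipeline means new text is transformed with the same vocabulary that was learned during training.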

The DataFrame I showed above should be a good starting point for trying out these scikit-learn techniques.
