Filtering spaCy training data based on NER entity labels

Posted 2024-06-28 20:34:44


I have NER training data for spaCy in the following format:

[('Christmas Perot 2021 TSO\nSkip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM\nPOPS I Christmas at The Perot\nCLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401\nA Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.\nDon’t miss seeing the winner of TSO’s 11th Annual Celebrity Conductor Competition\nBack to Events 2019 Texarkana Symphony Orchestra',
  {'entities': [(375, 399, 'organization'),
    (290, 318, 'organization'),
    (220, 242, 'production_name'),
    (169, 186, 'performance_date'),
    (189, 202, 'auditorium'),
    (205, 212, 'performance_starttime'),
    (409, 428, 'organization')]})]

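The text is the first element of the tuple. In the entities, the numbers are the character positions (start and end) of each entity within the text. Some lines contain no entities at all; for example, the first line, Christmas Perot 2021 TSO, has none. I need to remove the sentences that have no entities. Sentences can be split on the . and \n characters. I managed to extract the entity data from the character offsets, but I have not managed to remove the sentences that carry no labels.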
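To illustrate how the offsets relate to the text, slicing the text with each (start, end) pair should recover the entity string. The following is only a minimal sketch on a shortened, hypothetical sample (two lines taken from the record above, with the offset recomputed for the shorter text), not the full training record:

```python
# Hypothetical shortened sample in the same (text, {"entities": [...]}) format
sample = (
    "Christmas Perot 2021 TSO\nPOPS I Christmas at The Perot",
    {"entities": [(32, 54, "production_name")]},
)

text, annot = sample

# Offsets are character positions: start is inclusive, end is exclusive
spans = [(text[start:end], label) for start, end, label in annot["entities"]]
print(spans)
```

The first line of the sample has no offset pair pointing into it, which is exactly the kind of line the question wants to drop.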

Code

from tqdm import tqdm
import spacy
from spacy.tokens import DocBin  # DocBin was used below but never imported

nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object
for text, annot in tqdm(train_data):  # train_data: list in the format shown above
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        print(start, end, span, label)
        if span is None:
            # char_span returns None when the offsets do not align with token boundaries
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)  # store the annotated doc (assumed intent; the original never used db)

1 answer
User
#1 · Posted 2024-06-28 20:34:44

How about this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import numpy as np

foo = \
    [('''Christmas Perot 2021 TSO
Skip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM
POPS I Christmas at The Perot
CLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401
A Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.
Don’t miss seeing the winner of TSO’s 11th Annual Celebrity Conductor Competition
Back to Events 2019 Texarkana Symphony Orchestra''',
     {'entities': [
    (375, 399, 'organization'),
    (290, 318, 'organization'),
    (220, 242, 'production_name'),
    (169, 186, 'performance_date'),
    (189, 202, 'auditorium'),
    (205, 212, 'performance_starttime'),
    (409, 428, 'organization'),
    ]})]

text, annotations = foo[0]
print(text)

# Split on '.' and '\n'; re.split drops the delimiter itself
sentences = re.split(r'\.|\n', text)

# Each sentence occupies len(sentence) + 1 characters of the original text
# (the +1 is the dropped delimiter), so the running sum is the exclusive
# end offset of each sentence in the original text
cumulative_sentence_length = np.cumsum([len(s) + 1 for s in sentences])
bounds = [0, *cumulative_sentence_length]

pick_indices = set()

for e in annotations['entities']:
    # only pick the first sentence whose end lies beyond the entity start (→ second [0])
    idx = np.where(e[0] < cumulative_sentence_length)[0][0]
    print('\n\nIndex:', idx, 'Entity:', e, 'Range:',
          [bounds[idx], bounds[idx + 1]], '\nSentence:', sentences[idx])
    pick_indices.add(idx)

print(pick_indices)
# sort the indices so the sentences come out in document order
print('\n'.join(sentences[i] for i in sorted(pick_indices)))

The output is the four sentences at indices {2, 3, 4, 7}. The idea is:

  1. Split the text into sentences
  2. Accumulate the sentence lengths (counting the delimiter that the split removes, so the offsets stay aligned with the original text)
  3. Check which range the entity's start index falls into (and pick only the first matching index)
  4. (Optional) Use the entity's end index as a sanity check

Take a look at the cumulative_sentence_length variable: it contains [ 25 148 213 243 326 330 335 476 477 559 608], the exclusive upper bounds of the sentence intervals in the original text
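The optional sanity check from step 4 could look roughly like the sketch below. It is an illustration only: the text and entities are made up, `sentence_index` is a hypothetical helper, and `ends` is assumed to count the delimiter that `re.split` drops, so that the bounds line up with offsets in the original text.

```python
import re
import numpy as np

def sentence_index(offset, ends):
    """Index of the sentence containing the character at `offset`."""
    # ends[i] is the exclusive end offset of sentence i (delimiter included)
    return int(np.searchsorted(ends, offset, side="right"))

# Made-up example text and entities, not the record from the question
text = "First line\nAlpha Corp met Beta Ltd. Nothing here"
sentences = re.split(r"\.|\n", text)
ends = np.cumsum([len(s) + 1 for s in sentences])

entities = [(11, 21, "organization"), (26, 34, "organization"), (5, 16, "bad_span")]
for start, end, label in entities:
    first = sentence_index(start, ends)
    last = sentence_index(end - 1, ends)  # end is exclusive, so test the last character
    status = "ok" if first == last else "crosses a sentence boundary"
    print(label, (start, end), status)
```

An entity whose first and last characters land in different sentences usually points to a wrong offset or to a delimiter inside the entity (a '.' in an abbreviation, for example), so it is worth inspecting before filtering.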

Since you are working on a data science topic, I assume that using numpy is not an obstacle for you
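The answer stops at selecting the sentences; to feed the result back into spaCy, the entity offsets would also have to be shifted to match the shortened text. A possible way to do that is sketched below without numpy. `keep_labelled_sentences` is a hypothetical helper, the sample text is made up, and the sketch assumes that each entity lies entirely inside one sentence (entities that span a delimiter are silently dropped).

```python
import re

def keep_labelled_sentences(text, entities, pattern=r"\.|\n"):
    """Drop sentences without entities and shift offsets to the new text."""
    # Record (start, end) of every sentence in the original text
    bounds, pos = [], 0
    for m in re.finditer(pattern, text):
        bounds.append((pos, m.end()))  # keep the delimiter with its sentence
        pos = m.end()
    bounds.append((pos, len(text)))  # trailing sentence without a delimiter

    new_text, new_entities = "", []
    for s_start, s_end in bounds:
        # entities that lie completely inside this sentence
        inside = [e for e in entities if s_start <= e[0] and e[1] <= s_end]
        if not inside:
            continue  # no labels here, so the sentence is removed
        shift = len(new_text) - s_start
        new_entities += [(a + shift, b + shift, label) for a, b, label in inside]
        new_text += text[s_start:s_end]
    return new_text, {"entities": new_entities}

# Made-up example in the same format as the training data
text = "Header line\nAlpha Corp plays at 4:00 PM\nFooter"
entities = [(12, 22, "organization"), (32, 39, "performance_starttime")]
new_text, annot = keep_labelled_sentences(text, entities)
print(repr(new_text), annot)
```

Keeping the delimiter attached to its sentence means the kept sentences stay separated in the new text, and the shift is simply the difference between where a sentence used to start and where it starts now.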
