Filtering spaCy training data based on NER entity labels

[('Christmas Perot 2021 TSO\nSkip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM\nPOPS I Christmas at The Perot\nCLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401\nA Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.\nDon’t miss seeing the winner of TSO’s 11th Annual Celebrity Conductor Competition\nBack to Events 2019 Texarkana Symphony Orchestra', {'entities': [(375, 399, 'organization'), (290, 318, 'organization'), (220, 242, 'production_name'), (169, 186, 'performance_date'), (189, 202, 'auditorium'), (205, 212, 'performance_starttime'), (409, 428, 'organization')]})]

from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object
for text, annot in tqdm(train_data):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        print(start, end, span, label)
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents

1 answer

User

#1 · Posted 2024-06-28 20:34:44

How about this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import numpy as np

foo = \
    [('''Christmas Perot 2021 TSO
Skip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM
POPS I Christmas at The Perot
CLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401
A Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.
Don’t miss seeing the winner of TSO’s 11th Annual Celebrity Conductor Competition
Back to Events 2019 Texarkana Symphony Orchestra''',
     {'entities': [
    (375, 399, 'organization'),
    (290, 318, 'organization'),
    (220, 242, 'production_name'),
    (169, 186, 'performance_date'),
    (189, 202, 'auditorium'),
    (205, 212, 'performance_starttime'),
    (409, 428, 'organization'),
    ]})]

print(foo[0][0])
sentences = re.split(r'\.|\n', foo[0][0])
sentence_lengths = list(map(len, sentences))

cumulative_sentence_length = np.cumsum(sentence_lengths) - 1

pick_indices = set()

entities = foo[0][1]['entities']

for e in entities:
    # only pick the first index (→ second [0])
    idx = np.where(e[0] < cumulative_sentence_length)[0][0]
    print('\n\nIndex:', idx, 'Entity:', e, 'Range:', [
        [0, *cumulative_sentence_length][idx],
        [0, *cumulative_sentence_length][idx+1]
    ], '\nSentence:', sentences[idx])
    pick_indices.add(idx)

print(pick_indices)
print('\n'.join([sentences[i] for i in pick_indices]))

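The output is the sentences at zero-based indices {2, 3, 4, 7}. The idea is:

Split the text into sentences
Accumulate the sentence lengths
Check which range each entity's start index falls into (and pick only the first matching index)
(Optional) Use the entity's end index as a sanity check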
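
The optional end-index check can be sketched as follows. This is a minimal variant under stated assumptions, not the code from the answer: the helper names `sentence_spans` and `sentences_with_entities` are made up for illustration, and the spans are taken from `re.finditer` so that the delimiter characters are counted. Summing the lengths of the split pieces leaves the delimiters out, so those bounds fall one character further behind the true offsets with every sentence.

```python
import re


def sentence_spans(text):
    """Return (start, end) character spans of the pieces between '.' or newline."""
    spans, start = [], 0
    for match in re.finditer(r"\.|\n", text):
        spans.append((start, match.start()))
        start = match.end()  # skip the delimiter itself
    spans.append((start, len(text)))
    return spans


def sentences_with_entities(text, entities):
    """Return the indices of sentences that fully contain at least one entity."""
    spans = sentence_spans(text)
    picked = set()
    for ent_start, ent_end, label in entities:
        for idx, (sent_start, sent_end) in enumerate(spans):
            # Sanity check: both the start and the end must lie in the sentence.
            if sent_start <= ent_start and ent_end <= sent_end:
                picked.add(idx)
                break
        else:
            print("Entity crosses a sentence boundary:", (ent_start, ent_end, label))
    return picked, spans


# Toy data (hypothetical, much shorter than the example in the question).
text = "Menu. Join the TSO tonight\nBuy tickets at 870.773.3401"
entities = [(15, 18, "organization")]
picked, spans = sentences_with_entities(text, entities)
print(picked, [text[s:e] for s, e in spans])
```

The spans line up one-to-one with the pieces returned by `re.split(r'\.|\n', text)`, so the picked indices can be used with either.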

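Take a look at the cumulative_sentence_length variable: it contains the values [ 23 145 209 238 320 323 327 467 467 552 600], which are the upper bounds of the sentence intervals. The last two values come from a run in which each apostrophe was stored as three characters; with a single ’ character they are 548 and 596.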
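
The lookup against these upper bounds can be reproduced by hand. The sketch below uses the standard-library `bisect` module in place of `np.where`, which is a substitution for illustration rather than the answer's code; the variable names `upper_bounds` and `entity_starts` are likewise only illustrative.

```python
from bisect import bisect_right

# Upper bounds quoted in the answer (cumulative sentence lengths minus one).
upper_bounds = [23, 145, 209, 238, 320, 323, 327, 467, 467, 552, 600]
# Entity start offsets from the example annotation, in their original order.
entity_starts = [375, 290, 220, 169, 189, 205, 409]

# bisect_right returns the first position whose bound is greater than the
# start offset, the same index that np.where(start < bounds)[0][0] selects.
indices = [bisect_right(upper_bounds, start) for start in entity_starts]
print(indices)               # [7, 4, 3, 2, 2, 2, 7]
print(sorted(set(indices)))  # [2, 3, 4, 7]
```

One difference to be aware of: for a start offset beyond the last bound, `bisect_right` returns `len(upper_bounds)`, whereas the `np.where(...)[0][0]` form raises an IndexError.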

Since you are working on a data science topic, I assume that using numpy is not an obstacle for you.
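
If the selected pieces are meant to go back into spaCy as training data, the entity offsets have to be shifted to match the shortened text. The following is a rough sketch of that step using only the standard library. The function name `filter_example` is hypothetical, and it assumes that splitting on newlines alone is acceptable (this avoids cutting strings such as 870.773.3401 at the periods) and that entities spanning more than one line can be dropped.

```python
def filter_example(text, entities):
    """Keep only the lines that contain an entity and re-base the offsets.

    text:     the raw training text
    entities: list of (start, end, label) character spans into text
    Returns (new_text, new_entities) in the same format.
    """
    kept_lines, new_entities = [], []
    pos = 0      # start offset of the current line in the original text
    new_pos = 0  # start offset the next kept line will have in the new text
    for line in text.split("\n"):
        line_start, line_end = pos, pos + len(line)
        inside = [(s, e, label) for s, e, label in entities
                  if line_start <= s and e <= line_end]
        if inside:
            shift = new_pos - line_start
            new_entities.extend((s + shift, e + shift, label)
                                for s, e, label in inside)
            kept_lines.append(line)
            new_pos += len(line) + 1  # +1 for the "\n" that join() re-inserts
        pos = line_end + 1  # +1 for the "\n" removed by split()
    return "\n".join(kept_lines), new_entities


# Toy data (hypothetical): only the middle line carries an entity.
text = "menu home about\nJoin the TSO tonight\nfooter links"
entities = [(25, 28, "organization")]
new_text, new_entities = filter_example(text, entities)
print(repr(new_text), new_entities)
```

The returned pair has the same (text, {'entities': [...]}) shape once the list is wrapped in a dict, so it could be passed through the `doc.char_span` loop from the question unchanged.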
