Mapping predictions from partially overlapping batches

Posted 2024-10-04 11:31:15


I am using pytorch_pretrained_bert for NER, based on the implementation provided here: https://github.com/kamalkraj/BERT-NER

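When predicting on longer texts, the predict function in bert.py raises an index-out-of-range error. My current understanding is that this is caused by the number of BERT tokens in the input text (which is not necessarily the same as, and in many cases differs from, the number of tokens returned by e.g. nltk.word_tokenize()). If that number exceeds 512 (the default maximum length of the pretrained model, if I understand correctly), the error is raised; shorter inputs work fine. I therefore decided to implement a wrapper that, whenever there are more than 512 BERT tokens (i.e. what BERT's own tokenizer returns), generates partially overlapping batches. The overlap is needed because simply cutting the text into chunks and padding the last one (my token count is unlikely to be divisible by 512, so the final, shorter batch needs padding) loses context at the chunk boundaries. This means, however, that I have to predict batch by batch and then stitch the results back together.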
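To make the windowing step concrete, here is a minimal sketch of cutting a token list into partially overlapping windows. The function name split_into_windows and the parameters window and stride are hypothetical and are not part of BERT-NER or pytorch_pretrained_bert. Note that BERT's 512-token limit normally includes the special [CLS] and [SEP] tokens, so the usable window is probably a little smaller than 512.

```python
def split_into_windows(tokens, window=512, stride=256):
    """Split `tokens` into overlapping windows (hypothetical helper).

    Returns a list of (start_index, window_tokens) pairs. Consecutive
    windows start `stride` tokens apart, so neighbours share
    `window - stride` tokens. The last window may be shorter than
    `window`; padding it is left to the caller.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be between 1 and window")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + window]))
        # Stop once the current window reaches the end of the input.
        if start + window >= len(tokens):
            break
        start += stride
    return windows


if __name__ == "__main__":
    # Same layout as the toy example: 8 tokens, windows of 6, stride 2.
    for start, chunk in split_into_windows(list(range(8)), window=6, stride=2):
        print(start, chunk)
```

Keeping the start index with each window means the predictions can later be placed back at their absolute token positions.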

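Cutting the input into batches (of length 512) is straightforward. I am now working on the part that recombines the predictions. For that I wrote the following code: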

def map_batches(batches, blen):
    master = []
    for i in range(len(batches)-1):
        appendbatch = []
        # if first round, append first bit of current batch with only one prediction
        if i == 0:
            appendbatch = batches[0][:int(blen/2)-1]
        thisbatch = batches[i]
        nextbatch = batches[i+1]
        offset = int(blen/2)-1 # think -1 is needed for zero offsetting, but not sure, test/debug!
        for j in range(offset, blen-offset+2): # and debug +2 here too!
            thistuple = thisbatch[j]
            nexttuple = nextbatch[j-offset]
            if thistuple[1] > nexttuple[1]:
                thisbatch[j] = thistuple
            else:
                thisbatch[j] = nexttuple
        appendbatch.extend(thisbatch[int(blen/2)-1:])
        # and at end add last bit of last batch with only one prediction
        if i == len(batches)-2:
            appendbatch.extend(nextbatch[int(blen/2)+1:])
        master.append(appendbatch)
    return master

blen = 6
a = [('I-LOC', 0.1),
     ('O', 0.2),
     ('I-LOC', 0.5),
     ('O', 0.2),
     ('I-LOC', 0.9),
     ('O', 0.1)]

b = [('B-LOC', 0.6),
     ('O', 0.7),
     ('I-LOC', 0.8),
     ('O', 0.9),
     ('O', 0.99),
     ('O', 0.99)]
correct = [[('I-LOC', 0.1), ('O', 0.2), ('B-LOC', 0.6), ('O', 0.7), ('I-LOC', 0.9), ('O', 0.9), ('O', 0.99), ('O', 0.99)]]
batches = [a, b]
master = map_batches(batches, blen)
print('mapped: ', master)
print('correct:', correct)

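The function map_batches should do the job. The input lists only partially overlap: for the first two items of list a, I have only one prediction (the one from a itself). For the next 4 tokens I have two predictions each, from a[2:] and b[:4]. For the last 2 tokens I again only have b. The function should stitch these together, keeping for each token whichever prediction has the highest likelihood (the numbers here are made up; in the real code they are of course the confidence scores of the labels).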

Checked against my toy example a and b, it works, i.e. the output of the code above is:

mapped: [[('I-LOC', 0.1), ('O', 0.2), ('B-LOC', 0.6), ('O', 0.7), ('I-LOC', 0.9), ('O', 0.9), ('O', 0.99), ('O', 0.99)]]
correct: [[('I-LOC', 0.1), ('O', 0.2), ('B-LOC', 0.6), ('O', 0.7), ('I-LOC', 0.9), ('O', 0.9), ('O', 0.99), ('O', 0.99)]]

Sorry for the long introduction.

Now to the questions:

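First, this was put together rather quickly. While it works for my toy example, I am a bit worried that it may not be correct in all cases (different batch sizes etc.), hence the debug/test comments; obviously I plan to debug it some more. Still, any thoughts or comments on this would be welcome.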
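One way to sidestep the index arithmetic that the debug comments worry about is to place every prediction at its absolute token position and keep the best score per position. The sketch below is an alternative under stated assumptions, not a fix of map_batches itself: merge_windows and stride are hypothetical names, window k is assumed to start at token k * stride, and the result is a flat list rather than the nested list that map_batches returns.

```python
def merge_windows(windows, stride):
    """Merge (label, score) predictions from overlapping windows.

    Assumes windows[k] holds predictions for consecutive tokens starting
    at absolute position k * stride. For each position the prediction
    with the highest score is kept; on ties the later window wins, which
    mirrors the comparison used in map_batches.
    """
    merged = []
    for k, window in enumerate(windows):
        start = k * stride
        for offset, pred in enumerate(window):
            pos = start + offset
            if pos == len(merged):
                # First prediction seen for this token.
                merged.append(pred)
            elif pos < len(merged):
                # Token already covered by an earlier window: keep the best.
                if pred[1] >= merged[pos][1]:
                    merged[pos] = pred
            else:
                raise ValueError("windows leave a gap in the token sequence")
    return merged


if __name__ == "__main__":
    a = [('I-LOC', 0.1), ('O', 0.2), ('I-LOC', 0.5),
         ('O', 0.2), ('I-LOC', 0.9), ('O', 0.1)]
    b = [('B-LOC', 0.6), ('O', 0.7), ('I-LOC', 0.8),
         ('O', 0.9), ('O', 0.99), ('O', 0.99)]
    print(merge_windows([a, b], stride=2))
```

Because each window is only ever compared with what has already been merged, the same loop handles any number of windows, a shorter final window, and overlaps in which a token is covered by more than two windows.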

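Second, and I guess this is the real question: I suspect this problem must have been solved by someone long ago (and probably in a better, more efficient way). Does anyone know of an implementation in keras, tensorflow, numpy, sklearn, or any of the other usual suspects?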
