How can I dynamically infer word boundaries in raw text when preprocessing documents?

Posted 2024-09-21 01:17:57


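I have researched this, and what I found mostly concerns word boundaries within a sentence or, at most, suggests using a tokenizer, which is not what I am looking for. My problem is as follows: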

My current task is to preprocess unstructured data through the following pipeline: a PDF is converted to a TXT file, which yields sentences like this:

s e ar c h t h i s s t r ing for a def e c t

What I actually want is:

search this string for a defect

I am looking for possible NLP approaches to handle this kind of scenario. Thanks in advance!


1 answer
User
#1 · Posted 2024-09-21 01:17:57

Use this file as the word list (saved as words-by-frequency.txt, which is the name the code below reads).
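The word list is expected to be sorted from most to least frequent, because the code that follows derives each word's cost from its rank alone. A minimal sketch of that cost, assuming Zipf's law (the probability of the word at rank r is roughly 1 / (r * ln N) for a list of N words); the helper name `zipf_cost` is hypothetical and not part of the answer's code:

```python
from math import log

def zipf_cost(rank, vocab_size):
    """Cost of the word at 1-based `rank` in a list of `vocab_size` words.

    Assumes Zipf's law, p(rank) ~ 1 / (rank * ln(vocab_size)),
    so cost = -ln(p) = ln(rank * ln(vocab_size)).
    """
    return log(rank * log(vocab_size))

# Rarer words (higher rank) get a higher cost.
for rank in (1, 10, 1000):
    print(rank, round(zipf_cost(rank, 100_000), 2))
```

Since the cost is a negative log-probability, adding the costs of the words in a candidate split corresponds to multiplying their probabilities, so the cheapest split is the most probable one under this simple model.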

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
# The file is assumed to list words from most to least frequent.
with open("words-by-frequency.txt") as f:
    words = f.read().split()
wordcost = {k: log((i + 1) * log(len(words))) for i, k in enumerate(words)}
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], float("inf")), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

s = 's e ar c h t h i s s t r ing for a def e c t'.replace(' ','')
print(infer_spaces(s))
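To try the idea without downloading the word list, here is a self-contained sketch of the same dynamic-programming approach. It assumes a tiny hypothetical vocabulary (`VOCAB`) in place of words-by-frequency.txt, and the function name `segment` is also made up for illustration; a real run needs a much larger frequency-ordered list.

```python
from math import log

# Hypothetical mini vocabulary, ordered from most to least frequent.
# It stands in for words-by-frequency.txt and is for illustration only.
VOCAB = ["a", "for", "this", "search", "string", "defect",
         "sear", "his", "in", "def"]

def segment(text, vocab=VOCAB):
    """Split `text` (no spaces) into the lowest-cost sequence of vocab words."""
    n = len(vocab)
    # Zipf-style cost: words later in the list are rarer and cost more.
    costs = {w: log((rank + 1) * log(n)) for rank, w in enumerate(vocab)}
    longest = max(len(w) for w in vocab)

    # best[i] = (cost of the best split of text[:i], length of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        best.append(min(
            (best[i - k][0] + costs.get(text[i - k:i], float("inf")), k)
            for k in range(1, min(i, longest) + 1)
        ))

    # Walk backwards through the stored word lengths to recover the words.
    words = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        words.append(text[i - k:i])
        i -= k
    return " ".join(reversed(words))

garbled = "s e ar c h t h i s s t r ing for a def e c t"
print(segment(garbled.replace(" ", "")))  # search this string for a defect
```

Unlike the code above, this variant stores the length of the last word alongside each cost, so the backtracking step does not need to recompute the best match. Characters that cannot be covered by any vocabulary word get an infinite cost, so the output is only meaningful when the vocabulary covers the text.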
