Detecting quoted text in sentences

Posted 2024-06-26 12:40:11


My sentences contain quoted text, for example:

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?

I tried to mask the quoted parts with a regular expression, but it is not accurate. For example, for the last sentence:

import re

txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))

The output is:

Reread these sentences: "<quote>" mean?

Instead, it should be:

Reread these sentences: "<quote>" What does the word "courtship" mean?

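Since I have more than 10k instances, it is hard to find one general regex pattern that works for every case.

My question is whether there is any library (perhaps based on a neural network) or method that solves this problem.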
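
For reference, the mismatch in the question appears to come from greedy matching: `.{20,}` runs on to the last double quote in the line. The sketch below is not part of the original post. The negated character class `[^"]` is an assumed alternative, and the variable names are only illustrative. It also shows why keeping the lookarounds is risky: they do not consume the quote characters, so the text between two quoted spans can match as well.

```python
import re

txt = ('Reread these sentences: "This was his courtship, and it lasted '
       'all through the summer." What does the word "courtship" mean?')

# Greedy: .{20,} extends to the last position that is still followed by a quote.
greedy = re.search(r'(?<=").{20,}(?=")', txt).group()

# Bounded: [^"] cannot cross a quote, so the match stops at the first closing quote.
bounded = re.search(r'(?<=")[^"]{20,}(?=")', txt).group()

# Lookarounds leave the quotes unconsumed, so the 20 characters between the two
# quoted spans also sit between two quote characters and are replaced too.
with_lookarounds = re.sub(r'(?<=")[^"]{20,}(?=")', "<quote>", txt)

# Consuming the quotes inside the pattern avoids that.
consuming = re.sub(r'"[^"]{20,}"', '"<quote>"', txt)

print(greedy)
print(bounded)
print(with_lookarounds)
print(consuming)
```

Only the last variant produces the output the question asks for on this sentence; it has not been checked against the full 10k instances.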


2 answers

Another approach is to use a technique entirely different from regex: shlex.

The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages, (for example, in run control files for Python applications) or for parsing quoted strings.

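shlex.split takes quotes into account when it splits a string into words, and passing the optional argument posix=False keeps the quote characters in the resulting tokens. From its output you can build a string like the one you describe.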

import shlex

lines = [
'Why did the author use three sentences in a row that start with the words, "it spun"?',
'Why did the queen most likely say  “I would have tea instead.”',
'Why did the fdsfdsf repeat the phrase "he waited" so many times?',
'Why were "the lights of his town growing smaller below them"?',
'What is a fdsfdsf for the word "adjust"?',
'Reread this: "If anybody had asked trial of answered at once, \'My nose.\'" What is the correct definition of the word "trial" as it is used here?',
'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?',
]
for line in lines:
    print(
        " ".join(
            word
            if word[0] != '"' and word[-1] != '"' else '"<quote>"'
            for word in shlex.split(line, posix=False)
        )
    )

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "<quote>" as it is used here?
Reread these sentences: "<quote>" What does the word "<quote>" mean?
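  • Note 1: shlex does not interpret curly quotes as quotes (see line 2 of the output), so if your data contains curly quotes, convert them to straight quotes with .replace() before feeding each line in.
  • Note 2: this replaces every quoted occurrence. If you only need to replace the first one and keep the rest, you can do the following (I'm fairly sure it could be written better, but take it as a proof of concept):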
for line in lines:
    new_line = []
    quote_count = 0
    for word in shlex.split(line, posix=False):
        if word[0] == '"' and word[-1] == '"':
            if quote_count < 1:
                quote_count += 1
                new_line.append('"<quote>"')
            else:
                new_line.append(word)
        else:
            new_line.append(word)
    print(' '.join(new_line))

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "<quote>" What does the word "courtship" mean?
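
Note 1 above can be sketched as follows. This is a minimal example, not the answerer's code: the helper name `mask_quotes` and the translation table are assumptions introduced for illustration, and `str.translate` is used in place of chained `.replace()` calls.

```python
import shlex

# Map curly double quotes to straight ones so shlex treats them as quote characters.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})

def mask_quotes(line):
    """Replace every double-quoted token in `line` with "<quote>" (hypothetical helper)."""
    normalized = line.translate(CURLY_TO_STRAIGHT)
    tokens = shlex.split(normalized, posix=False)  # posix=False keeps the quote characters
    return " ".join(
        '"<quote>"' if len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"' else tok
        for tok in tokens
    )

print(mask_quotes('Why did the queen most likely say  “I would have tea instead.”'))
print(mask_quotes('Why were "the lights of his town growing smaller below them"?'))
```

As in the outputs above, joining the tokens with single spaces collapses repeated whitespace and leaves a space before trailing punctuation. shlex.split raises ValueError when a quote is never closed, so lines with unbalanced quotes need separate handling.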

For these examples, use

import re
txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)
print(txt)

The code above serves as a Python proof. Use a separate substitution for each type of quotation mark; that is easier to control.

Result

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
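
The result above drops the quotation marks around the placeholder, whereas the question asks for "<quote>" with the marks kept. A possible variant is sketched below under a few assumptions: the helper name `mask_long_quotes` and its `min_len` parameter are invented for illustration, the patterns exclude newlines so an unbalanced quote cannot swallow the following lines, and curly quotes are rewritten as straight ones in the placeholder.

```python
import re

# One pattern per quote type, following the advice above.
STRAIGHT = re.compile(r'"([^"\n]*)"')
CURLY = re.compile(r'“([^“”\n]*)”')

def mask_long_quotes(text, min_len=20):
    """Replace quoted spans of at least `min_len` characters with "<quote>"."""
    def repl(match):
        # Short quotes such as "courtship" are returned unchanged.
        return '"<quote>"' if len(match.group(1)) >= min_len else match.group()
    return CURLY.sub(repl, STRAIGHT.sub(repl, text))

print(mask_long_quotes(
    'Reread these sentences: "This was his courtship, and it lasted all '
    'through the summer." What does the word "courtship" mean?'
))
```

The 20-character threshold mirrors the one used in the question and in this answer; whether it separates full sentences from single quoted words across all 10k instances is something to check against the data.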
