My sentences contain quoted text, for example:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?
I tried to mask the quoted parts with a regular expression, but it is not accurate. For example, with the last sentence:
import re
txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))
The output is:
Reread these sentences: "<quote>" mean?
Instead, it should be:
Reread these sentences: "<quote>" What does the word "courtship" mean?
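Since I have more than 10k instances, it is hard to find one general regex pattern that works for all cases.
My question is whether there is any library (perhaps based on a neural network) or method that solves this problem.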
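Another approach is to use a technique entirely different from regex: shlex.
shlex.split takes quotes into account when splitting a string into words, and setting the optional posix parameter to False keeps the quotes in the result. From its output you can build a string like the one you describe: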
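The answer's original snippet is not preserved here, so the following is only a minimal sketch of the idea. The function name `mask_quotes`, the `min_len` threshold (mirroring the `{20,}` in the question) and the choice to configure a `shlex.shlex` object directly instead of calling `shlex.split` are assumptions made for illustration. The lexer is used only to find the quoted spans; the masking itself is done with `str.replace` so the rest of the sentence keeps its spacing and punctuation.

```python
import shlex

def mask_quotes(sentence, min_len=20, placeholder="<quote>"):
    """Replace long double-quoted spans with a placeholder (illustrative helper)."""
    # shlex does not treat curly quotes as quotes, so normalise them first.
    sentence = sentence.replace("“", '"').replace("”", '"')

    lexer = shlex.shlex(sentence, posix=False)  # non-POSIX mode keeps the quote characters
    lexer.whitespace_split = True  # split on whitespace only, as shlex.split does
    lexer.commenters = ""          # do not treat '#' as the start of a comment
    lexer.quotes = '"'             # assumption: ignore apostrophes such as "author's"

    for token in list(lexer):
        is_quoted = len(token) >= 2 and token[0] == '"' and token[-1] == '"'
        if is_quoted and len(token) - 2 >= min_len:
            # The token is an exact substring of the sentence, so replace it in place.
            sentence = sentence.replace(token, f'"{placeholder}"', 1)
    return sentence

txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(mask_quotes(txt))
# Reread these sentences: "<quote>" What does the word "courtship" mean?
```

Note that shlex raises `ValueError` on a sentence with an unbalanced double quote, so with many instances it may be worth wrapping the call in a `try`/`except` and inspecting the failures separately.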
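shlex does not interpret curly quotes as quotes (as in the second example sentence), so if your data contains curly quotes, you should .replace() them with straight quotes before passing each line in.
Output: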
For these examples, use
See the Python proof. For different types of quotes, use separate commands; that is easier to control.
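The pattern this answer used is not preserved here. As a hedged sketch of the "separate commands" advice, the code below runs one `re.sub` per quote type and uses a negated character class so a match cannot run past a closing quote. The names `mask_quotes_re`, `_mask_if_long` and `MIN_LEN` are hypothetical, and the callback-based length check is an assumption, not the answer's own code.

```python
import re

MIN_LEN = 20  # same threshold as the {20,} in the question's pattern

def _mask_if_long(match):
    # group(1) = opening quote, group(2) = quoted text, group(3) = closing quote
    if len(match.group(2)) >= MIN_LEN:
        return f"{match.group(1)}<quote>{match.group(3)}"
    return match.group(0)  # leave short quotations such as "courtship" untouched

def mask_quotes_re(text):
    # Straight quotes: [^"]* stops at the next quote, so spans are paired left to right.
    text = re.sub(r'(")([^"]*)(")', _mask_if_long, text)
    # Curly quotes are handled by their own, separate substitution.
    text = re.sub(r'(“)([^”]*)(”)', _mask_if_long, text)
    return text
```

Matching every quoted span and deciding on its length in the callback avoids a pitfall of putting `{20,}` directly in the pattern: the regex could otherwise pair the closing quote of one short quotation with the opening quote of the next and mask the text between them.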
Result: