Non-quoted quotes

In [1]: from nltk import SExprTokenizer
   ...:
   ...:
   ...: sentences = [
   ...:     """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
   ...:     """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
   ...: ]
   ...:
   ...: tokenizer = SExprTokenizer(parens='""', strict=False)
   ...: for sentence in sentences:
   ...:     for item in tokenizer.tokenize(sentence):
   ...:         print(item)
   ...:     print("----")
   ...:
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'

1 answer

User

#1 · Posted on 2024-09-29 23:18:21

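In fact, SExprTokenizer is also a regex-based approach, as you can see from the source code you linked to.
The source also shows that the authors apparently did not consider the case where the opening and the closing "paren" are the same character. The nesting depth is increased and decreased within the same iteration, so what the tokenizer sees as a quotation is an empty string.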
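
To make the depth problem concrete, here is a minimal sketch of a depth-counting loop of this kind. It is a simplified paraphrase written for illustration, not NLTK's actual source; the function name `sexpr_like_tokenize` and its parameters are hypothetical, and strict-mode error handling is left out.

```python
import re


def sexpr_like_tokenize(text, open_paren, close_paren):
    """Depth-counting tokenizer sketch (hypothetical, not NLTK's code)."""
    paren_re = re.compile(f"{re.escape(open_paren)}|{re.escape(close_paren)}")
    tokens, pos, depth = [], 0, 0
    for m in paren_re.finditer(text):
        paren = m.group()
        if depth == 0:
            # Text outside any group is split on whitespace.
            tokens += text[pos:m.start()].split()
            pos = m.start()
        if paren == open_paren:
            depth += 1
        # Deliberately `if`, not `elif`: when both delimiters are the same
        # character, this branch also fires for the "opening" quote.
        if paren == close_paren:
            depth = max(0, depth - 1)
            if depth == 0:
                tokens.append(text[pos:m.end()])
                pos = m.end()
    if pos < len(text):
        # Whatever follows the last delimiter is kept as one chunk.
        tokens.append(text[pos:])
    return tokens


print(sexpr_like_tokenize('say (How Doth) now', '(', ')'))
print(sexpr_like_tokenize('say "How Doth" now', '"', '"'))
```

With distinct delimiters the group `(How Doth)` survives as one token. With `"` as both delimiters, every quotation mark opens and closes a group in the same iteration, so each mark becomes a token of its own and the quoted words are treated as ordinary text outside any group.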

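I don't think identifying quotations is a common task in NLP. People use quotation marks in many different ways (especially when you deal with different languages...), so it is hard to handle them correctly in a robust way. For many NLP applications, I would say, quotations are simply ignored...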
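
If you only need to separate straight double-quoted spans from the rest, a plain regex may be enough. The sketch below is an assumption about what the question is after, and the helper `split_quoted` is hypothetical. It only pairs ASCII `"` characters and does not attempt single quotes, because those collide with apostrophes.

```python
import re

# Matches a span delimited by two straight double quotes (no nesting).
QUOTED = re.compile(r'"[^"]*"')


def split_quoted(text):
    """Return (is_quoted, chunk) pairs in document order."""
    parts, pos = [], 0
    for m in QUOTED.finditer(text):
        before = text[pos:m.start()].strip()
        if before:
            parts.append((False, before))
        parts.append((True, m.group()))
        pos = m.end()
    rest = text[pos:].strip()
    if rest:
        parts.append((False, rest))
    return parts


sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'""",
]
for sentence in sentences:
    for is_quoted, chunk in split_quoted(sentence):
        print("QUOTE" if is_quoted else "TEXT ", chunk)
    print("----")
```

The first sentence splits into three chunks with the quotation in the middle. The second comes back as a single unquoted chunk, which illustrates the robustness problem: the single quotes around `I'll try again.` cannot be told apart from the apostrophe in `I'll` by a pattern this simple.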
