<p>A rough solution using <code>spacy</code>. It already does a good job of tokenizing words.</p>
<pre><code>import spacy
s = 'hel-lo this has whi(.)te, space. very \n good'
# spaCy 3+ removed the 'en' shortcut; older versions used spacy.load('en')
nlp = spacy.load('en_core_web_sm')
ls = [t.text for t in nlp(s) if t.text.strip()]
print(ls)
# ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
</code></pre>
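<p>However, it also splits words joined by <code>-</code> into separate tokens, so I borrowed the solution from <a href="https://stackoverflow.com/questions/43550219/merge-elements-in-list-based-on-given-indices">here</a> to merge the words around each <code>-</code> back together.</p>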
^{pr2}$
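<p>As a minimal sketch of the merging step, assuming the token list looks like the output above: the helper below rejoins any standalone <code>-</code> with its left and right neighbours. The name <code>merge_hyphenated</code> is hypothetical, and this is not necessarily the index-based approach from the linked answer, just one simple way to get the same effect.</p>

```python
def merge_hyphenated(tokens):
    """Rejoin tokens that were split around a standalone '-'.

    Hypothetical helper for illustration; not taken from the linked answer.
    """
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # A '-' with a token on both sides glues its neighbours together.
        if tok == '-' and merged and i + 1 < len(tokens):
            merged[-1] = merged[-1] + '-' + tokens[i + 1]
            i += 2  # skip the right-hand token we just consumed
        else:
            merged.append(tok)
            i += 1
    return merged


# Token list as reported by the tokenization step above.
ls = ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(merge_hyphenated(ls))
```

<p>A leading or trailing <code>-</code> has no neighbour on one side, so the sketch leaves it as its own token. Chains such as <code>a-b-c</code> collapse into a single token because each merge extends the last merged entry.</p>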