擅长:python、mysql、java
<p>这将解决您当前的示例。它可以调整为更大的数据集。在</p>
<pre><code>import re
splitterForIndexing = re.compile(r"(?:[a-zA-Z0-9\-,]+[a-zA-Z0-9\-])|(?:[,.])")
source = "Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine is worse."
print "\n".join( splitterForIndexing.findall(source))
</code></pre>
<p>结果是:</p>
^{pr2}$
<p>对不起,没看到非常糟糕。如果有必要把这些词分开。。在</p>
<pre><code>import re
splitterForIndexing = re.compile(r"(?:[a-zA-Z]+)|(?:[a-zA-Z0-9][a-zA-Z0-9\-(),]+[a-zA-Z0-9\-()])|(?:[,.-])")
source = "Hello. 1-methyl-4-phenylpyridinium is ultra-bad. However, 1-methyl-4-phenyl-1,(2,3),6-tetrahydropyridine is worse."
print "\n".join( splitterForIndexing.findall(source))
</code></pre>
<p>给出:</p>
<pre><code>"""
Hello
.
1-methyl-4-phenylpyridinium
is
ultra
-
bad
.
However
,
1-methyl-4-phenyl-1,(2,3),6-tetrahydropyridine
is
worse
.
"""
</code></pre>