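<p>I suggest an NLP approach here, because I don't see how a regular expression could tell <code>nyears</code> (misspelled) apart from <code>new</code> (correctly spelled).</p>
<p>First, remove every standalone <code>r</code>/<code>n</code> and every one glued to a capitalized word or a number. Then split the string and run a spell checker on each word that starts with <code>n</code> or <code>r</code>. If <code>word[1:]</code> is correct and <code>word</code> is not, you can drop the first letter. If neither is correct, I think it is safe to fall back to <code>word</code>.</p>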
<p>For the spell check you can use, for example, <a href="https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction" rel="nofollow noreferrer">TextBlob</a>.</p>
<p>Here is a Python demo:</p>
<pre><code>from textblob import TextBlob
from textblob import Word
import re
s = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
s = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s)
result = []
for w in s.split():
    if not w.startswith(('n', 'r')):      # w does not start with n or r
        result.append(w)                  # keep it unchanged
    elif Word(w).correct() == w:          # w is already a correct word
        result.append(w)                  # keep it unchanged
    elif Word(w[1:]).correct() == w[1:]:  # w without its first letter is correct
        result.append(w[1:])              # drop the leading n/r
    else:
        result.append(w)                  # fallback: keep w unchanged
print(" ".join(result))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
</code></pre>
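<p>The <code>re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s)</code> part removes an <code>r</code> or <code>n</code> at the start of a word when it is immediately followed by an uppercase letter, a digit, whitespace, or the end of the string.</p>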
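<p>To see what the substitution does on its own, here is a minimal sketch that uses only the standard library. The helper name <code>strip_stray_letters</code> and the sample string are made up for illustration; the pattern is the one from the demo above.</p>

```python
import re

# Pattern copied from the answer: a word-initial r/n followed by an
# uppercase letter, a digit, whitespace, or the end of the string.
STRAY = re.compile(r'\b[rn](?=[A-Z0-9\s]|$)')

def strip_stray_letters(text):
    """Remove stray r/n characters (hypothetical helper name)."""
    return STRAY.sub('', text)

sample = "r n nFamily new n49 nyears r"
cleaned = strip_stray_letters(sample)

# Lowercase words such as 'new' and 'nyears' are untouched at this stage;
# they are left for the spell-check pass.
print(cleaned.split())  # prints ['Family', 'new', '49', 'nyears']
```

<p>Note that <code>nyears</code> survives this step, which is why the spell-check pass is still needed.</p>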
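<p>Then <code>for w in s.split():</code> iterates over the words in the sentence and replaces a word with <code>w[1:]</code> only if the word starts with <code>n</code> or <code>r</code>, is itself misspelled, and <code>w[1:]</code> is spelled correctly.</p>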
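<p>The per-word decision can also be isolated from any particular spell checker. The sketch below is only an illustration: <code>KNOWN_WORDS</code>, <code>is_known</code> and <code>fix_word</code> are hypothetical names, and the tiny word set stands in for a real checker so that the example runs without extra dependencies.</p>

```python
# Hypothetical stand-in for a real spell checker: a tiny known-word set.
KNOWN_WORDS = {"new", "years", "old", "right", "shoulder"}

def is_known(word):
    """Assumed checker: True if the word is in the toy vocabulary."""
    return word.lower() in KNOWN_WORDS

def fix_word(word, is_correct=is_known):
    """Drop a leading n/r only when that turns a misspelled word into a correct one."""
    if not word.startswith(('n', 'r')):
        return word          # nothing to check
    if is_correct(word):
        return word          # already a correct word, e.g. 'new'
    if is_correct(word[1:]):
        return word[1:]      # e.g. 'nyears' -> 'years'
    return word              # neither form is recognized: keep the original

words = "new nyears rold right nxyz shoulder".split()
fixed = [fix_word(w) for w in words]
print(fixed)  # prints ['new', 'years', 'old', 'right', 'nxyz', 'shoulder']
```

<p>To plug in TextBlob instead of the toy set, pass a checker built from the same comparison used in the demo, for example <code>lambda w: Word(w).correct() == w</code>.</p>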
<p><strong>Disclaimer</strong>: <code>TextBlob</code> is used only as an example; you are free to use any other spell-checking library. <a href="https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction" rel="nofollow noreferrer">TextBlob spellchecking</a> is "<em>based on Peter Norvig's "How to Write a Spelling Corrector"<a href="https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction" rel="nofollow noreferrer">1</a> as implemented in the pattern library. It is about 70% accurate</em>".</p>