<p>Correct the spelling first, then merge.</p>
<pre><code>import pandas as pd
import operator, collections
# DataFrame.from_items was removed in pandas 1.0; a plain dict does the same job
df1 = pd.DataFrame({"subjects":
    ["Biology","Physicss","Phsicss","Chemistry",
     "Biology","Physics","Physics","Biolgy","navelgazing"]})
df2 = pd.DataFrame({"subject_name":
    ["Biology","Physics","Chemistry","Medical",
     "Programming","Maths","Literature"],
    "subject_code":
    ["BIO","PHY","CHE","MED","PRO","MAT","LIT"]})
</code></pre>
<p>Find the misspellings:</p>
^{pr2}$
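<p>The snippet for this step is missing here. A minimal sketch, assuming <code>misspelled</code> (the name used further down) is meant to hold the distinct values of <code>df1.subjects</code> that have no exact match in <code>df2.subject_name</code>; the variable <code>not_known</code> is only an illustrative name:</p>

```python
import pandas as pd

df1 = pd.DataFrame({"subjects":
    ["Biology", "Physicss", "Phsicss", "Chemistry",
     "Biology", "Physics", "Physics", "Biolgy", "navelgazing"]})
df2 = pd.DataFrame({"subject_name":
    ["Biology", "Physics", "Chemistry", "Medical",
     "Programming", "Maths", "Literature"],
    "subject_code":
    ["BIO", "PHY", "CHE", "MED", "PRO", "MAT", "LIT"]})

# Assumption: a "misspelling" is any subject with no exact match in df2.subject_name
not_known = ~df1["subjects"].isin(df2["subject_name"])
misspelled = df1.loc[not_known, "subjects"].unique()
print(list(misspelled))  # ['Physicss', 'Phsicss', 'Biolgy', 'navelgazing']
```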
<p>Find the subject that best matches each misspelling and build a dictionary of the form <code>{mis_sp: subject_name}</code>:</p>
<pre><code>difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        # letters present in one word but not the other, counted both ways
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    # the candidate with the smallest letter difference wins
    return subject(min(temp, key = difference))
def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)
d = dict(foo2(misspelled))
</code></pre>
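<p>To see what the letter-count comparison in <code>foo1</code> actually measures, here is a small standalone sketch. The helper name <code>letter_distance</code> and the plain list <code>subjects</code> are made up for illustration; the arithmetic is the same as in <code>foo1</code>.</p>

```python
import collections

def letter_distance(word, candidate):
    """Hypothetical helper: number of letters found in one word but not the other."""
    count1 = collections.Counter(word)
    count2 = collections.Counter(candidate)
    # Counter subtraction drops zero and negative counts, so take both directions
    return sum((count1 - count2).values()) + sum((count2 - count1).values())

subjects = ["Biology", "Physics", "Chemistry", "Medical",
            "Programming", "Maths", "Literature"]

for word in ["Phsicss", "Biolgy", "navelgazing"]:
    best = min(subjects, key=lambda s: letter_distance(word, s))
    print(word, "->", best, letter_distance(word, best))
# Phsicss -> Physics 2
# Biolgy -> Biology 1
# navelgazing -> Medical 10
```

<p>Two things follow from this. The comparison ignores letter order, so anagrams count as identical. And because <code>min</code> always returns some candidate, even a word like "navelgazing" gets mapped to a subject, so the <code>if name:</code> check in <code>foo2</code> never filters anything out with this version of <code>foo1</code>.</p>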
<p>Correct all the misspellings in <code>df1</code>:</p>
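<pre><code>def foo3(thing):
    # fall back to the original value when there is no correction for it
    return d.get(thing, thing)
# on pandas 2.1+, DataFrame.map is the non-deprecated name for applymap
df3 = df1.applymap(foo3)
</code></pre>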
<p>Merge:</p>
<pre><code>df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')
</code></pre>
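<p>With <code>how = 'left'</code>, every row of the left frame is kept, and a subject that has no entry in the lookup table simply gets a missing <code>subject_code</code>. A minimal sketch on made-up miniature data (the frames <code>left</code> and <code>right</code> are illustrative, not the ones above):</p>

```python
import pandas as pd

# Hypothetical miniature data: one subject with a known code, one without
left = pd.DataFrame({"subjects": ["Biology", "navelgazing"]})
right = pd.DataFrame({"subject_name": ["Biology", "Physics"],
                      "subject_code": ["BIO", "PHY"]}).set_index("subject_name")

merged = left.merge(right, left_on="subjects", right_index=True, how="left")
print(merged)

# Rows that found no code can be picked out afterwards
unmatched = merged[merged["subject_code"].isna()]
print(unmatched["subjects"].tolist())  # ['navelgazing']
```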
<hr/>
<p><code>foo1</code> is probably good enough, but there are better, more sophisticated algorithms for correcting spelling. One possibility: <a href="http://norvig.com/spell-correct.html" rel="nofollow">http://norvig.com/spell-correct.html</a></p>
<p>I just read Connor's solution. I didn't know <code>difflib</code> existed, so a better <code>foo1</code> would be:</p>
<pre><code>import difflib
def foo1(word, candidates):
    try:
        # n=1: keep only the single best match above difflib's default cutoff
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError as e:
        # there isn't a close match
        return None
</code></pre>
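<p>A quick standalone check of how the <code>difflib</code> approach behaves on the sample words. The wrapper name <code>closest_subject</code> and its <code>cutoff</code> parameter are made up for this sketch; <code>cutoff=0.6</code> is simply <code>get_close_matches</code>'s documented default, written out so it can be tuned.</p>

```python
import difflib

subjects = ["Biology", "Physics", "Chemistry", "Medical",
            "Programming", "Maths", "Literature"]

def closest_subject(word, candidates, cutoff=0.6):
    """Hypothetical wrapper: best match above the cutoff, or None if there is none."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_subject("Phsicss", subjects))      # Physics
print(closest_subject("Biolgy", subjects))       # Biology
print(closest_subject("navelgazing", subjects))  # None
```

<p>Unlike the letter-count version, this one returns <code>None</code> for a word with no reasonable match, so <code>foo2</code> skips it, the value stays unchanged in <code>df1</code>, and the left merge leaves its <code>subject_code</code> empty.</p>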