Python: Replace the values of df1.col with the values of df2.col2 (question and answers) - Python中文网


<p>I have two DataFrames, df1 and df2. df1 has 50 columns and df2 has 50+ columns. df1 has 13,000 rows and a column named subject, which holds the names of all the subjects. df2 has 250 rows and, along with its 50+ columns, two columns named subject code and subject_name.</p>
<pre><code>Here is an example of my datasets:

df1 =
index  subjects
0      Biology
1      Physicss
2      Chemistry
3      Biology
4      Physics
5      Physics
6      Biolgy

df2 =
index  subject_name  subject_code
0      Biology       BIO
1      Physics       PHY
2      Chemistry     CHE
3      Medical       MED
4      Programming   PRO
5      Maths         MAT
6      Literature    LIT

My desired output in df1 (after replacing subject_name and fixing the spelling errors) is:

index  subjects   subject_code
0      Biology    BIO
1      Physics    PHY
2      Chemistry  CHE
3      Biology    BIO
4      Physics    PHY
5      Physics    PHY
6      Biology    BIO
</code></pre>
<p>In the end, I want to merge all the subject values in df1 with the subject name values in df2. After merging the two columns into one, about 500 rows in df1 come out as NaN, because the subject is spelled slightly differently in those 500 rows. I tried the solutions given in the following links, but they did not work for me: <a href="https://stackoverflow.com/questions/34946913/replace-df-index-values-with-values-from-a-list-but-ignore-empty-strings">replace df index values with values from a list but ignore empty strings</a></p>
<p><a href="https://stackoverflow.com/questions/31751230/python-pandas-replace-values-multiple-columns-matching-multiple-columns-from-an">Python pandas: replace values multiple columns matching multiple columns from another dataframe</a></p>
<p>Can anyone tell me how to solve this? I have already spent 8 hours on this problem and cannot solve it.</p>
<p>Cheers</p>


1 answer

Anonymous, 1 day ago


<p>Correct the spelling, then merge...</p>
<pre><code>import pandas as pd
import operator, collections

# DataFrame.from_items was removed in pandas 1.0; a dict builds the same frames.
df1 = pd.DataFrame({"subjects": ["Biology", "Physicss", "Phsicss", "Chemistry",
                                 "Biology", "Physics", "Physics", "Biolgy",
                                 "navelgazing"]})
df2 = pd.DataFrame({"subject_name": ["Biology", "Physics", "Chemistry", "Medical",
                                     "Programming", "Maths", "Literature"],
                    "subject_code": ["BIO", "PHY", "CHE", "MED", "PRO", "MAT", "LIT"]})
</code></pre>
<p>Find the misspellings: collect the values of <code>df1.subjects</code> that do not appear in <code>df2.subject_name</code> into <code>misspelled</code>, which the next step uses.</p>
<p>Find the subject that best matches each misspelling and build a dictionary -&gt; {mis_sp: subject_name}</p>
<pre><code>difference = operator.itemgetter(1)
subject = operator.itemgetter(0)

def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        # Compare letter counts; a smaller total difference means a closer match.
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    return subject(min(temp, key = difference))

def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)

d = dict(foo2(misspelled))
</code></pre>
<p>Correct all the misspellings in df1</p>
<pre><code>def foo3(thing):
    # Return the corrected spelling if there is one, else the value unchanged.
    return d.get(thing, thing)

# applymap is deprecated since pandas 2.1, where DataFrame.map is the newer name.
df3 = df1.applymap(foo3)
</code></pre>
<p>Merge</p>
<pre><code>df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')
</code></pre>
<hr/>
<p><code>foo1</code> is probably good enough, but there are better, more sophisticated algorithms for correcting spelling. Perhaps <a href="http://norvig.com/spell-correct.html" rel="nofollow">http://norvig.com/spell-correct.html</a></p>
<p>I just read 康纳's solution. I did not know difflib existed, so a better <code>foo1</code> would be:</p>
<pre><code>import difflib

def foo1(word, candidates):
    try:
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError as e:
        # there isn't a close match
        return None
</code></pre>
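<p>For reference, the steps above can be put together into one short script. This is only a sketch under some assumptions: <code>misspelled</code> is taken to be the set of <code>df1.subjects</code> values with no exact match in <code>df2.subject_name</code>, the helper name <code>closest</code> and the <code>cutoff=0.6</code> threshold (the difflib default) are illustrative choices, and only the <code>subjects</code> column is corrected rather than every cell of df1.</p>

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({"subjects": ["Biology", "Physicss", "Phsicss", "Chemistry",
                                 "Biology", "Physics", "Physics", "Biolgy",
                                 "navelgazing"]})
df2 = pd.DataFrame({"subject_name": ["Biology", "Physics", "Chemistry", "Medical",
                                     "Programming", "Maths", "Literature"],
                    "subject_code": ["BIO", "PHY", "CHE", "MED", "PRO", "MAT", "LIT"]})

known = list(df2["subject_name"])

# Assumption: a value is "misspelled" when it has no exact match in df2.
misspelled = set(df1["subjects"]) - set(known)


def closest(word, candidates, cutoff=0.6):
    """Return the closest candidate, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Map each misspelling to its best match; words with no close match are left out.
corrections = {}
for word in misspelled:
    match = closest(word, known)
    if match is not None:
        corrections[word] = match

# Fix the spelling in a copy of df1, then left-merge to pick up the codes.
fixed = df1.copy()
fixed["subjects"] = fixed["subjects"].map(lambda s: corrections.get(s, s))
result = fixed.merge(df2, left_on="subjects", right_on="subject_name", how="left")
result = result.drop(columns="subject_name")

print(result)
```

<p>Values with no sufficiently close match (here, "navelgazing") are left unchanged and end up with NaN in <code>subject_code</code>, so they can be reviewed by hand. On real data the cutoff may need tuning, and it is worth inspecting <code>corrections</code> before merging.</p>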