Python：使用df2.col2的值替换df1.col的值

Here is an example of my datasets: df1 = index subjects 0 Biology 1 Physicss 2 Chemistry 3 Biology 4 Physics 5 Physics 6 Biolgy df2 = index subject_name subject_code 0 Biology BIO 1 Physics PHY 2 Chemistry CHE 3 Medical MED 4 Programming PRO 5 Maths MAT 6 Literature LIT My desired output in df1 (after replacing subject_name and fixing the spelling errors) is: index subjects subject_code 0 Biology BIO 1 Physics PHY 2 Chemistry CHE 3 Biology BIO 4 Physics PHY 5 Physics PHY 6 Biology BIO

2条回答

网友

1楼 · 编辑于 2024-09-28 20:43:51

你的问题之一是拼写错误。您可以尝试使用difflib模块及其get_close_matches方法在dataframes之间协调主题的拼写。在

使用此代码将为df1和df2中的每个匹配返回最接近的匹配主题。df1's列将被更新以反映这一点。因此，即使主题名称拼写不正确，它现在在dataframes中的拼写也将相同。在

import pandas as pd
import difflib

df2['subject_name'] = df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0])

在此之后，您可以尝试合并。它可能会解决您的问题，但如果您提供一个可复制的示例，则更容易修复。在

网友

2楼 · 编辑于 2024-09-28 20:43:51

更正拼写然后合并。。。在

import pandas as pd
import operator, collections

df1 = pd.DataFrame.from_items([("subjects",
                                ["Biology","Physicss","Phsicss","Chemistry",
                                 "Biology","Physics","Physics","Biolgy","navelgazing"])])
df2 = pd.DataFrame.from_items([("subject_name",
                                ["Biology","Physics","Chemistry","Medical",
                                 "Programming","Maths","Literature"]),
                               ("subject_code",
                                ["BIO","PHY","CHE","MED","PRO","MAT","LIT"])])

找出拼写错误：

^{pr2}$

找到与拼写错误最匹配的主题并创建词典->；{mis_sp:subject_name}

difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    return subject(min(temp, key = difference))

def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)

d = dict(foo2(misspelled))

更正df1中的所有拼写错误

def foo3(thing):
    return d.get(thing, thing)

df3 = df1.applymap(foo3)

合并

df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')

foo1可能已经足够了，但是有更好、更复杂的算法来纠正拼写。也许，http://norvig.com/spell-correct.html

读一下康纳的解决方案。我不知道difflib在那里，所以foo1会更好

def foo1(word, candidates):
    try:
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError as e:
        # there isn't a close match
        return None

相关问题更多 >

编程相关推荐

热门问题

热门文章