Python:使用df2.col2的值替换df1.col的值

2024-09-28 20:43:51 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个数据帧df1和df2。 在df1中我有50列,在df2中我有50+列。我想要的是 在df1中,我有13000行和一个列名subject,其中给出了所有主题的名称。 在df2中,我有250行,沿着50+我有两列分别命名为subject code和subject_name。在

        Here is an example of my datasets:

        df1 = 
        index     subjects
        0         Biology
        1         Physicss
        2         Chemistry
        3         Biology
        4         Physics
        5         Physics
        6         Biolgy

    df2 = 
        index     subject_name    subject_code
        0         Biology         BIO
        1         Physics         PHY
        2         Chemistry       CHE
        3         Medical         MED
        4         Programming     PRO
        5         Maths           MAT
        6         Literature      LIT 

My desired output in df1 (after replacing subject_name and fixing the spelling errors) is:
            index     subjects        subject_code
            0         Biology         BIO
            1         Physics         PHY
            2         Chemistry       CHE
            3         Biology         BIO
            4         Physics         PHY
            5         Physics         PHY
            6         Biology         BIO

最后,我希望将df1中的所有subject值与df2 subject name value中的值合并。在df1中,当我将两列合并为一列后,大约有500行得到NAN,因为在这500行中,主题的拼写有一些不同。 我尝试过在以下链接中给出的解决方案,但对我无效: replace df index values with values from a list but ignore empty strings

Python pandas: replace values multiple columns matching multiple columns from another dataframe

^{pr2}$

有谁能告诉我如何解决这个问题,因为我已经花了8个小时在这个问题上,但无法解决它。在

干杯


Tags: name主题indexisphycodebiosubject
2条回答

你的问题之一是拼写错误。您可以尝试使用difflib模块及其get_close_matches方法在dataframes之间协调主题的拼写。在

使用此代码将为df1df2中的每个匹配返回最接近的匹配主题。df1's列将被更新以反映这一点。因此,即使主题名称拼写不正确,它现在在dataframes中的拼写也将相同。在

import pandas as pd
import difflib

df2['subject_name'] = df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0])

在此之后,您可以尝试合并。它可能会解决您的问题,但如果您提供一个可复制的示例,则更容易修复。在

更正拼写然后合并。。。在

import pandas as pd
import operator, collections

df1 = pd.DataFrame.from_items([("subjects",
                                ["Biology","Physicss","Phsicss","Chemistry",
                                 "Biology","Physics","Physics","Biolgy","navelgazing"])])
df2 = pd.DataFrame.from_items([("subject_name",
                                ["Biology","Physics","Chemistry","Medical",
                                 "Programming","Maths","Literature"]),
                               ("subject_code",
                                ["BIO","PHY","CHE","MED","PRO","MAT","LIT"])])

找出拼写错误:

^{pr2}$

找到与拼写错误最匹配的主题并创建词典->;{mis_sp:subject_name}

difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    return subject(min(temp, key = difference))

def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)

d = dict(foo2(misspelled))               

更正df1中的所有拼写错误

def foo3(thing):
    return d.get(thing, thing)

df3 = df1.applymap(foo3)

合并

df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')

foo1可能已经足够了,但是有更好、更复杂的算法来纠正拼写。也许,http://norvig.com/spell-correct.html

读一下康纳的解决方案。我不知道difflib在那里,所以foo1会更好

def foo1(word, candidates):
    try:
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError as e:
        # there isn't a close match
        return None

相关问题 更多 >