Finding the unique characters when comparing two columns

Posted 2024-10-01 04:57:39


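I want to compare column1 with column2 and extract the part of column2 that does not appear in column1, so that the differences are visible. In this example, the answers I should get are 'Residence - Location', '-12', 'NAN', and 'NA' for the empty value. In other words, the first column is compared against the second.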

Also, can the result be created and stored in another column?

Result
index   column1         column2                     diff
1.      Admission Date  Residence - Location        Residence - Location
2.      Malnutrition    Malnutrition-12             -12
3.      TB              NAN                         NAN
4.      Anaemia         NA                          NA

The code can be in either R or Python. I don't mind which.

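import numpy as np
import pandas as pd

def FindDifference(Row):
    x = Row['column1']
    y = Row['column2']

    Difference = ""
    # Missing or placeholder values in column2 yield a missing diff
    if pd.isnull(y) or y == "nan" or y == "NA":
        return np.nan
    # Character-by-character comparison: collect the characters of the
    # longer string that do not occur anywhere in the other string
    if len(x) <= len(y):
        for i in y:
            if i not in x:
                Difference += str(i)
    else:
        for i in x:
            if i not in y:
                Difference += str(i)
    return Difference

# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
ReadDataT = Final_df[['column1', 'column2']].copy()
ReadDataT['diff'] = ReadDataT.apply(FindDifference, axis=1)
ReadDataT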

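The problem with this code is that it compares the two strings character by character, so the result is made up of individual characters that do not occur in both columns rather than the differing text as a whole. For example, the first row gives "RC Lc" as the diff.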
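To make the difference between the two approaches concrete, here is a minimal sketch in plain Python. The helper names `char_diff` and `substring_diff` are hypothetical and not part of the original code; `char_diff` only mirrors the branch of the loop above where column2 is at least as long as column1.

```python
def char_diff(x: str, y: str) -> str:
    """Character-by-character comparison: keep each character of y that does not occur in x."""
    return "".join(ch for ch in y if ch not in x)


def substring_diff(x: str, y: str) -> str:
    """Remove x from y as one whole substring and trim surrounding whitespace."""
    return y.replace(x, "").strip()


# When column1 is a prefix of column2, both approaches happen to agree.
print(repr(char_diff("Malnutrition", "Malnutrition-12")))
print(repr(substring_diff("Malnutrition", "Malnutrition-12")))

# When the strings are unrelated, the character-wise version returns scattered
# characters, while the substring version leaves column2 unchanged.
print(repr(char_diff("Admission Date", "Residence - Location")))
print(repr(substring_diff("Admission Date", "Residence - Location")))
```

The character-wise version drops every character of column2 that happens to occur anywhere in column1, which is why unrelated strings produce only a few leftover letters.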


3 Answers

For Python:

import numpy as np

# Fill missing values with '' so str.replace works on every row, then restore NaN afterwards
df = df.fillna('')
df['diff'] = df.apply(lambda x: x['column2'].replace(x['column1'], '').strip(), axis=1)
df = df.replace('', np.nan)

Output:

          column1               column2                  diff
0  Admission Date  Residence - Location  Residence - Location
1    Malnutrition       Malnutrition-12                   -12
2              TB                   NaN                   NaN
3         Anaemia                   NaN                   NaN
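For readers who want to run this end to end, the following is a self-contained sketch of the same idea. The sample DataFrame is rebuilt from the question's table under the assumption that the last two column2 values were read as missing, and `remove_substring` is a hypothetical helper that leaves missing values alone instead of converting them to empty strings and back.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: the last two
# column2 values are missing rather than the literal strings "NAN"/"NA").
df = pd.DataFrame({
    "column1": ["Admission Date", "Malnutrition", "TB", "Anaemia"],
    "column2": ["Residence - Location", "Malnutrition-12", np.nan, np.nan],
})


def remove_substring(row):
    """Hypothetical helper: strip the column1 text out of column2."""
    if pd.isna(row["column2"]):
        return np.nan              # nothing to compare, keep it missing
    if pd.isna(row["column1"]):
        return row["column2"]      # nothing to remove
    return row["column2"].replace(row["column1"], "").strip()


df["diff"] = df.apply(remove_substring, axis=1)
print(df)
```

Handling the missing values inside the helper avoids touching other columns of the frame, at the cost of a slightly longer function than the three-line version above.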

In base R, we can use sub with mapply:

df$diff <- mapply(function(x, y) sub(x, "", y), df$column1, df$column2)

df
#  index        column1              column2                 diff
#1     1 Admission Date Residence - Location Residence - Location
#2     2   Malnutrition      Malnutrition-12                  -12
#3     3             TB                  NAN                  NAN
#4     4        Anaemia                 <NA>                 <NA>

With dplyr and stringr:

library(dplyr)
library(stringr)
df %>% mutate(diff = str_remove(column2, column1))

  index        column1              column2                 diff
1     1 Admission Date Residence - Location Residence - Location
2     2   Malnutrition      Malnutrition-12                  -12
3     3             TB                  NAN                  NAN
4     4        Anaemia                 <NA>                 <NA>

Edit: the same without dplyr:

df$diff = stringr::str_remove(df$column2, df$column1)
