合并两个excel文件时出现问题

2024-09-28 20:20:26 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个excel文件。你知道吗

这些文件唯一的共同点就是dbsid。你知道吗

在第一个excel(SQL)中,dbsid称为“样本卡ID”,在另一个excel(EMEA)中,dbsid称为“条形码”

import pandas as pd

excel_file = "eu-tracker.xlsx"
sql = pd.read_excel(excel_file, sheet_name=0, date_parser=True)
emea = pd.read_excel(excel_file, sheet_name=1, date_parser=True)

sql.drop_duplicates(inplace=True)
emea.drop_duplicates(inplace=True)

data = pd.merge(left=sql, right=emea, left_on="ID of Sample Card", right_on="Barcode", how="left")

SQL数据帧:

      "OrderID"   "Creation Date"   "User ID"   "Days in Lab"   "Gender"    "Sample Date"   "ID of Sample Card" "System Sample ID"  "OrderStatus"       "Sample Received"       ...
493     1234         10.11.1900      20202           3           Male        10.11.1900          5050123            1234             REPORT_AVAILABLE       13.11.1900          ...

EMEA数据帧:

        "Barcode"   "Eingangsdatum" "Befunddatum "Befunddatum     "Befunddatum  "Biochemie   "Biochemie     "Ergebnis   "Biochemistry "Diagnosis"   "Diagnosis_2"   "Labornumber"   "Age"   "Sex"
                                 Biochemie"   Biochemie2"     Lyso-GL-1"    Ergebnis"    Ergebnis2"     Lyso-GL-1"  report"     
3123     5050123     13.11.1900      22.11.1900   22.11.1900       23.01.1900   0,178852201   20,11343324     165,4     aberrant        Gaucher      Niemann Pick       184094       65       M

预期数据帧:

         "OrderID"  "Creation Date"   "User ID" "Days in Lab"   "Gender"    "Sample Date"   "ID of Sample Card" "System Sample ID"  "OrderStatus"       "Sample Received"       ...     "Eingangsdatum" "Befunddatum    "Befunddatum    "Befunddatum    "Biochemie      "Biochemie     "Ergebnis   "Biochemistry "Diagnosis"    "Diagnosis_2"   "Labornumber"   "Age"   "Sex"
                                                                                                                                                                                                    Biochemie"       Biochemie2"     Lyso-GL-1"      Ergebnis"       Ergebnis2"     Lyso-GL-1"  report"     
493        1234      10.11.1900        20202          3          Male        10.11.1900          5050123            1234             REPORT_AVAILABLE       13.11.1900          ...     13.11.1900       22.11.1900      22.11.1900      23.01.1900      0,178852201     20,11343324     165,4      aberrant       Gaucher       Niemann Pick      184094        65       M

结果得到的数据帧:

        "OrderID"   "Creation Date" "User ID"   "Days in Lab"   "Gender"    "Sample Date"   "ID of Sample Card" "System Sample ID"  "OrderStatus"       "Sample Received"       ...     "Eingangsdatum" "Befunddatum    "Befunddatum    "Befunddatum    "Biochemie      "Biochemie     "Ergebnis   "Biochemistry "Diagnosis"    "Diagnosis_2"   "Labornumber"   "Age"   "Sex"
                                                                                                                                                                                                    Biochemie"       Biochemie2"     Lyso-GL-1"      Ergebnis"       Ergebnis2"     Lyso-GL-1"  report"     
493     1234        10.11.1900       20202           3          Male         10.11.1900          5050123            1234             REPORT_AVAILABLE       13.11.1900          ...         NaN             NaN              NaN            NaN             NaN             NaN           NaN         NaN           NaN             NaN             NaN          NaN     NaN

SQL数据帧信息:

RangeIndex: 2443 entries, 0 to 2442
Data columns (total 64 columns):
OrderID                                                                      2443 non-null float64
Creation Date                                                                2443 non-null datetime64[ns]
User ID                                                                      2443 non-null float64
Days in Lab                                                                  2443 non-null object
Gender                                                                       2443 non-null object
Sample Date                                                                  2443 non-null datetime64[ns]
ID of Sample Card                                                            2443 non-null object
System Sample ID                                                             2443 non-null float64
OrderStatus                                                                  2443 non-null object
Sample Received                                                              2443 non-null object
dtypes: datetime64[ns](2), float64(3), int64(41), object(18)
memory usage: 1.2+ MB

Emea数据帧信息:

RangeIndex: 3134 entries, 0 to 3133
Data columns (total 14 columns):
Barcode                   3134 non-null object
Eingangsdatum             3134 non-null datetime64[ns]
Befunddatum Biochemie     2973 non-null object
Befunddatum Biochemie2    1413 non-null object
Befunddatum Lyso-GL-1     151 non-null object
Biochemie Ergebnis        2973 non-null float64
Biochemie Ergebnis2       1476 non-null float64
Ergebnis Lyso-GL-1        151 non-null float64
Biochemistry report       3134 non-null object
Diagnosis                 2972 non-null object
Diagnosis_2               1475 non-null object
Labornummer               3134 non-null object
Alter                     3134 non-null int64
Sex                       3134 non-null object
dtypes: datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 342.9+ KB

在这些步骤之后,文件有更多的头文件,而没有来自另一个文件的数据。我也试过加入,但效果不太好。你知道吗

我不知道该怎么把这两者结合起来。你知道吗


Tags: sampleiddateobjectnanexcelnullnon
2条回答

问题是,这两个系列的对象类型。你知道吗

将两个级数都转换为整数

sql["ID of Sample Card"] = pd.to_numeric(sql["ID of Sample Card"], errors="coerce", downcast="integer")
emea["Barcode"] = pd.to_numeric(emea["Barcode"], errors="coerce", downcast="integer")

在那之后,我可以毫无问题地合并它们

data = pd.merge(left=sql, right=emea, left_on="ID of Sample Card", right_on="Barcode", how="left")

与上述答案不同的是,序列中所有非数字字段都将为NaN

sql.["ID of Sample Card"]emea.["Barcode"]都是object数据类型。我无法从原始问题中的示例数据确定它们是否有前导空格或尾随空格,但这可能会导致连接两个数据帧时出错,即使数据看起来相同。你知道吗

如果您确信这两个列都是数字且不为空,那么可以使用astype将它们转换为整数,但您可能需要首先清理数据。例如:

sql["ID of Sample Card"] = sql["ID of Sample Card"].str.strip().astype('int')
emea["Barcode"] = emea["Barcode"].str.strip().astype('int')

相关问题 更多 >