我有两个excel文件。你知道吗
这些文件唯一的共同点就是dbsid。你知道吗
在第一个excel(SQL)中,dbsid称为“样本卡ID”,在另一个excel(EMEA)中,dbsid称为“条形码”
import pandas as pd
excel_file = "eu-tracker.xlsx"
sql = pd.read_excel(excel_file, sheet_name=0, date_parser=True)
emea = pd.read_excel(excel_file, sheet_name=1, date_parser=True)
sql.drop_duplicates(inplace=True)
emea.drop_duplicates(inplace=True)
data = pd.merge(left=sql, right=emea, left_on="ID of Sample Card", right_on="Barcode", how="left")
SQL数据帧:
"OrderID" "Creation Date" "User ID" "Days in Lab" "Gender" "Sample Date" "ID of Sample Card" "System Sample ID" "OrderStatus" "Sample Received" ...
493 1234 10.11.1900 20202 3 Male 10.11.1900 5050123 1234 REPORT_AVAILABLE 13.11.1900 ...
EMEA数据帧:
"Barcode" "Eingangsdatum" "Befunddatum "Befunddatum "Befunddatum "Biochemie "Biochemie "Ergebnis "Biochemistry "Diagnosis" "Diagnosis_2" "Labornumber" "Age" "Sex"
Biochemie" Biochemie2" Lyso-GL-1" Ergebnis" Ergebnis2" Lyso-GL-1" report"
3123 5050123 13.11.1900 22.11.1900 22.11.1900 23.01.1900 0,178852201 20,11343324 165,4 aberrant Gaucher Niemann Pick 184094 65 M
预期数据帧:
"OrderID" "Creation Date" "User ID" "Days in Lab" "Gender" "Sample Date" "ID of Sample Card" "System Sample ID" "OrderStatus" "Sample Received" ... "Eingangsdatum" "Befunddatum "Befunddatum "Befunddatum "Biochemie "Biochemie "Ergebnis "Biochemistry "Diagnosis" "Diagnosis_2" "Labornumber" "Age" "Sex"
Biochemie" Biochemie2" Lyso-GL-1" Ergebnis" Ergebnis2" Lyso-GL-1" report"
493 1234 10.11.1900 20202 3 Male 10.11.1900 5050123 1234 REPORT_AVAILABLE 13.11.1900 ... 13.11.1900 22.11.1900 22.11.1900 23.01.1900 0,178852201 20,11343324 165,4 aberrant Gaucher Niemann Pick 184094 65 M
结果得到的数据帧:
"OrderID" "Creation Date" "User ID" "Days in Lab" "Gender" "Sample Date" "ID of Sample Card" "System Sample ID" "OrderStatus" "Sample Received" ... "Eingangsdatum" "Befunddatum "Befunddatum "Befunddatum "Biochemie "Biochemie "Ergebnis "Biochemistry "Diagnosis" "Diagnosis_2" "Labornumber" "Age" "Sex"
Biochemie" Biochemie2" Lyso-GL-1" Ergebnis" Ergebnis2" Lyso-GL-1" report"
493 1234 10.11.1900 20202 3 Male 10.11.1900 5050123 1234 REPORT_AVAILABLE 13.11.1900 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SQL数据帧信息:
RangeIndex: 2443 entries, 0 to 2442
Data columns (total 64 columns):
OrderID 2443 non-null float64
Creation Date 2443 non-null datetime64[ns]
User ID 2443 non-null float64
Days in Lab 2443 non-null object
Gender 2443 non-null object
Sample Date 2443 non-null datetime64[ns]
ID of Sample Card 2443 non-null object
System Sample ID 2443 non-null float64
OrderStatus 2443 non-null object
Sample Received 2443 non-null object
dtypes: datetime64[ns](2), float64(3), int64(41), object(18)
memory usage: 1.2+ MB
Emea数据帧信息:
RangeIndex: 3134 entries, 0 to 3133
Data columns (total 14 columns):
Barcode 3134 non-null object
Eingangsdatum 3134 non-null datetime64[ns]
Befunddatum Biochemie 2973 non-null object
Befunddatum Biochemie2 1413 non-null object
Befunddatum Lyso-GL-1 151 non-null object
Biochemie Ergebnis 2973 non-null float64
Biochemie Ergebnis2 1476 non-null float64
Ergebnis Lyso-GL-1 151 non-null float64
Biochemistry report 3134 non-null object
Diagnosis 2972 non-null object
Diagnosis_2 1475 non-null object
Labornummer 3134 non-null object
Alter 3134 non-null int64
Sex 3134 non-null object
dtypes: datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 342.9+ KB
在这些步骤之后,文件有更多的头文件,而没有来自另一个文件的数据。我也试过加入,但效果不太好。你知道吗
我不知道该怎么把这两者结合起来。你知道吗
问题是,这两个系列的对象类型。你知道吗
将两个级数都转换为整数
在那之后,我可以毫无问题地合并它们
与上述答案不同的是,序列中所有非数字字段都将为NaN
sql.["ID of Sample Card"]
和emea.["Barcode"]
都是object
数据类型。我无法从原始问题中的示例数据确定它们是否有前导空格或尾随空格,但这可能会导致连接两个数据帧时出错,即使数据看起来相同。你知道吗如果您确信这两个列都是数字且不为空,那么可以使用
astype
将它们转换为整数,但您可能需要首先清理数据。例如:相关问题 更多 >
编程相关推荐