How to find missing values by comparing two DataFrames

Posted 2024-09-27 22:21:58

I just want to find the values that are missing between two DataFrames. Here is the code I tried, and it works well:

import pandas as pd


df1 = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["my_column"])
df2 = pd.DataFrame([1, 2, 3], columns=["my_column"])

result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()


print(result)

Output:

   my_column
3          4
4          5
5          6

So it works fine on static DataFrames.

But when I use the SQL-based code, I run into a problem. Here is my full code:

import pyodbc
import pandas as pd
import os
import sqlalchemy as db
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Date, Float
import datetime as dt

# connect db
engine = create_engine('mssql+pyodbc://xxxxxxxxxx\SMARTRNO_EXPRESS/myDB?driver=SQL+Server+Native+Client+11.0')
connection = engine.connect()


esn_datafeed_query = 'SELECT * FROM [myDB].[dbo].[esn_datafeed]'
esn_inter_intra_query = 'SELECT * FROM [esn_inter_intra_merge]'

esn_datafeed_df = pd.read_sql(esn_datafeed_query ,engine)
esn_inter_intra_merge_df = (esn_inter_intra_query, engine)

df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])

result = df1[~df1.set_index(list(df1)).index.isin(df2.set_index(list(df2)).index)].dropna()


print(result)

That code displays all the values, which is not what I need; I only want the missing ones. I then tried a different approach:

esn_datafeed_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_datafeed]', engine)
esn_inter_intra_merge_df = pd.read_sql('SELECT * FROM [myDB].[dbo].[esn_inter_intra_merge]', engine)

df1 = pd.DataFrame(esn_datafeed_df, columns=["st_umts_df_relation_key"])
df2 = pd.DataFrame(esn_inter_intra_merge_df, columns=["st_umts_esn_inter_intra_relation_key"])

merged = df1.merge(df2 , how="left", indicator=True)
result = merged.query("_merge == 'left_only'")[["st_umts_df_relation_key"]]

print(result)

But I get an error:

Traceback (most recent call last):
  File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 18, in <module>
    merged = df1.merge(df2 , how="left", indicator=True)
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
    return merge(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
    op = _MergeOperation(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
    self._validate_specification()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
    raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

Edit

I also tried the following code:

df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']

But I get this error:

Traceback (most recent call last):
  File "C:/Users/haroo501/PycharmProjects/tool_check_nbr/my_missing_result.py", line 23, in <module>
    df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\frame.py", line 7336, in merge
    return merge(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 68, in merge
    op = _MergeOperation(
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 619, in __init__
    self._validate_specification()
  File "C:\Users\haroo501\PycharmProjects\tool_check_nbr\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1183, in _validate_specification
    raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

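To explain my database briefly, I have two tables.

The first table is esn_datafeed.

The second table is esn_inter_intra_merge, with the column:

st_umts_esn_inter_intra_relation_key

Now I want to compute the difference between the two tables: I need the values that are in esn_datafeed.st_umts_df_relation_key but not in esn_inter_intra_merge.st_umts_esn_inter_intra_relation_key.

Does anyone know how to solve this? Could it be because the database holds a large amount of data?

Is there a way to handle this in the query itself?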


Tags: pandas, df, index, merge, users, file, pd, df1
1 answer
User
#1 · Posted 2024-09-27 22:21:58

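I think the problem is that your new DataFrames use different column names, so merge finds no common column to join on. That said, it sounds like you should be using sets anyway. Here is how to get the symmetric difference between the values in the two columns: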

>>> missing_values = set(df1.iloc[:, 0]).symmetric_difference(set(df2.iloc[:, 0]))
>>> missing_values
{4, 5, 6}
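One caveat about the set approach: a symmetric difference contains values that appear in exactly one of the two columns, whichever side they come from. If only the values present on the left but absent on the right are wanted, a plain set difference may be closer to the goal. A minimal sketch with toy lists (the extra value 7 is a hypothetical addition, not part of the question's data):

```python
# Toy values mirroring df1 and df2 from the question.
left_values = [1, 2, 3, 4, 5, 6]
right_values = [1, 2, 3, 7]  # 7 is a hypothetical value present only on the right

# Symmetric difference: values found in exactly one of the two collections.
either_side = set(left_values).symmetric_difference(right_values)

# Plain difference: values found on the left but not on the right.
left_only = set(left_values).difference(right_values)

print(sorted(either_side))  # [4, 5, 6, 7]
print(sorted(left_only))    # [4, 5, 6]
```

With the question's sample data the two results coincide, because df2 holds no value that df1 lacks.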

Then you can check which DataFrame values are among these missing values:

>>> df1[df1.iloc[:, 0].isin(missing_values)]
   my_column
3          4
4          5
5          6
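Since the two key columns have different names, the comparison can also be written without sets. The sketch below uses toy stand-ins for the two SQL tables (the sample values are assumptions) and shows two options: a boolean mask built with `Series.isin`, and a left merge that names the key columns explicitly through `left_on` and `right_on`, which should avoid the "No common columns" error.

```python
import pandas as pd

# Toy stand-ins for the two SQL tables; real data would come from pd.read_sql.
df1 = pd.DataFrame({"st_umts_df_relation_key": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"st_umts_esn_inter_intra_relation_key": ["a", "b"]})

left_key = "st_umts_df_relation_key"
right_key = "st_umts_esn_inter_intra_relation_key"

# Option 1: boolean mask with Series.isin, no merge needed.
missing_isin = df1[~df1[left_key].isin(df2[right_key])]

# Option 2: left merge with explicit key columns, so the differing
# column names no longer matter; indicator=True adds the "_merge" column.
merged = df1.merge(
    df2, how="left", left_on=left_key, right_on=right_key, indicator=True
)
missing_merge = merged.loc[merged["_merge"] == "left_only", [left_key]]

print(missing_isin[left_key].tolist())   # ['c', 'd']
print(missing_merge[left_key].tolist())  # ['c', 'd']
```

The `isin` variant keeps the original index of df1 and skips the join entirely, which may matter for large tables.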

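Edit

On further thought, isn't this an SQL question that has nothing to do with pandas?

Would this work? This SQL query selects all records from t1 (esn_datafeed) whose st_umts_df_relation_key has no matching value in the st_umts_esn_inter_intra_relation_key column of t2 (esn_inter_intra_merge):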

SELECT * 
FROM esn_datafeed AS t1
LEFT JOIN esn_inter_intra_merge AS t2
ON t1.st_umts_df_relation_key = t2.st_umts_esn_inter_intra_relation_key
WHERE t2.st_umts_esn_inter_intra_relation_key IS NULL
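To try the anti-join pattern without access to the real database, it can be run against an in-memory SQLite database. This is only a sketch: the table contents are made-up sample rows, SQLite stands in for SQL Server, and the query selects just the key column and adds an ORDER BY so the printed result is deterministic.

```python
import sqlite3

# In-memory SQLite database with made-up sample rows (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE esn_datafeed (st_umts_df_relation_key TEXT);
    CREATE TABLE esn_inter_intra_merge (st_umts_esn_inter_intra_relation_key TEXT);
    INSERT INTO esn_datafeed VALUES ('a'), ('b'), ('c');
    INSERT INTO esn_inter_intra_merge VALUES ('a');
""")

# LEFT JOIN keeps every row of t1; rows without a match get NULL in the
# t2 columns, and the WHERE clause keeps only those unmatched rows.
anti_join = """
    SELECT t1.st_umts_df_relation_key
    FROM esn_datafeed AS t1
    LEFT JOIN esn_inter_intra_merge AS t2
      ON t1.st_umts_df_relation_key = t2.st_umts_esn_inter_intra_relation_key
    WHERE t2.st_umts_esn_inter_intra_relation_key IS NULL
    ORDER BY t1.st_umts_df_relation_key
"""
missing_keys = [row[0] for row in conn.execute(anti_join)]
conn.close()

print(missing_keys)  # ['b', 'c']
```

Against the SQL Server database from the question, the same kind of query string could presumably be passed to pd.read_sql together with the existing engine, so that only the missing rows are transferred.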
