How can I make my Postgres query run faster? Can I use Python to iterate faster?

Posted 2024-10-01 00:18:33


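This is a two-part question. If you are looking at it, thank you for your time!

  1. Is there a way to make my query faster?

    I asked a related question earlier and eventually solved it myself.

    However, the query I designed to produce the desired result is very slow (more than 25 minutes) when run against a database of over 40,000 records.

    The query does what it is meant to do, but I am hoping someone can tell me how to make it run at a more reasonable speed.

    My query: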

    with dupe as (
        select
             json_document->'Firstname'->0->'Content' as first_name,
             json_document->'Lastname'->0->'Content' as last_name,
             identifiers->'RecordID' as record_id
        from (
            select *,  
                   jsonb_array_elements(json_document->'Identifiers') as identifiers
            from staging
        ) sub
        group by record_id, json_document
        order by last_name
    ) 
    
    select * from dupe da where (
      select count(*) from dupe db 
      where db.record_id = da.record_id
    ) > 1;
    

    And some sample data:

    Row 1:

    {
            "Firstname": "Bobb",
            "Lastname": "Smith",
            "Identifiers": [
                {
                    "Content": "123",
                    "RecordID": "123",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-12T02:23:30.817Z"
                },
                {
                    "Content": "abc",
                    "RecordID": "abc",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-13T10:10:21.598Z"
                },
                {
                    "Content": "def",
                    "RecordID": "def",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-13T10:10:21.598Z"
                }
            ]
    }
    

    Row 2:

    {
            "Firstname": "Bob",
            "Lastname": "Smith",
            "Identifiers": [
                {
                    "Content": "abc",
                    "RecordID": "abc",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-13T10:10:26.020Z"
                }
            ]
    }
    
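  2. If I pull the results, or partial results, of the query into a Python environment where I can work on them with Pandas, how do I iterate over the results of the query (or subquery) to get the same final result as the original query?

    Is there a simpler way, in Python, to iterate over my unnested json array the way Postgres does?

    For example, after running this query: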

    select
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
           select *,  
                  jsonb_array_elements(json_document->'Identifiers') as identifiers
           from staging
         ) sub
    order by last_name;
    

    how can I take its results in Python/Pandas and do something like the following:

    da = datasets[query_results]  # to equal my dupe da query
    db = datasets[query_results]  # to equal my dupe db query
    

    and then run

    select * from dupe da where (
        select count(*) from dupe db 
        where db.record_id = da.record_id
    ) > 1;
    

    in Python?

I apologize if I have not provided enough information here. I am new to Python. Any help is much appreciated. Thank you!


2 Answers

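Consider reading the raw, unqueried values of the Postgres json column and using pandas json_normalize() to flatten them into a data frame. From there, use pandas drop_duplicates().

To demonstrate, the code below parses one json document into a three-row data frame, one row per identifier record: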

import json
import pandas as pd

json_str = '''
{
        "Firstname": "Bobb",
        "Lastname": "Smith",
        "Identifiers": [
            {
                "Content": "123",
                "RecordID": "123",
                "SystemID": "Test",
                "LastUpdated": "2017-09-12T02:23:30.817Z"
            },
            {
                "Content": "abc",
                "RecordID": "abc",
                "SystemID": "Test",
                "LastUpdated": "2017-09-13T10:10:21.598Z"
            },
            {
                "Content": "def",
                "RecordID": "def",
                "SystemID": "Test",
                "LastUpdated": "2017-09-13T10:10:21.598Z"
            }
        ]
}
'''

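data = json.loads(json_str)
# pd.json_normalize is the current name; pd.io.json.json_normalize has been
# deprecated since pandas 1.0. Column order in the printed frame can differ
# between pandas versions.
df = pd.json_normalize(data, 'Identifiers', ['Firstname','Lastname'])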

print(df)    
#   Content               LastUpdated RecordID SystemID Lastname Firstname
# 0     123  2017-09-12T02:23:30.817Z      123     Test    Smith      Bobb
# 1     abc  2017-09-13T10:10:21.598Z      abc     Test    Smith      Bobb
# 2     def  2017-09-13T10:10:21.598Z      def     Test    Smith      Bobb

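For your database, consider connecting through a DB-API driver such as psycopg2, or through SQLAlchemy, and parsing each json value as a string. Admittedly, there may be other ways to handle json, as shown in the psycopg2 docs, but the code below receives the data as text and parses it on the Python side: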

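import json

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# Cast to text so each document arrives as a JSON string to parse in Python
cur.execute("SELECT json_document::text FROM staging;")

# Flatten to one row per entry in Identifiers, keeping the name fields
df = pd.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
                       'Identifiers', ['Firstname','Lastname'])

# Keep only the first row seen for each RecordID
df = df.drop_duplicates(['RecordID'])

cur.close()
conn.close()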
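
Note that drop_duplicates removes the repeated rows, whereas the query in the question lists them. A minimal sketch of the listing variant follows, assuming the documents have already been loaded into Python; the docs list here is illustrative sample data trimmed from the question, not real query output:

```python
import pandas as pd

# Illustrative documents shaped like the two sample rows in the question.
docs = [
    {"Firstname": "Bobb", "Lastname": "Smith",
     "Identifiers": [{"RecordID": "123"}, {"RecordID": "abc"}, {"RecordID": "def"}]},
    {"Firstname": "Bob", "Lastname": "Smith",
     "Identifiers": [{"RecordID": "abc"}]},
]

# One row per identifier, carrying the name fields along as metadata.
df = pd.json_normalize(docs, "Identifiers", ["Firstname", "Lastname"])

# keep=False flags every member of a repeated RecordID group,
# which plays the role of the "count(*) > 1" filter.
dupes = df[df.duplicated("RecordID", keep=False)].sort_values("Lastname")
print(dupes)
```

Here duplicated() works in a single pass over the column, so it avoids re-scanning the whole result for every row the way the correlated count(*) subquery does.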

Try the following, which eliminates count(*) and uses exists instead.

 with dupe as ( 
   select id, 
     json_document->'Firstname'->0->'Content' as first_name, 
     json_document->'Lastname'->0->'Content' as last_name, 
     identifiers->'RecordID' as record_id 
   from 
     (select 
       *, 
       jsonb_array_elements(json_document->'Identifiers') as identifiers 
      from staging ) sub 
      group by
        id,
        record_id, 
        json_document 
      order by last_name ) 
 select * from dupe da 
   where exists 
     (select * 
       from dupe db 
       where 
         db.record_id = da.record_id 
         and db.id != da.id
     )
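
For the Python half of the question, the same exists condition can probably be mirrored in pandas. This is only a sketch under assumptions: the flat frame and its column names (id, record_id, last_name) are hypothetical stand-ins for the flattened query result, with id being the staging row key.

```python
import pandas as pd

# Hypothetical flattened rows: "id" is the staging row, "record_id" one identifier.
flat = pd.DataFrame({
    "id":        [1, 1, 1, 2, 3],
    "record_id": ["123", "abc", "def", "abc", "xyz"],
    "last_name": ["Smith", "Smith", "Smith", "Smith", "Jones"],
})

# A record_id qualifies when it occurs under more than one distinct staging id,
# which is what "exists ... db.id != da.id" checks.
distinct_ids = flat.groupby("record_id")["id"].transform("nunique")
dupes = flat[distinct_ids > 1]
print(dupes)
```

Unlike a plain count(*) > 1, this version ignores a record_id that repeats only within a single staging row, which matches the db.id != da.id condition.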
