Importing 11 million rows from PostgreSQL into pandas/Python


I am trying to load 11 million records from a PostgreSQL database hosted on an AWS server. I tried pandas read_sql, and it takes about 4 hours to get the result. My laptop has 32 GB of RAM and a 7th-generation Core i7. I also set the chunk size to 10000, but that did not reduce the absurd runtime. I have read many articles online and tried their suggestions, but none of them sped up my process. If possible, I would like to load this data in 20 minutes, or in as little time as feasible. I need the data in a DataFrame so I can merge it with other files I have; if I can fetch the data in Python, I can automate my process. My code is shown below:

import psycopg2
import pandas as pd


connection = psycopg2.connect(user="abc",
                              password="efg",
                              host="123.amazonaws.com",
                              port="5432",
                              database="db")

date='2020-03-01'
columns= '"LastName","FirstName","DateOfBirth","PatientGender","Key"'

# Note the space before "limit": without it the concatenated SQL is invalid.
postgreSQL_select_Query = ('select ' + columns + ' from "Table" '
                           'where "CreatedDate"::date >= ' + "'" + date + "'" +
                           ' limit 11000000')


# NOTE: with chunksize set, read_sql_query returns an iterator of DataFrames,
# not a single DataFrame, so the chunks have to be concatenated explicitly.
chunks = pd.read_sql_query(postgreSQL_select_Query, connection, index_col=None,
                           coerce_float=True, chunksize=10000)
x = pd.concat(chunks, ignore_index=True)

Please suggest what I can do to improve this code and reduce the runtime.
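For reference, the technique most often recommended for bulk extraction with psycopg2 is to stream the result set with COPY ... TO STDOUT rather than fetching rows through a cursor. A minimal sketch, assuming the same table, columns, and (placeholder) credentials as the snippet above:

from io import StringIO
import pandas as pd
import psycopg2

connection = psycopg2.connect(user="abc", password="efg",
                              host="123.amazonaws.com", port="5432",
                              database="db")

# COPY streams the whole result set as CSV in a single round trip,
# avoiding the per-row conversion overhead of read_sql/fetchall.
copy_sql = '''COPY (select "LastName","FirstName","DateOfBirth",
                           "PatientGender","Key"
                    from "Table"
                    where "CreatedDate"::date >= '2020-03-01')
              TO STDOUT WITH CSV HEADER'''

buffer = StringIO()
with connection.cursor() as cur:
    cur.copy_expert(copy_sql, buffer)
buffer.seek(0)
x = pd.read_csv(buffer)

Parsing one CSV stream is usually much faster than building 11 million Python tuples row by row, since all type conversion happens inside pd.read_csv.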

I am also attaching another snippet that I used for the same task, but the result is the same: it takes hours to fetch the rows. Any guidance would be much appreciated.

Second approach:

# -*- coding: utf-8 -*-
"""
@author: ssullah
"""
import psycopg2
import pandas as pd
import time

start = time.time()
print("Started")

# Retrieving records from the DB
def getdata():  
    try:
        start = time.time()
        print("Started")
        connection = psycopg2.connect(user="a"
                                      password="as",
                                      host="aws",
                                      port="5432",
                                      database="as")



        date='2020-03-01'
        columns= '"LastName","FirstName","DateOfBirth","PatientGender","Key"'

        # Note the space before "limit": without it the concatenated SQL is invalid.
        postgreSQL_select_Query = ('select ' + columns + ' from "ALLADTS" '
                                   'where "CreatedDate"::date >= ' + "'" + date + "'" +
                                   ' limit 11000000')

        cur = connection.cursor('cursor-name')  # server-side (named) cursor
        cur.itersize = 10000  # how many records to fetch per network round trip
        cur.execute(postgreSQL_select_Query)

        # fetchall() still materializes all rows in one Python list,
        # which negates the benefit of the server-side cursor.
        mobile_records = cur.fetchall()


        # Column names as per the schema, defined above
        col_names = ["LastName","FirstName","DateOfBirth","PatientGender","Key"]

        # Create the DataFrame; col_names must be passed as the columns
        # keyword, otherwise it is interpreted as the index.
        records = pd.DataFrame(mobile_records, columns=col_names)

        return records


    except (Exception, psycopg2.Error) as error:
        print("Error while fetching data from PostgreSQL", error)

    finally:
        # Closing the database connection.
        if connection:
            cur.close()
            connection.close()
            print("PostgreSQL connection is closed")


records = getdata()
end = time.time()
print("The total time:", (end - start)/60, 'minutes')

1 Answer

Update:

Instead of loading the data with Python, I decided to create a temporary table in PostgreSQL from Python and load the new file from pandas into PostgreSQL. Once the table was populated via a query issued from Python, I could run the query against it, get the desired output, and return the final result as a pandas DataFrame.

All of this took 1.4 minutes, whereas the same query took 30 minutes to run in pgAdmin. So by leveraging Python, and doing the calculation with SQL queries issued from Python, I was able to speed up the process dramatically while never having to handle 11 million records in memory. Thank you for your suggestions.
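A minimal sketch of that workflow, for concreteness. The local file name other_data.csv, the temp-table schema, and the join on "Key" are assumptions for illustration; the answer does not show the actual query:

from io import StringIO
import pandas as pd
import psycopg2

connection = psycopg2.connect(user="abc", password="efg",
                              host="123.amazonaws.com", port="5432",
                              database="db")
cur = connection.cursor()

# 1. Create a temporary table and bulk-load the local pandas data via COPY.
local_df = pd.read_csv("other_data.csv")  # hypothetical local file to merge
cur.execute('CREATE TEMP TABLE "LocalData" ("Key" text)')
buf = StringIO()
local_df[["Key"]].to_csv(buf, index=False, header=False)
buf.seek(0)
cur.copy_expert('COPY "LocalData" FROM STDIN WITH CSV', buf)

# 2. Do the merge inside PostgreSQL, so only the (much smaller)
#    final result ever crosses the network into pandas.
final = pd.read_sql_query(
    '''select t."LastName", t."FirstName", t."DateOfBirth",
              t."PatientGender", t."Key"
       from "ALLADTS" t
       join "LocalData" l on l."Key" = t."Key"
       where t."CreatedDate"::date >= '2020-03-01' ''',
    connection)

The temp table lives only for the duration of the session, so no cleanup is needed as long as the same connection is used for both steps.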
