如何使用循环在SQLite中创建真正大的表的副本

2024-09-27 23:24:42 发布

您现在位置:Python中文网/ 问答频道 /正文

我需要使用python在SQLite中创建crsp.dsf的副本 crsp.dsf是一个非常非常大的表,建议我执行流式读取或编写循环,按日期或permno在crsp.dsf上迭代。最后,我应该验证我有相同数量的观察结果、唯一的日期、唯一的公司和(大致)总和(四舍五入(ret,5))

我的代码如下:

# import packages
import pandas as pd
import numpy as np
import mysql.connector
from dplython import (DplyFrame, X, diamonds, select, sift,
  sample_n, sample_frac, head, arrange, mutate, group_by,
  summarize, DelayFunction)


# def functions
def connect(db="crsp"):
    conn = mysql.connector.connect(host="178.128.92.75", 
                           port=3306,user="student_temp_user",
                           passwd="temp_userYxL",
                           db=db)
    return conn

def queryMySQL(x,db='crsp'):
    conn=connect(db=db)
    response = pd.read_sql(x, conn)
    conn.close()
    return response


# write a loop that iterates on crsp.dsf by permno
permnos = queryMySQL('select distinct permno from crsp.msf').loc[:,'permno']
permnos.unique()

i=0
for item in permnos.unique():
    result=queryMySQL('select * ' +
                      'from crsp.msf where permno = ' + str(item) + ' limit 5')
    i=i+1

    if i==1:
        msf=pd.DataFrame(result)
    if(i>1):
        msf = msf.append(pd.DataFrame(result))
    if i==83436:
        break # take long time

msf 


# write my own SQL table
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://student_temp_user:temp_userYxL@178.128.92.75/student_temp_user")
msf.to_sql(name='msf_copy',con=engine, if_exists='append', index=False)


# verification
print(queryMySQL('select count(*), count(distinct date), count(distinct permno), sum(round(ret,5)) from student_temp_user.msf_copy')
print(queryMySQL('select count(*), count(distinct date), count(distinct permno), sum(round(ret,5)) from crsp.msf')

在循环中,我超过了8万,我想这就是为什么花了这么长时间。如何改进代码?也许我的代码有问题


Tags: fromimportdbcountconnselecttemppd

热门问题