Buying RAM to avoid chunking 30-50 GB+ files

Posted 2024-10-02 04:32:32


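I use pandas to read very large CSV files, which are also gzipped. Uncompressed, each CSV is about 30-50 GB. I read the files in chunks, process/manipulate them, and finally append the relevant data to a compressed HDF5 file.

It works fine but it is slow, because I have to handle one file per day and there are several years of data (600 TB of uncompressed CSV).

Would buying more RAM (say 64 GB/128 GB) be a good way to avoid chunking and speed up the process? Or would that make pandas slow and unwieldy? Am I right in saying that switching to C++ would speed things up, but I would still suffer from the read step and still have to process the data in chunks? Finally, does anyone have thoughts on the best way to handle this?

By the way, once the job is done I will not have to go back and process the data again, so I just want it to run in a reasonable amount of time. Writing something that processes the files in parallel might be nice, but given my limited experience in that area it would take me a while to build, so I would rather not unless it is the only option.

Update: I think it will be easier to see the code. I don't believe the code is particularly slow in any one place; I think the bottleneck may be the technique/approach itself.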

import datetime
import gzip
import os

import pandas as pd


def txttohdf(path, contract):
    #create dataframes for trade and quote
    dftrade = pd.DataFrame(columns = ["datetime", "Price", "Volume"])
    dfquote = pd.DataFrame(columns = ["datetime", "BidPrice", "BidSize","AskPrice", "AskSize"])
    #create an hdf5 file with high compression and table so we can append
    hdf = pd.HDFStore(path + contract + '.h5', complevel=9, complib='blosc')
    hdf.put('trade', dftrade, format='table', data_columns=True)
    hdf.put('quote', dfquote, format='table', data_columns=True)
    #date1 = date(start).strftime('%Y%m%d')
    #date2 = date(end).strftime('%Y%m%d')
    #dd = [date1 + timedelta(days=x) for x in range((date2-date1).days + 1)]
    #walkthrough directories
    for subdir, dir, files in os.walk(path):
        for file in files:
            #check if contract has name
            #print(file)
                #create filename from directory and file 

            filename = os.path.join(subdir, file)
                #read in csv
            if filename.endswith('.gz'):

                df = pd.read_csv(gzip.open(filename),header=0,iterator=True,chunksize = 10000, low_memory =False,  names = ['RIC','Date','Time','GMTOffset','Type','ExCntrbID','LOC','Price','Volume','MarketVWAP','BuyerID','BidPrice','BidSize','NoBuyers','SellerID','AskPrice','AskSize','NoSellers','Qualifiers','SeqNo','ExchTime','BlockTrd','FloorTrd','PERatio','Yield','NewPrice','NewVol','NewSeqNo','BidYld','AskYld','ISMABidYld','ISMAAskYld','Duration','ModDurtn','BPV','AccInt','Convexity','BenchSpd','SwpSpd','AsstSwpSpd','SwapPoint','BasePrice','UpLimPrice','LoLimPrice','TheoPrice','StockPrice','ConvParity','Premium','BidImpVol','AskImpVol','ImpVol','PrimAct','SecAct','GenVal1','GenVal2','GenVal3','GenVal4','GenVal5','Crack','Top','FreightPr','1MnPft','3MnPft','PrYrPft','1YrPft','3YrPft','5YrPft','10YrPft','Repurch','Offer','Kest','CapGain','Actual','Prior','Revised','Forecast','FrcstHigh','FrcstLow','NoFrcts','TrdQteDate','QuoteTime','BidTic','TickDir','DivCode','AdjClose','PrcTTEFlag','IrgTTEFlag','PrcSubMktId','IrgSubMktId','FinStatus','DivExDate','DivPayDate','DivAmt','Open','High','Low','Last','OpenYld','HighYld','LowYld','ShortPrice','ShortVol','ShortTrdVol','ShortTurnnover','ShortWeighting','ShortLimit','AccVolume','Turnover','ImputedCls','ChangeType','OldValue','NewValue','Volatility','Strike','Premium','AucPrice','Auc Vol','MidPrice','FinEvalPrice','ProvEvalPrice','AdvancingIssues','DecliningIssues','UnchangedIssues','TotalIssues','AdvancingVolume','DecliningVolume','UnchangedVolume','TotalVolume','NewHighs','NewLows','TotalMoves','PercentageChange','AdvancingMoves','DecliningMoves','UnchangedMoves','StrongMarket','WeakMarket','ChangedMarket','MarketVolatility','OriginalDate','LoanAskVolume','LoanAskAmountTradingPrice','PercentageShortVolumeTradedVolume','PercentageShortPriceTradedPrice','ForecastNAV','PreviousDaysNAV','FinalNAV','30DayATMIVCall','60DayATMIVCall','90DayATMIVCall','30DayATMIVPut','60DayATMIVPut','90DayATMIVPut','BackgroundReference','DataSource','BidSpread','AskSpread','ContractPhysicalUnits','Miniumumquantity','NumberPhysicals','ClosingReferencePrice','ImbalanceQuantity','FarClearingPrice','NearClearingPrice','OptionAdjustedSpread','ZSpread','ConvexityPremium','ConvexityRatio','PercentageDailyReturn','InterpolatedCDSBasis','InterpolatedCDSSpread','ClosesttoMaturityCDSBasis','SettlementDate','EquityPrice','Parity','CreditSpread','Delta','InputVolatility','ImpliedVolatility','FairPrice','BondFloor','Edge','YTW','YTB','SimpleMargin','DiscountMargin','12MonthsEPS','UpperTradingLimit','LowerTradingLimit','AmountOutstanding','IssuePrice','GSpread','MiscValue','MiscValueDescription'])
                #parse date time this is quicker than doing it while we read it in
                for chunk in df:
                    chunk['datetime'] = chunk.apply(lambda row: datetime.datetime.strptime(row['Date']+ ':' + row['Time'],'%d-%b-%Y:%H:%M:%S.%f'), axis=1)
                    #df = df[~df.comment.str.contains('ALIAS')]
                #drop uneeded columns inc date and time
                    chunk = chunk.drop(['Date','Time','GMTOffset','ExCntrbID','LOC','MarketVWAP','BuyerID','NoBuyers','SellerID','NoSellers','Qualifiers','SeqNo','ExchTime','BlockTrd','FloorTrd','PERatio','Yield','NewPrice','NewVol','NewSeqNo','BidYld','AskYld','ISMABidYld','ISMAAskYld','Duration','ModDurtn','BPV','AccInt','Convexity','BenchSpd','SwpSpd','AsstSwpSpd','SwapPoint','BasePrice','UpLimPrice','LoLimPrice','TheoPrice','StockPrice','ConvParity','Premium','BidImpVol','AskImpVol','ImpVol','PrimAct','SecAct','GenVal1','GenVal2','GenVal3','GenVal4','GenVal5','Crack','Top','FreightPr','1MnPft','3MnPft','PrYrPft','1YrPft','3YrPft','5YrPft','10YrPft','Repurch','Offer','Kest','CapGain','Actual','Prior','Revised','Forecast','FrcstHigh','FrcstLow','NoFrcts','TrdQteDate','QuoteTime','BidTic','TickDir','DivCode','AdjClose','PrcTTEFlag','IrgTTEFlag','PrcSubMktId','IrgSubMktId','FinStatus','DivExDate','DivPayDate','DivAmt','Open','High','Low','Last','OpenYld','HighYld','LowYld','ShortPrice','ShortVol','ShortTrdVol','ShortTurnnover','ShortWeighting','ShortLimit','AccVolume','Turnover','ImputedCls','ChangeType','OldValue','NewValue','Volatility','Strike','Premium','AucPrice','Auc Vol','MidPrice','FinEvalPrice','ProvEvalPrice','AdvancingIssues','DecliningIssues','UnchangedIssues','TotalIssues','AdvancingVolume','DecliningVolume','UnchangedVolume','TotalVolume','NewHighs','NewLows','TotalMoves','PercentageChange','AdvancingMoves','DecliningMoves','UnchangedMoves','StrongMarket','WeakMarket','ChangedMarket','MarketVolatility','OriginalDate','LoanAskVolume','LoanAskAmountTradingPrice','PercentageShortVolumeTradedVolume','PercentageShortPriceTradedPrice','ForecastNAV','PreviousDaysNAV','FinalNAV','30DayATMIVCall','60DayATMIVCall','90DayATMIVCall','30DayATMIVPut','60DayATMIVPut','90DayATMIVPut','BackgroundReference','DataSource','BidSpread','AskSpread','ContractPhysicalUnits','Miniumumquantity','NumberPhysicals','ClosingReferencePrice','ImbalanceQuantity','FarClearingPrice','NearClearingPrice','OptionAdjustedSpread','ZSpread','ConvexityPremium','ConvexityRatio','PercentageDailyReturn','InterpolatedCDSBasis','InterpolatedCDSSpread','ClosesttoMaturityCDSBasis','SettlementDate','EquityPrice','Parity','CreditSpread','Delta','InputVolatility','ImpliedVolatility','FairPrice','BondFloor','Edge','YTW','YTB','SimpleMargin','DiscountMargin','12MonthsEPS','UpperTradingLimit','LowerTradingLimit','AmountOutstanding','IssuePrice','GSpread','MiscValue','MiscValueDescription'], axis=1)
                # convert to datetime explicitly and add nanoseconds to same time stamps
                    chunk['datetime'] = pd.to_datetime(chunk.datetime)
                #nanoseconds = df.groupby(['datetime']).cumcount()
                #df['datetime'] += np.array(nanoseconds, dtype='m8[ns]')  
                # drop empty prints and make sure all prices are valid
                    dfRic = chunk[(chunk["RIC"] == contract)]
                    if len(dfRic)>0:
                        print(dfRic)
                    # `~chunk.empty` is always truthy (~True == -2, ~False == -1), so use `not`
                    if not chunk.empty:
                        dft = dfRic[(dfRic["Type"] == "Trade")]
                        # reassign instead of inplace=True to avoid modifying a slice
                        dft = dft.dropna(subset=["Volume"])
                        dft = dft.drop(["RIC","Type","BidPrice", "BidSize", "AskPrice", "AskSize"], axis=1)
                        dft = dft[(dft["Price"] > 0)]

                        # clean up bid and ask
                        dfq = dfRic[(dfRic["Type"] == "Quote")]
                        dfq = dfq.dropna(how='all', subset=["BidSize","AskSize"])
                        dfq = dfq.drop(["RIC","Type","Price", "Volume"], axis=1)
                        dfq = dfq[(dfq["BidSize"] > 0) | (dfq["AskSize"] > 0)]
                        dfq = dfq.ffill()

                        # append inside this branch so dft/dfq are always defined
                        hdf.append('trade', dft, format='table', data_columns=True)
                        hdf.append('quote', dfq, format='table', data_columns=True)
                    else:
                        print("Empty")
    # close the store once every file has been processed
    hdf.close()

1 Answer
User
#1 · Posted 2024-10-02 04:32:32

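I think there are quite a few things you can optimize:

  • First of all, read only the columns you really need instead of reading everything and then dropping columns: use the usecols=list_of_needed_columns parameter

  • Increase your chunksize. Try different values; I would start from 10**5

  • Don't use chunk.apply(...) to convert the datetime, it is very slow; use pd.to_datetime(column, format='...') instead

  • You can filter your data more efficiently by combining multiple conditions in one step instead of filtering step by step.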
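The four suggestions above could be combined roughly as follows. This is only a sketch under assumptions, not the asker's or the answerer's actual code: the function name read_trades_and_quotes, the NEEDED column list, and the tiny in-memory SAMPLE are all made up for illustration, and the sketch assumes the CSV header row already carries the column names used in the question. For real data, a path ending in .gz can be passed to pd.read_csv directly, since compression is inferred from the extension by default.

```python
import io

import pandas as pd

# Hypothetical subset of the question's columns; adjust to the real file.
NEEDED = ["RIC", "Date", "Time", "Type", "Price", "Volume",
          "BidPrice", "BidSize", "AskPrice", "AskSize"]


def read_trades_and_quotes(source, contract, chunksize=10**5):
    """Yield one (trades, quotes) pair of DataFrames per chunk."""
    # usecols: parse only the needed columns instead of dropping them later.
    reader = pd.read_csv(source, usecols=NEEDED, chunksize=chunksize)
    for chunk in reader:
        # Vectorized parsing with an explicit format replaces chunk.apply(...).
        chunk["datetime"] = pd.to_datetime(
            chunk["Date"] + ":" + chunk["Time"],
            format="%d-%b-%Y:%H:%M:%S.%f",
        )

        # Combine all conditions into a single boolean mask per output table.
        is_contract = chunk["RIC"] == contract
        trade_mask = (
            is_contract
            & (chunk["Type"] == "Trade")
            & chunk["Volume"].notna()
            & (chunk["Price"] > 0)
        )
        quote_mask = (
            is_contract
            & (chunk["Type"] == "Quote")
            & ((chunk["BidSize"] > 0) | (chunk["AskSize"] > 0))
        )

        trades = chunk.loc[trade_mask, ["datetime", "Price", "Volume"]]
        quotes = chunk.loc[
            quote_mask,
            ["datetime", "BidPrice", "BidSize", "AskPrice", "AskSize"],
        ].ffill()
        yield trades, quotes


# Made-up sample data standing in for one of the large gzipped files.
SAMPLE = """RIC,Date,Time,GMTOffset,Type,Price,Volume,BidPrice,BidSize,AskPrice,AskSize
ABC,02-Jan-2014,09:30:00.100,0,Trade,10.5,100,,,,
ABC,02-Jan-2014,09:30:00.200,0,Quote,,,10.4,5,10.6,7
XYZ,02-Jan-2014,09:30:00.300,0,Trade,20.0,50,,,,
ABC,02-Jan-2014,09:30:00.400,0,Trade,0,10,,,,
"""

# A chunksize of 2 is used here only so the tiny sample spans two chunks.
parts = list(read_trades_and_quotes(io.StringIO(SAMPLE), "ABC", chunksize=2))
for trades, quotes in parts:
    print(len(trades), len(quotes))
```

Each yielded pair could then be passed to hdf.append('trade', ...) and hdf.append('quote', ...) as in the question. The best chunksize depends on available memory and row width, so it is worth timing a few values rather than taking 10**5 as given.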
