Python Pandas查找另一行+计算字段(矢量化方式!)

2024-10-02 14:27:39 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个数据框“dfSummary”-

exchangeBalances = [['ETHBTC','binance',10], ['LTCBTC','binance',10], ['XRPBTC','binance',10], ['ETHBTC','bitfinex',10], ['LTCBTC','bitfinex',10], ['XRPBTC','bitfinex',10]]
bidOffers = [
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 15)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 15)],
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 30)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 30)], 
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 45)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 45)]
             ]
dfExchangeBalances = pd.DataFrame(exchangeBalances, columns=['symbol','exchange','balance'])
dfBidOffers = pd.DataFrame(bidOffers, columns=['symbol','exchange','bid', 'offer', 'created'])
dfBidOffers["spread"] = dfBidOffers["bid"] - dfBidOffers["offer"]
dfSummary = dfExchangeBalances.merge(dfBidOffers, how='left', on=['symbol','exchange'])

我需要完成的是,在“dfSummary”中添加一个计算字段

^{pr2}$

“someOtherRow”是基于“created”的查找(例如,最后一行具有相同的{symbol,exchange},但在30分钟前“created”(与“currentRow”相比)

澄清:上面的例子是手头实际问题的简化。间隔时间不是15分钟。事实上,我需要在DataFrame中查找相应的记录(同一个key={symbol,exchange}),但是第一个这样的记录是在第一个月、第一个季度和第一年创建的。在

我尽量避免手动循环数据帧.iter使用Pandas内置的查找(矢量化)

我在想DataFrame。查找Vectorized look-up of values in Pandas dataframe 但不知道如何从计算字段的上下文中使用它。。。?另外,我希望对相同的数据帧进行查找,而不是针对不同的数据帧进行查找

提前谢谢!在

矢量化(熊猫和Numpy-vs循环):
https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
https://www.datascience.com/blog/straightening-loops-how-to-vectorize-data-aggregation-with-pandas-and-numpy/https://realpython.com/numpy-array-programming/


Tags: 数据httpscomdataframedatetimeexchangebinancesymbol
3条回答

我明白了,这是我的真实代码(所以我不会把所有东西都贴出来)。这将起作用(但不确定是否以最快的方式实现)。在

我在用数据帧应用。这不是矢量化的方法,但应该比python中的循环快得多。有人能不能请你解释一下如何用完全矢量化的方式重写下面的内容?在

引用本文-https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

。。。我无法用矢量化的方式重写,而且考虑到查找的性质,我开始觉得下面的内容不可能是矢量化的(如果你们中有人能证明我错了,我很高兴):

pdPnl = pd.DataFrame.from_records([ObjectUtil.objectPropertiesToDictionary(pnl) for pnl in profitLosses], columns=ObjectUtil.objectPropertiesToDictionary(profitLosses[0]).keys())
pdPnl["TM1"] = pdPnl.apply(lambda rw : rw["COB"] - timedelta(days=1) , axis=1)
pdPnl["MonthStart"] = pdPnl.apply(lambda rw : rw["COB"].replace(day=1), axis=1)
pdPnl["QuarterStart"] = pdPnl.apply(lambda rw : DateTimeUtil.getQuarterStart(rw["COB"], rw["COB"].year), axis=1)
pdPnl["YearStart"] = pdPnl.apply(lambda rw : datetime(rw["COB"].year, 1, 1), axis=1)
pdPnl["DTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["DTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["MTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["MTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["YTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["YTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)

pdPnl["SharpeRatio"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeSharpeRatio(pdPnl, rw["COB"]), axis=1)
pdPnl["MaxDrawDown"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeMaxDrawDown(pdPnl, rw["COB"]), axis=1)

pnlDict = pdPnl.to_dict()  # Then convert back to List of ProfitLoss (Slow...)

查找函数包括:

^{pr2}$

这是假设created具有恒定的15分钟间隔。您可以groupby符号和交换,并下移2(2个时段,因为每个时段是15分钟):

dfSummary['30min_ago_spread'] = dfSummary.groupby(['symbol', 'exchange'])['spread'].shift(2)

输出:

^{pr2}$

矢量化!!!!!!!!(嗯。。。大部分情况下)

想法是,使用“merge”(self-join),就像“DataFrame.lookup”一样,这是针对完全不同的应用程序,例如:Pandas DataFrame.lookup

从原始修复扩展。。。在

步骤1)利润损失.py\预先计算TM1、MonthStart、QuarterStart、yearlstart,因为无论如何都要调用它。在

import datetime
import time
import math

from Util import ObjectUtil
from Util import DateTimeUtil

import pandas as pd
import numpy as np

from Util import ObjectUtil

class ProfitLoss(object):
    def set(self, field, val):
        setattr(self, field, val)

    def to_dict(self):
        result = ObjectUtil.objectPropertiesToDictionary(self)
        result["TM1"] = self.COB - datetime.timedelta(days=1)
        result["MonthStart"] = self.COB.replace(day=1)
        result["QuarterStart"] = DateTimeUtil.getQuarterStart(self.COB, self.COB.year)
        result["YearStart"] = datetime.datetime(self.COB.year, 1, 1)

        return result

    @staticmethod
    def from_dict(dict):
        if dict is None:
            return None

        profitLosses = []
        for k, v in dict.items():
            numPnl = len(v)
            for i in range(0, numPnl):
                pnl = ProfitLoss()
                profitLosses.append(pnl)
            break

        for k, v in dict.items():
            if k == "from_dict":
                break

            i = 0
            for val in v.values():
                if isinstance(val, pd.Timestamp):
                    val = datetime.datetime(val.year, val.month, val.day)

                val = None if val == np.nan else val

                if isinstance(val, float) and math.isnan(val):
                    val = None

                profitLosses[i].set(k, val)
                i+=1

        return profitLosses

第2步)合并(即自连接),而不是DataFrame。应用或DataFrame。查找

^{pr2}$

实际上,我不确定merge/self-join是否比显式循环更高效。另外,我还不知道该怎么做,尽管夏普比率MaxDrawdown!!熊猫的窗口功能似乎没有帮助。。。在

人?!谢谢!!在

相关问题 更多 >