矢量化查找另一行+计算字段

2024-09-19 23:38:32 发布

您现在位置:Python中文网/ 问答频道 /正文

我有这个数据框“dfSummary”——

exchangeBalances = [['ETHBTC','binance',10], ['LTCBTC','binance',10], ['XRPBTC','binance',10], ['ETHBTC','bitfinex',10], ['LTCBTC','bitfinex',10], ['XRPBTC','bitfinex',10]]
bidOffers = [
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 15)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 15)],
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 30)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 30)], 
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 45)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 45)]
             ]
dfExchangeBalances = pd.DataFrame(exchangeBalances, columns=['symbol','exchange','balance'])
dfBidOffers = pd.DataFrame(bidOffers, columns=['symbol','exchange','bid', 'offer', 'created'])
dfBidOffers["spread"] = dfBidOffers["bid"] - dfBidOffers["offer"]
dfSummary = dfExchangeBalances.merge(dfBidOffers, how='left', on=['symbol','exchange'])

我需要完成的是,在“dfSummary”中添加一个计算字段:

currentRow["Spread"] - someOtherRow["Spread"]

“someOtherRow”是基于“已创建”(例如,最后一行具有相同的{symbol,exchange}但在30分钟前“已创建”(与“currentRow”相比)的查找

澄清:上面的例子是对手头实际问题的简化。时间间隔不完全是15分钟。事实上,我需要在DataFrame中查找相应的记录(相同的键={symbol,exchange}),但第一个这样的记录是在第一个月、第一个季度和第一年创建的

我试图避免在DataFrame.iter上手动循环,而是使用Pandas内置查找(矢量化)

我在考虑数据帧。查找Vectorized look-up of values in Pandas dataframe 但不确定如何从计算字段的上下文中使用它…?同样,我希望对相同的数据帧进行查找,而不是对不同的数据帧进行查找

矢量化(熊猫和Numpy-vs循环):


Tags: 数据httpscomdataframedatetimeexchangebinancesymbol
3条回答

我明白了,这是我的真实代码(所以我不会发布所有内容)。这将起作用(但不确定是否以最快的方式实施)

我使用的是数据帧.apply。这不是矢量化的方式,但应该比python中的循环快得多。有人能告诉我们如何用矢量化的方式重写下面的内容吗

参考本文-https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

。。。我无法用矢量化的方式来重写,鉴于查找的性质,我开始觉得下面的内容不能矢量化(如果你们中有人能证明我错了,我很高兴):

pdPnl = pd.DataFrame.from_records([ObjectUtil.objectPropertiesToDictionary(pnl) for pnl in profitLosses], columns=ObjectUtil.objectPropertiesToDictionary(profitLosses[0]).keys())
pdPnl["TM1"] = pdPnl.apply(lambda rw : rw["COB"] - timedelta(days=1) , axis=1)
pdPnl["MonthStart"] = pdPnl.apply(lambda rw : rw["COB"].replace(day=1), axis=1)
pdPnl["QuarterStart"] = pdPnl.apply(lambda rw : DateTimeUtil.getQuarterStart(rw["COB"], rw["COB"].year), axis=1)
pdPnl["YearStart"] = pdPnl.apply(lambda rw : datetime(rw["COB"].year, 1, 1), axis=1)
pdPnl["DTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["DTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["MTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["MTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["YTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["YTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)

pdPnl["SharpeRatio"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeSharpeRatio(pdPnl, rw["COB"]), axis=1)
pdPnl["MaxDrawDown"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeMaxDrawDown(pdPnl, rw["COB"]), axis=1)

pnlDict = pdPnl.to_dict()  # Then convert back to List of ProfitLoss (Slow...)

查找功能包括:

@staticmethod
def lookUpRow(pdPnl, cob):
    return pdPnl[pdPnl["COB"]==cob]

@staticmethod
def computeField(pdPnl, cob, todaysPnl, targetField):
    val = np.nan
    otherRow = PnlCalculatorBase.lookUpRow(pdPnl, cob)
    if otherRow is not None and otherRow[targetField].shape[0]>0:
        try:
            tm1InceptionRealizedPnl = otherRow[targetField].iloc[0]
            val = todaysPnl - tm1InceptionRealizedPnl
        except:
            # slow...
            errMsg = "Failed lookup for " + str(cob) + " " + targetField
            logging.error(errMsg)
            val = np.nan
    return val


@staticmethod
def computeSharpeRatio(pdPnl, cob):
    val = None
    pdPnl = pdPnl[(pdPnl['COB']<=cob)]
    pdPnl = pdPnl.loc[:,["COB", "DTDRealizedPnl","DTDUnrealizedPnl"]]
    pdPnl["TotalDTD"] = pdPnl.apply(lambda rw : rw["DTDRealizedPnl"] + rw["DTDUnrealizedPnl"], axis=1)

    # @todo, We don't have risk free rate for Sharpe Ration calc. Here's just total DTD avg return over standard deviation
    # https://en.wikipedia.org/wiki/Sharpe_ratio
    mean = pdPnl["TotalDTD"].mean()
    std = pdPnl["TotalDTD"].std()
    val = mean / std

    return val

@staticmethod
def computeMaxDrawDown(pdPnl, cob):
    val = None
    pdPnl = pdPnl[(pdPnl['COB']<=cob) & (pdPnl["DTDRealizedPnl"]<0)]
    val = pdPnl["DTDRealizedPnl"].min()
    return val

矢量化!!!!!!!!(嗯……大部分情况下)

想法是,使用“合并”(自连接)作为“DataFrame.查找”,它适用于完全不同的应用程序,例如:Pandas DataFrame.lookup

从原始修复扩展

步骤1)ProfitLoss.py\用于预测TM1、月开始、季度开始、年开始,因为无论如何都要调用它

import datetime
import time
import math

from Util import ObjectUtil
from Util import DateTimeUtil

import pandas as pd
import numpy as np

from Util import ObjectUtil

class ProfitLoss(object):
    def set(self, field, val):
        setattr(self, field, val)

    def to_dict(self):
        result = ObjectUtil.objectPropertiesToDictionary(self)
        result["TM1"] = self.COB - datetime.timedelta(days=1)
        result["MonthStart"] = self.COB.replace(day=1)
        result["QuarterStart"] = DateTimeUtil.getQuarterStart(self.COB, self.COB.year)
        result["YearStart"] = datetime.datetime(self.COB.year, 1, 1)

        return result

    @staticmethod
    def from_dict(dict):
        if dict is None:
            return None

        profitLosses = []
        for k, v in dict.items():
            numPnl = len(v)
            for i in range(0, numPnl):
                pnl = ProfitLoss()
                profitLosses.append(pnl)
            break

        for k, v in dict.items():
            if k == "from_dict":
                break

            i = 0
            for val in v.values():
                if isinstance(val, pd.Timestamp):
                    val = datetime.datetime(val.year, val.month, val.day)

                val = None if val == np.nan else val

                if isinstance(val, float) and math.isnan(val):
                    val = None

                profitLosses[i].set(k, val)
                i+=1

        return profitLosses

步骤2)合并(即自连接),而不是数据帧。应用或数据帧。查找

        pdPnl = pd.DataFrame.from_records([pnl.to_dict() for pnl in profitLosses])
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["TM1"], right_on=["COB"], suffixes = ('','_tm1'))
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["MonthStart"], right_on=["COB"], suffixes = ('','_MonthStart'))
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["QuarterStart"], right_on=["COB"], suffixes = ('','_QuaterStart'))
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["YearStart"], right_on=["COB"], suffixes = ('','_YearStart'))

        # Vectorized
        pdPnl["DTDRealizedPnl"] = pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_tm1"]
        pdPnl["DTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_tm1"]
        pdPnl["MTDRealizedPnl"] =  pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_MonthStart"]
        pdPnl["MTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_MonthStart"]
        pdPnl["YTDRealizedPnl"] = pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_YearStart"]
        pdPnl["YTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_YearStart"]

        # Not yet vectorized
        pdPnl["SharpeRatio"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeSharpeRatio(pdPnl, rw["COB"]), axis=1)
        pdPnl["MaxDrawDown"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeMaxDrawDown(pdPnl, rw["COB"]), axis=1)

        pnlDict = pdPnl.to_dict()
        updatedProfitLosses = ProfitLoss.ProfitLoss.from_dict(pnlDict)

实际上,我不确定合并/自连接是否比显式循环更有效。而且,我还没有弄清楚该怎么做Sharpe RatioMaxDrawdown!!熊猫的窗口功能似乎没有帮助

人?!谢谢

这是假设created有恒定的15分钟间隔。您可以groupby符号和交换,并向下移动2(两个时段,因为每个时段为15分钟):

dfSummary['30min_ago_spread'] = dfSummary.groupby(['symbol', 'exchange'])['spread'].shift(2)

输出:

   symbol exchange  balance   offer  spread  created               30min_ago_spread
0  ETHBTC  binance       10  0.0351 -0.0316  2018-09-01 08:15:00       NaN
1  ETHBTC  binance       10  0.0351 -0.0316  2018-09-01 08:30:00       NaN
2  ETHBTC  binance       10  0.0351 -0.0316  2018-09-01 08:45:00   -0.0316
3  LTCBTC  binance       10  0.0920 -0.0830  2018-09-01 08:15:00       NaN
4  LTCBTC  binance       10  0.0920 -0.0830  2018-09-01 08:30:00       NaN

相关问题 更多 >