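Merge pandas dataframes where one value is between two others

3 Answers

User

Answer 1 · edited 2024-07-05 11:40:33

There is currently no idiomatic pandas way of doing this.

This answer used to be about tackling the problem with polymorphism, which turned out to be a very bad idea.

Then the numpy.piecewise function showed up in another answer, but with little explanation, so I thought I would clarify how this function can be used.

The NumPy way with piecewise (memory heavy)

The np.piecewise function can be used to generate the behavior of a custom join. There is a lot of overhead involved and it is not very efficient, but it does the job.

Producing the conditions for joining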

import numpy as np
import pandas as pd
from datetime import datetime


presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})

start_end_date_tuples = zip(terms.start_date.values, terms.end_date.values)
conditions = [(war_declarations.date.values >= start_date) &
              (war_declarations.date.values <= end_date) for start_date, end_date in start_end_date_tuples]

> conditions
[array([ True,  True], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool)]

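This is a list of arrays, one per term, where each array tells us whether that term's time span matched each of the two war declarations we have. The conditions can explode with larger datasets, because their total size is the length of the left df multiplied by the length of the right df.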
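To make the size of the conditions concrete, here is a minimal sketch that builds the same kind of mask with NumPy broadcasting instead of a list comprehension. The arrays `starts`, `ends` and `dates` are hypothetical stand-ins for the columns used above, not the original data objects.

```python
import numpy as np

# Hypothetical stand-ins for terms.start_date, terms.end_date
# and war_declarations.date.
starts = np.array(["2001-01-31", "2005-01-31", "2009-01-31"], dtype="datetime64[D]")
ends = np.array(["2005-01-31", "2009-01-31", "2013-01-31"], dtype="datetime64[D]")
dates = np.array(["2001-09-14", "2003-03-03"], dtype="datetime64[D]")

# Broadcasting (n_terms, 1) against (n_dates,) yields one row per term
# and one column per war declaration.
mask = (dates >= starts[:, None]) & (dates <= ends[:, None])

print(mask.shape)  # (3, 2)
print(mask.size)   # 6, i.e. len(starts) * len(dates)
```

Each row of `mask` plays the role of one array in `conditions`, so `list(mask)` could be passed on in the same way. The memory cost is the same product of the two lengths either way.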

The piecewise "magic"

Now piecewise will take the president_id from terms and place it in the war_declarations dataframe for each of the corresponding wars.

war_declarations['president_id'] = np.piecewise(np.zeros(len(war_declarations)),
                                                conditions,
                                                terms.president_id.values)
    date        name                president_id
0   2001-09-14  War in Afghanistan          43.0
1   2003-03-03  Iraq War                    43.0
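For readers who have not used np.piecewise with plain values before, a tiny sketch on made-up inputs may help. The names `x`, `condlist` and `values` are illustrative only. It shows why the ids come back as floats (the output takes the dtype of the zeros array) and that rows matching no condition keep the value 0.

```python
import numpy as np

x = np.zeros(3)  # one slot per row of the left table; float64 by default

# One boolean array per row of the right table.
condlist = [
    np.array([True, False, False]),
    np.array([False, True, False]),
]
values = [43, 44]  # the value to write wherever the matching condition is True

result = np.piecewise(x, condlist, values)
print(result)  # [43. 44.  0.]
```

If an integer column is needed afterwards, the result can be cast with `.astype(int)` once it is clear that 0 is an acceptable marker for "no match".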

Now, to finish this example, we just need a regular merge to bring in the presidents' names.

war_declarations.merge(presidents, on="president_id", suffixes=["_war", "_president"])

    date        name_war            president_id    name_president
0   2001-09-14  War in Afghanistan          43.0    Bush
1   2003-03-03  Iraq War                    43.0    Bush

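Polymorphism (does not work)

I wanted to share my research efforts, so even if this does not solve the problem, I hope it can at least live on here as a useful reply. Since the error is hard to spot, someone else may try this and think they have a working solution when in fact they don't.

The only other way I could come up with is to create two new classes, one PointInTime and one Timespan.

Both should have __eq__ methods that return True if a PointInTime is compared to a Timespan which contains it.

After that you can fill your DataFrames with these objects and join on the columns they live in.

Something like this: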

class PointInTime(object):

    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        return other.start_date < self.dt < other.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        return self.start_date < other.dt < self.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day,
                                             self.end_date.year, self.end_date.month, self.end_date.day)

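Important note: I did not subclass datetime, because pandas would treat the dtype of a column of datetime objects as a datetime dtype, and since the Timespan is not one, pandas silently refuses to merge on them.

If we instantiate two objects of these classes, they can now be compared: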

pit = PointInTime(2015,1,1)
ts = Timespan(datetime(2014,1,1), datetime(2015,2,2))
pit == ts
True

We can also fill two DataFrames with these objects:

df = pd.DataFrame({"pit":[PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3)]})

df2 = pd.DataFrame({"ts":[Timespan(datetime(2015,2,1), datetime(2015,2,5)), Timespan(datetime(2015,2,1), datetime(2015,4,1))]})

And then the merge kind of works:

pd.merge(left=df, left_on='pit', right=df2, right_on='ts')

        pit                    ts
0  2015-2-2  2015-2-1 -> 2015-2-5
1  2015-2-2  2015-2-1 -> 2015-4-1

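But only kind of.

PointInTime(2015,3,3) should also have been included in this join, matched to Timespan(datetime(2015,2,1), datetime(2015,4,1)).

But it is not.

My guess is that pandas compares PointInTime(2015,3,3) to PointInTime(2015,2,2) and assumes that, since they are not equal, PointInTime(2015,3,3) cannot be equal to Timespan(datetime(2015,2,1), datetime(2015,4,1)), because this timespan was equal to PointInTime(2015,2,2).

Sort of like this: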

Rose == Flower
Lilly != Rose

Therefore:

Lilly != Flower

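Edit:

I tried making all PointInTime objects equal to each other. This changed the behaviour of the join so that 2015-3-3 was included, but 2015-2-2 was then only matched with the Timespan 2015-2-1 -> 2015-2-5, which strengthens my hypothesis above.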
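The hypothesis above can be illustrated with a small pure-Python sketch. This is not how pandas actually implements merge. `naive_factorize`, `Point` and `Span` are hypothetical names, and the function only mimics the general idea of giving each join key the integer code of the first earlier key it compares equal to. With an equality that is not transitive, the outcome depends on which key happened to be seen first.

```python
from datetime import datetime

class Span:
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __eq__(self, other):
        # A span "equals" any point that lies strictly inside it.
        return isinstance(other, Point) and self.start < other.dt < self.end

class Point:
    def __init__(self, dt):
        self.dt = dt

    def __eq__(self, other):
        if isinstance(other, Span):
            return other == self
        return isinstance(other, Point) and self.dt == other.dt

def naive_factorize(values):
    """Give each value the code of the first earlier value it equals."""
    uniques, codes = [], []
    for value in values:
        for code, seen in enumerate(uniques):
            if seen == value:
                codes.append(code)
                break
        else:
            uniques.append(value)
            codes.append(len(uniques) - 1)
    return codes

keys = [
    Point(datetime(2015, 2, 2)),
    Point(datetime(2015, 3, 3)),
    Span(datetime(2015, 2, 1), datetime(2015, 2, 5)),
    Span(datetime(2015, 2, 1), datetime(2015, 4, 1)),
]
codes = naive_factorize(keys)
print(codes)  # [0, 1, 0, 0]
```

Both spans end up with the code of 2015-2-2, so 2015-3-3 (code 1) never matches anything, even though it lies inside the second span. That is the same shape of result as the merge output shown earlier.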

If anyone has any other ideas, please comment and I can try them.

User

Answer 2 · edited 2024-07-05 11:40:33

You should now be able to do this using the pandasql package:

import pandasql as ps

sqlcode = '''
select A.cusip
from A
inner join B on A.cusip=B.ncusip
where A.fdate >= B.namedt and A.fdate <= B.nameenddt
group by A.cusip
'''

newdf = ps.sqldf(sqlcode,locals())

I think the answer from "ChuHo" is good, and I believe pandasql does the same thing for you. I haven't benchmarked the two, but this one is easier to read.
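For comparison, the join-then-filter logic of that query can be sketched in plain pandas. The frames `A` and `B` below are made up for illustration, since the answer does not show them. Only the column names are taken from the SQL.

```python
import pandas as pd

# Hypothetical data; only the column names come from the query above.
A = pd.DataFrame({
    "cusip": ["001", "001", "002"],
    "fdate": pd.to_datetime(["2010-06-30", "2015-06-30", "2012-06-30"]),
})
B = pd.DataFrame({
    "ncusip": ["001", "002"],
    "namedt": pd.to_datetime(["2009-01-01", "2013-01-01"]),
    "nameenddt": pd.to_datetime(["2011-12-31", "2014-12-31"]),
})

# Equality join first, then keep only the rows whose date falls in the range.
joined = A.merge(B, left_on="cusip", right_on="ncusip", how="inner")
in_range = joined[joined["fdate"].between(joined["namedt"], joined["nameenddt"])]

# "group by A.cusip" with only A.cusip selected amounts to the distinct ids.
matched = in_range["cusip"].drop_duplicates().tolist()
print(matched)  # ['001']
```

This only works because there is an equality key to merge on first. Without one, the intermediate result grows to the full product of both tables, just like the conditions in the piecewise approach.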

User

Answer 3 · edited 2024-07-05 11:40:33

As you say, this is pretty easy in SQL, so why not do it in SQL?

import pandas as pd
import sqlite3
from datetime import datetime

#We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)

qry = '''
    select  
        start_date PresTermStart,
        end_date PresTermEnd,
        wars.date WarStart,
        presidents.name Pres
    from
        terms join wars on
        date between start_date and end_date join presidents on
        terms.president_id = presidents.president_id
    '''
df = pd.read_sql_query(qry, conn)

The resulting dataframe:

         PresTermStart          PresTermEnd             WarStart  Pres
0  2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush
1  2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush
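Judging by the output above, the dates end up in SQLite as ISO-formatted text, which is an assumption on my part about how to_sql stored them. Text in that format sorts in chronological order, which is why BETWEEN gives the right answer. A minimal sketch using only the standard library, with hand-written rows instead of the dataframes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table terms (start_date text, end_date text, president_id integer)")
conn.execute("create table wars (date text, name text)")

# Hand-written rows in the same text format as the output above.
conn.executemany("insert into terms values (?, ?, ?)", [
    ("2001-01-31 00:00:00", "2005-01-31 00:00:00", 43),
    ("2009-01-31 00:00:00", "2013-01-31 00:00:00", 44),
])
conn.executemany("insert into wars values (?, ?)", [
    ("2001-09-14 00:00:00", "War in Afghanistan"),
    ("2003-03-03 00:00:00", "Iraq War"),
])

# The range condition is a plain string comparison on ISO-formatted text.
rows = conn.execute("""
    select wars.name, terms.president_id
    from terms join wars
        on wars.date between terms.start_date and terms.end_date
    order by wars.date
""").fetchall()
print(rows)  # [('War in Afghanistan', 43), ('Iraq War', 43)]
```

If the dates were stored in another text format, such as day/month/year, the same query would compare them alphabetically and could return wrong matches without any error.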
