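<p>In practice, whenever you need to restructure data that involves relationships between columns and tables, consider an SQL solution in a relational database management system (RDBMS), especially when your data already comes from a database. Leave pandas for the data analysis. Of course, if your large data set is not stored in a database, that is another matter entirely!</p>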
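<p>Python ships with a built-in library for <a href="https://docs.python.org/2/library/sqlite3.html" rel="nofollow">SQLite</a>, the popular free, open-source, file-level database. You can also install Python libraries for MySQL, SQL Server, PostgreSQL, Oracle, and other RDBMSs. Each of these connections integrates seamlessly with <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html" rel="nofollow">pandas</a>. Below are three equivalent versions of a query that computes the conditional group maximum. Each version assumes that the source table, named <code>RollingMax</code> here, maintains an autonumber primary key <code>ID</code>.</p>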
<pre><code>import sqlite3 as lite
import pandas as pd
# Open the SQLite database file (the name must match the one passed to read_sql below)
conn = lite.connect('C:\\Path\\SQLite\\DB.db')
# SQL WITH DERIVED TABLES
sql = """SELECT a, b,
(SELECT Max(dtbl2.B)
FROM
(SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1) dtbl2
WHERE dtbl1.ID >= dtbl2.ID
AND dtbl1.GrpA = dtbl2.GrpA) As rm
FROM
(
SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1
) As dtbl1;"""
# SQL USING A COMMON TABLE EXPRESSION (CTE), AVAILABLE AS OF SQLITE 3.8.3
sql = """WITH grp (ID, a, b, GrpA)
AS (
SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1
)
SELECT a, b,
(SELECT Max(dtbl2.B)
FROM grp AS dtbl2
WHERE dtbl1.ID >= dtbl2.ID
AND dtbl1.GrpA = dtbl2.GrpA) As rm
FROM grp AS dtbl1;"""
# SQL USING SAVED VIEW
# The view is created once and then stored inside the database
saved_view = """CREATE VIEW IF NOT EXISTS saved_view AS
SELECT t1.ID, t1.a, t1.b,
(SELECT Count(*) FROM RollingMax t2
WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
FROM RollingMax t1;"""
conn.execute(saved_view)
sql = """SELECT a, b,
(SELECT Max(dtbl2.B)
FROM saved_view AS dtbl2
WHERE dtbl1.ID >= dtbl2.ID
AND dtbl1.GrpA = dtbl2.GrpA) As rm
FROM saved_view As dtbl1;"""
df = pd.read_sql(sql, conn)
</code></pre>
<p><strong>Output</strong> <em>(the only challenge here is the first group, which has no preceding a == 1)</em></p>
^{pr2}$
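<p>To try the grouping logic without an existing database file, here is a minimal sketch that runs the CTE version against an in-memory SQLite database. The sample rows are hypothetical (they are not the data from the question), and the first two rows deliberately have no preceding <code>a == 1</code>, so they fall into group 0. As an aside, and as a swapped-in technique rather than one of the three queries above, the sketch also runs a true window-function variant when the linked SQLite library is 3.25 or newer, and checks that both approaches agree.</p>

```python
import sqlite3

# Hypothetical sample data: a == 1 starts a new group, b is the value to track.
rows = [(0, 3), (0, 5), (1, 2), (0, 4), (0, 1), (1, 7), (0, 6)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RollingMax ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT, a INTEGER, b INTEGER)"
)
conn.executemany("INSERT INTO RollingMax (a, b) VALUES (?, ?)", rows)

# CTE version: GrpA counts how many rows with a > 0 occur up to and
# including the current row; rm is the running maximum of b inside that group.
cte_sql = """
WITH grp (ID, a, b, GrpA) AS (
    SELECT t1.ID, t1.a, t1.b,
           (SELECT COUNT(*) FROM RollingMax t2
             WHERE t1.ID >= t2.ID AND t2.a > 0) AS GrpA
      FROM RollingMax t1
)
SELECT d1.a, d1.b,
       (SELECT MAX(d2.b) FROM grp AS d2
         WHERE d1.ID >= d2.ID AND d1.GrpA = d2.GrpA) AS rm
  FROM grp AS d1
 ORDER BY d1.ID;
"""
result = conn.execute(cte_sql).fetchall()

# Window-function variant (requires SQLite 3.25+), shown only for comparison.
window_sql = """
SELECT a, b, MAX(b) OVER (PARTITION BY GrpA ORDER BY ID) AS rm
  FROM (SELECT ID, a, b, SUM(a > 0) OVER (ORDER BY ID) AS GrpA
          FROM RollingMax)
 ORDER BY ID;
"""
if sqlite3.sqlite_version_info >= (3, 25, 0):
    window_result = conn.execute(window_sql).fetchall()
    assert window_result == result

for a, b, rm in result:
    print(a, b, rm)
```

<p>With a real database, the same query string can be passed to <code>pd.read_sql</code> together with the open connection, as in the answer above.</p>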