GeoPandas performance when converting from WKT

Posted 2024-05-07 14:16:23

I need to read roughly 10 million records from a PostGIS database into a GeoPandas GeoDataFrame. Reading directly from the database like this takes about 15 minutes:

geopandas.GeoDataFrame.from_postgis(sql, engine)

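That is acceptable, but I have been trying to improve read performance by using the PostgreSQL COPY command together with the copy_expert method of the raw database cursor obtained through SQLAlchemy. Reading the data into a pandas DataFrame this way takes about 60 seconds, which is a huge improvement: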

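import tempfile

import pandas


def read_data(engine, sql):
    # Stream the query result into a temporary CSV via COPY, then parse it with pandas.
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()  # raw DBAPI connection behind the SQLAlchemy engine
        try:
            cur = con.cursor()
            cur.copy_expert(copy_sql, tmpFile)
        finally:
            con.close()  # hand the connection back to the pool
        tmpFile.seek(0)  # rewind before reading
        df = pandas.read_csv(tmpFile)
        return df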

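When I tried the same approach but read the data into a GeoPandas GeoDataFrame instead, I ran into a problem with the temporary file being in use by another process: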

import tempfile

import geopandas


def read_data(engine, sql):
    with tempfile.NamedTemporaryFile(suffix='.csv') as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        # This call raises the DriverError shown below on Windows,
        # most likely because tmpFile is still open at this point.
        gdf = geopandas.read_file(tmpFile.name)
        return gdf

fiona.errors.DriverError: C:\Temp\4\tmpiuu6dvl4.csv: file used by other process
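
The error above is consistent with documented Windows behaviour: a file created by tempfile.NamedTemporaryFile generally cannot be opened a second time by name while the original handle is still open. Below is a minimal sketch of one common workaround, which is not something the original post tried: create the file with delete=False, close it before anything reopens it by path, and delete it manually afterwards. The helper name write_then_reopen and the sample payload are hypothetical; the write call stands in for cur.copy_expert, and the plain open() stands in for any reader that takes a path.

```python
import os
import tempfile


def write_then_reopen(payload: bytes) -> bytes:
    """Write payload to a named temp file, close it, then reopen it by name."""
    # delete=False keeps the file on disk after the handle is closed.
    tmp = tempfile.NamedTemporaryFile(suffix=".csv", delete=False)
    try:
        tmp.write(payload)  # stand-in for cur.copy_expert(copy_sql, tmp)
        tmp.close()         # release the handle before any second open
        with open(tmp.name, "rb") as reader:  # stand-in for a path-based reader
            return reader.read()
    finally:
        tmp.close()          # harmless if already closed
        os.remove(tmp.name)  # manual cleanup, since delete=False


data = write_then_reopen(b"id,geom\n1,POINT (1 2)\n")
```

This only addresses the file lock. Whether geopandas.read_file then interprets the geometry column of such a CSV as intended is a separate question and is not shown here.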

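I tried various ways to release the lock on the temporary file, without success, so I went back to reading the data into a pandas DataFrame and then converting the geometry column. This works, but it takes about as long as reading directly from the database into a GeoDataFrame: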

import tempfile

import geopandas
import pandas


def read_data(engine, sql):
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        try:
            cur = con.cursor()
            cur.copy_expert(copy_sql, tmpFile)
        finally:
            con.close()  # hand the connection back to the pool
        tmpFile.seek(0)
        df = pandas.read_csv(tmpFile)
        # Parse the WKT strings into geometry objects (the slow step).
        df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])
        return geopandas.GeoDataFrame(df, geometry='geom', crs='EPSG:3857')

The part that takes a long time is the conversion from WKT to a GeoSeries:

df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])

Does anyone know how to solve the locked-file problem, or how to speed up the conversion from WKT to a GeoSeries?

Thanks

1 answer

Answer 1 · Posted 2024-05-07 14:16:23

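GeoPandas has to create the geometry objects, and that takes time. It makes no difference whether you use GeoDataFrame.from_postgis or custom code: even if the read_data version that calls read_file worked, you would still end up with a WKT/WKB representation of the geometries and would have to call from_wkt anyway.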

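GeoPandas currently relies on shapely for the conversion, but it has experimental support for pygeos, which may be faster. Make sure pygeos is installed in your environment and then retry GeoDataFrame.from_postgis. That code is already well optimized, so I do not believe you can easily get a speed-up with custom code.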

To get pygeos:

# conda
conda install pygeos --channel conda-forge
# pip
pip install pygeos

See https://geopandas.readthedocs.io/en/latest/getting_started/install.html#using-the-optional-pygeos-dependency
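
As a hedged aside on the conversion step: the vectorized functions that originated in pygeos were later merged into Shapely 2.0, so with Shapely 2.0 or newer a whole array of WKT strings can be parsed in one call. The sketch below assumes Shapely >= 2.0 and NumPy are installed and uses two made-up points in place of the real geom column. It also shows the same geometries round-tripped through hex-encoded WKB, which is often cheaper to parse than WKT, though that is worth measuring on your own data rather than taking as given.

```python
import numpy as np
import shapely  # assumes Shapely >= 2.0

# Hypothetical sample data standing in for the 'geom' column of the CSV.
wkt_values = np.array(["POINT (1 2)", "POINT (3 4)"], dtype=object)

# Parse the whole array in one vectorized call instead of row by row.
geoms = shapely.from_wkt(wkt_values)

# Hex-encoded WKB for the same geometries (here produced locally for illustration;
# in practice the query would have to return WKB instead of WKT).
hex_wkb = shapely.to_wkb(geoms, hex=True)

# from_wkb accepts hex strings as well as raw bytes.
geoms_from_wkb = shapely.from_wkb(hex_wkb)
```

If the column holds WKB rather than WKT, GeoSeries.from_wkb is the GeoPandas counterpart of GeoSeries.from_wkt.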
