Determine which polygon a point lies in, then apply that polygon's name as a new column to a large dataframe

import pandas as pd

# Initialise the data as a dict of lists.
ships_data = {
    'imoNo': [9321483, 9321483, 9321483, 9321483, 9321483],
    'Timestamp': ['2020-02-22 00:00:00', '2020-02-22 00:10:00', '2020-02-22 00:20:00', '2020-02-22 00:30:00', '2020-02-22 00:40:00'],
    'Position Longitude': [127.814598, 127.805634, 127.805519, 127.808548, 127.812370],
    'Position Latitude': [33.800232, 33.801899, 33.798885, 33.795799, 33.792931]
}

# Create DataFrame
ships_df = pd.DataFrame(ships_data)

### Load libraries
import numpy as np
import pandas as pd
import geopandas as gp
import shapely.speedups
from shapely.geometry import Point, Polygon
shapely.speedups.enable()

### Check and map lon lat pair with sea name
def get_seaname(long,lat):
    pnt = Point(long,lat)
    for i,j in enumerate(iho_df.geometry):
        if pnt.within(j):
            return iho_df.NAME.iloc[i]

### Apply the above function to the dataframe
ships_df['sea_name'] = ships_df.apply(lambda x: get_seaname(x['Position Longitude'], x['Position Latitude']), axis=1)

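2 Answers

User

#1 · Edited 2024-06-14 19:21:08

In the end, I found a solution that is faster than the approach in the original question.

First, I used the information in the IHO Sea Areas dataset to create a polygon describing the bounding box of each sea.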

# Create a bbox polygon
iho_df['bbox'] = iho_df.apply(lambda x: Polygon([(x['min_X'], x['min_Y']), (x['min_X'], x['max_Y']), (x['max_X'], x['max_Y']), (x['max_X'], x['min_Y'])]), axis=1)
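
If the min_X/max_X/min_Y/max_Y columns were not available, the same rectangle could probably be derived from each geometry's own bounds. A minimal sketch, assuming Shapely is installed; the triangle `sea` is a made-up stand-in for one row of iho_df.geometry:

```python
from shapely.geometry import Polygon, box

# Hypothetical stand-in for a single sea polygon
sea = Polygon([(0, 0), (4, 0), (0, 3)])

# .bounds returns (minx, miny, maxx, maxy); box() builds the matching rectangle
sea_bbox = box(*sea.bounds)

# On a GeoDataFrame this would look something like:
# iho_df['bbox'] = iho_df.geometry.apply(lambda g: box(*g.bounds))
```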

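Then I changed the function so that it checks the bbox first (this is much faster than checking the full geometry, because a bbox is just a rectangle). When a point falls inside more than one bbox (which happens near sea boundaries), the function checks the original polygons, but only for the matching bboxes rather than for all polygons, to find the correct sea name.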

# Function that checks and maps lon lat pair with sea name
def get_seaname(long,lat):
    pnt = Point(long,lat)
    names = []
    # Check within each bbox first to note the polygons to look at
    for i,j in enumerate(iho_df.bbox):
        if pnt.within(j):
            names.append(iho_df.NAME.iloc[i])
    # Return nan for no return
    if len(names)==0:
        return np.nan
    # Return the single name of the sea 
    elif len(names)==1:
        return names[0]
    # Run the pnt.within() only for the polygons within the collected sea names
    else:
        limited_df = iho_df[iho_df['NAME'].isin(names)].reset_index(drop=True)
        for k,n in enumerate(limited_df.geometry):
            if pnt.within(n):
                return limited_df.NAME.iloc[k]
        # None of the candidate polygons actually contains the point
        return np.nan
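
The two-stage idea (cheap rectangle test first, exact polygon test only for the candidates) can be illustrated without any geometry library. The following is a dependency-free sketch, not the code used in this answer: `SEAS`, `bbox`, `point_in_polygon` and `find_sea` are hypothetical names, the polygons are toy data, and the exact test is a plain ray-casting check standing in for `pnt.within()`. Unlike the function above, it runs the exact test even when only one bbox matches.

```python
# Hypothetical toy "seas": A and B are triangles sharing the same bounding box.
SEAS = {
    "Sea A": [(0, 0), (4, 0), (0, 4)],
    "Sea B": [(4, 0), (4, 4), (0, 4)],
    "Sea C": [(10, 0), (12, 0), (12, 2), (10, 2)],
}

def bbox(ring):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs, ys = zip(*ring)
    return min(xs), min(ys), max(xs), max(ys)

# Bounding boxes are computed once, not once per point.
BBOXES = {name: bbox(ring) for name, ring in SEAS.items()}

def point_in_polygon(x, y, ring):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def find_sea(x, y):
    """Stage 1: bbox pre-filter. Stage 2: exact test on the candidates only."""
    candidates = [
        name
        for name, (min_x, min_y, max_x, max_y) in BBOXES.items()
        if min_x <= x <= max_x and min_y <= y <= max_y
    ]
    for name in candidates:
        if point_in_polygon(x, y, SEAS[name]):
            return name
    return None  # no polygon contains the point
```

Here `find_sea(1, 1)` passes the bbox test for both Sea A and Sea B, so the exact test decides between them, while `find_sea(20, 20)` is rejected by the rectangles alone.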

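This reduced the running time considerably. To improve performance further, I used multiprocessing to parallelize the work across the CPU cores. The idea comes from another StackOverflow post that I cannot recall right now, but here is the code.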

import multiprocessing as mp

# Function that parallelizes the apply function among the cores of the CPU
def parallelize_dataframe(df, func, n_cores):
    df_split = np.array_split(df, n_cores)
    # Pool lives in the multiprocessing module, imported above as mp
    pool = mp.Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

# Function that adds a sea_name column in the main dataframe
def add_features(df):
    # Apply the function
    df['sea_name'] = df.apply(lambda x: get_seaname(x['Position Longitude'], x['Position Latitude']), axis=1)
    return df

Finally, instead of calling apply with get_seaname() directly, I pass add_features() to parallelize_dataframe() so that it runs on all available CPU cores:

### Apply the above function to the dataframe
ships_df = parallelize_dataframe(ships_df, add_features, n_cores=mp.cpu_count())
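
When this pattern is run as a script, the pool generally needs to be created under an `if __name__ == "__main__":` guard, because on platforms that start worker processes with "spawn" (Windows, and macOS by default) each worker re-imports the main module. The sketch below shows the same split / map / concatenate structure using only the standard library. It is an illustration under assumptions: `label_point` is a hypothetical stand-in for get_seaname(), and plain lists replace the dataframe.

```python
import multiprocessing as mp

def label_point(point):
    """Hypothetical stand-in for get_seaname(): label a (lon, lat) pair."""
    lon, lat = point
    return "east" if lon >= 0 else "west"

def label_chunk(chunk):
    """Work done by one worker: label every point in its chunk."""
    return [label_point(p) for p in chunk]

def split_chunks(rows, n_chunks):
    """Split rows into n_chunks contiguous parts of nearly equal size."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

def parallel_label(rows, n_cores):
    """Map label_chunk over the chunks in a pool and flatten the results in order."""
    with mp.Pool(n_cores) as pool:
        results = pool.map(label_chunk, split_chunks(rows, n_cores))
    return [label for part in results for label in part]

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    points = [(127.8, 33.8), (-70.0, 40.0), (10.0, 55.0), (-5.0, 36.0)]
    print(parallel_label(points, n_cores=2))
```

The worker functions are defined at module level so that they can be pickled and sent to the worker processes.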

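I hope my solution helps others too.

User

#2 · Edited 2024-06-14 19:21:08

Try this.

Build the points with apply (I could not find a faster way to do this step; help is welcome):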

import numpy as np
import pandas as pd
import geopandas as gp
import shapely.speedups
from shapely.geometry import Point, Polygon
shapely.speedups.enable()

# I am still uncomfortable with this. More ideas on speeding up this part are welcome
ships_df['point'] = ships_df.apply(lambda x: Point(x['Position Longitude'], x['Position Latitude']), axis=1)

Now change the function so that it takes a Point:

def get_seaname(pnt:Point):
    for i,j in enumerate(iho_df.geometry):
        if pnt.within(j):
            return iho_df.NAME.iloc[i]

Since your function works on a single point, convert the point column into an array of Point objects and vectorize the function:

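# otypes=[object] keeps None for unmatched points rather than risking coercion to strings
get_seaname = np.vectorize(get_seaname, otypes=[object])

# Reuse the original index so the new column aligns with ships_df
ships_df['sea_name'] = pd.Series(get_seaname(ships_df['point'].values), index=ships_df.index)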
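
np.vectorize is documented by NumPy as a convenience wrapper that is essentially a for loop, so it is unlikely to be much faster than apply. One alternative that may be worth trying is a spatial join, which lets GeoPandas use a spatial index instead of testing every polygon for every point. This is a swapped-in technique, not the method used in either answer. The sketch assumes a GeoPandas version that accepts the `predicate` argument of `sjoin`; `seas` is a hypothetical stand-in for iho_df and `ships` for ships_df.

```python
import geopandas as gp
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical stand-in for iho_df: two square "seas" with a NAME column
seas = gp.GeoDataFrame(
    {"NAME": ["Sea A", "Sea B"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),
    ],
    crs="EPSG:4326",
)

# Hypothetical stand-in for ships_df
ships = pd.DataFrame(
    {"Position Longitude": [1.0, 3.0, 9.0], "Position Latitude": [1.0, 1.0, 9.0]}
)

# Build all point geometries in one call instead of row by row
points = gp.GeoDataFrame(
    ships,
    geometry=gp.points_from_xy(ships["Position Longitude"], ships["Position Latitude"]),
    crs="EPSG:4326",
)

# Left join keeps every ship row; unmatched points get NaN in NAME
joined = gp.sjoin(points, seas[["NAME", "geometry"]], how="left", predicate="within")

# If polygons overlap, a point can match more than once; keep the first match per row
joined = joined[~joined.index.duplicated(keep="first")]

# Assignment aligns on the shared index
ships["sea_name"] = joined["NAME"]
```

The third point lies outside both squares, so its sea_name stays missing.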
