Parallelizing a custom DataFrame function with Dask

Published 2024-09-30 08:16:42

Male | A programmer who enjoys writing Python code.

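I am trying to use Dask's multiprocessing support to speed up a for-loop operation over a pandas DataFrame. I am fully aware that looping over a DataFrame is generally not best practice, but in my case it is required. I have read the documentation and other similar questions extensively, but I cannot seem to solve my problem.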

df.head()
         Title                                                                                                                                       Content
0  Lizzibtz     @Ontario2020 @Travisdhanraj @fordnation Maybe.  They are not adding to the stress of education during Covid. Texas sample.  Plus…  
1  Jess 🌱🛹🏳️‍🌈  @BetoORourke So ashamed at how Abbott has not handled COVID in Texas. A majority of our large cities are hot spots with no end in sight.    
2  sidi diallo  New post (PVC Working Gloves) has been published on Covid-19 News Info - Texas test                    
3  Kautillya    @PandaJay What was the need to go to SC for yatra anyway? Isn't covid cases spiking exponentially? Ambubachi mela o… texas
4  SarahLou♡    RT @BenJolly9: 23rd June 2020 was the day Sir Keir Starmer let the Tories off the hook for their miss-handling of COVID-19. texas

I have a custom Python function, defined as follows:

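import numpy as np
import spacy

# Assumption: the question does not say which spaCy model is loaded
nlp = spacy.load("en_core_web_sm")

def locMp(df):
    """Add a 'Locations' column with the unique GPE entities in each row's 'Content'."""
    hitList = []
    for i in range(len(df)):
        print(i)  # progress output
        string = df.iloc[i]['Content']
        doc = nlp(string)
        # Keep only entities labelled as geopolitical entities (GPE)
        ents = [e.text for e in doc.ents if e.label_ == "GPE"]
        x = np.array(ents)
        print(np.unique(x))  # debug output
        hitList.append(np.unique(x))

    df['Locations'] = hitList
    return df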

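This function adds a DataFrame column containing locations extracted with a library called spaCy. I don't think that detail matters, but I wanted you to see the whole function.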

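Now, according to the documentation and some other questions, the way to use Dask's multiprocessing on a DataFrame is to create a Dask DataFrame, partition it, call map_partitions, and then call .compute(). So I tried the following, along with a few other approaches, without success: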

part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(lambda df: df.apply(locMp), meta=pd.DataFrame).compute()

# and...

part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(locMp, meta=pd.DataFrame).compute()

# and simplifying from Dask documentation

part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(locMp)

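I tried a few other approaches with dask.delayed, but nothing seems to work. I either get a Dask Series, get some other unwanted output, or the function takes as long as or longer than running it normally. How can I use Dask to speed up a custom DataFrame function and get back a clean DataFrame?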

Thanks, everyone.


1 Answer

User

#1 · Posted 2024-09-30 08:16:42

You can try letting Dask handle the apply instead of writing the loop yourself:

ddf["Locations"] = ddf["Content"].apply(
    lambda string: [e.text for e in nlp(string).ents if e.label_ == "GPE"],
    meta=("Content", "object"))
