How do I filter a Spark DataFrame by whether a URL exists?

Posted 2024-09-30 00:27:12


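I want to filter my Spark DataFrame, which has a column of URLs.

I tried to filter it with os.path.exists(col("url")), but I got the following error:

"string is needed, but column has been found".

Here is part of my code. It already works with pandas, and now I want to implement the same thing with Spark: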

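import os
import re

import numpy as np
import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
bob_ross = pd.read_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv", index_col=0)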
bob_ross['image'] = ""
# create a column for each of the 85 colors (these will be c0...c84)
# we'll do this in a separate table for now and then merge
cols = ['c%s'%i for i in np.arange(0,85)]
colors = pd.DataFrame(columns=cols)
colors['EPISODE'] = bob_ross.index.values
colors = colors.set_index('EPISODE')

# figure out if we have the image or not, we don't have a complete set
for s in bob_ross.index.values:
    b = bob_ross.loc[s]['TITLE']
    b = b.lower()
    b = re.sub(r'[^a-z0-9\s]', '',b)
    b = re.sub(r'\s', '_',b)
    img = b+".png"
    if (os.path.exists("/dbfs/mnt/umsi-data-science/si618wn2017/images/"+img)):
        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
        bob_ross.at[s, "image"] = "/dbfs/mnt/umsi-data-science/si618wn2017/images/"+img
        t = getColors("/dbfs/mnt/umsi-data-science/si618wn2017/images/"+img)
        colors.loc[s] = t

bob_ross = bob_ross.join(colors)
bob_ross = bob_ross[bob_ross.image != ""]

Below is how I tried to implement it with Spark. I am stuck on the line that raises the error:

[The Spark code block was not preserved in the source.]


1 answer

#1 · Posted 2024-09-30 00:27:12

You should use the DataFrame's filter function rather than an operating-system function.

For example:

df.filter("image is not NULL")

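os.path.exists takes a single path string and checks it against the local file system of the machine that calls it. col("url") is a Column expression describing a whole column, not a string, which is why you get that error. Spark also runs across many servers, so reaching for an OS-level function here is a sign that you are not using the right function.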
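If you still need a per-row existence check, one possible approach is to wrap the check in a UDF so that it receives each row's string value instead of the Column object. The following is only a minimal sketch under some assumptions: the helper names (title_to_image_name, existing_image_path, add_image_column) are hypothetical, the DataFrame is assumed to have a TITLE column as in the pandas code, and the image directory is assumed to be readable at the same path on every worker. The pyspark import is kept inside the last function so the plain-Python helpers can be tried without a cluster.

```python
import os
import re

# Directory taken from the question; assumed to be mounted on every worker.
IMAGE_DIR = "/dbfs/mnt/umsi-data-science/si618wn2017/images/"


def title_to_image_name(title):
    """Normalize a title the same way the pandas loop does."""
    name = title.lower()
    name = re.sub(r"[^a-z0-9\s]", "", name)  # drop punctuation
    name = re.sub(r"\s", "_", name)          # whitespace -> underscore
    return name + ".png"


def existing_image_path(title, image_dir=IMAGE_DIR):
    """Return the image path if the file exists, otherwise None."""
    if title is None:
        return None
    path = os.path.join(image_dir, title_to_image_name(title))
    return path if os.path.exists(path) else None


def add_image_column(df, image_dir=IMAGE_DIR):
    """Add an 'image' column to a Spark DataFrame and keep rows that have one.

    Hypothetical helper: assumes df has a string column named TITLE.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # The UDF is called once per row with a plain Python string (or None),
    # so os.path.exists gets the string it expects.
    to_path = F.udf(lambda t: existing_image_path(t, image_dir), StringType())
    return (
        df.withColumn("image", to_path(F.col("TITLE")))
          .filter("image is not NULL")
    )
```

Note that the UDF runs on the executors, so the check is made against each worker's view of the file system, and a Python UDF is slower than built-in column functions. For a small table, collecting the titles to the driver and checking there may be simpler.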
