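I want to filter my Spark DataFrame. The DataFrame has a column of URLs.
I tried to filter it with os.path.exists(col("url")),
but I got the following error:
"string is needed, but column has been found".
Here is part of my code. It currently uses pandas, and I now want to implement the same thing with Spark: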
    bob_ross = pd.DataFrame.from_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv")
    bob_ross['image'] = ""
    # create a column for each of the 85 colors (these will be c0...c84)
    # we'll do this in a separate table for now and then merge
    cols = ['c%s'%i for i in np.arange(0,85)]
    colors = pd.DataFrame(columns=cols)
    colors['EPISODE'] = bob_ross.index.values
    colors = colors.set_index('EPISODE')
    # figure out if we have the image or not, we don't have a complete set
    for s in bob_ross.index.values:
        b = bob_ross.loc[s]['TITLE']
        b = b.lower()
        b = re.sub(r'[^a-z0-9\s]', '',b)
        b = re.sub(r'\s', '_',b)
        img = b+".png"
        if (os.path.exists("/dbfs/mnt/umsi-data-science/si618wn2017/images/"+img)):
            bob_ross.set_value(s,"image","/dbfs/mnt/umsi-data-science/si618wn2017/images/"+img)
            t = getColors("/dbfs/mnt/umsi-data-science/si618wn2017/images/"+img)
            colors.loc[s] = t
    bob_ross = bob_ross.join(colors)
    bob_ross = bob_ross[bob_ross.image != ""]
Here is how I implemented it in Spark; I am stuck on the line that raises the error:
^{pr2}$
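You should use Spark's filter functions rather than operating-system functions.
For example, os.path.exists is an ordinary Python function that expects a string path and runs only against the local file system, whereas col("url") is a Column expression that Spark evaluates across many servers. Getting a type error like this is a sign that you are not using the right function.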
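One way to reconcile the two, offered as a minimal sketch rather than the definitive fix, is to wrap the file check in a UDF so that Spark hands it one string per row instead of a Column. The helper names (title_to_image_path, path_exists, keep_rows_with_images) are hypothetical, and the sketch assumes the image directory from the question is readable at the same path on every worker (as a /dbfs mount typically is on Databricks).

```python
import os
import re

# Assumption: image directory copied from the question; it must be visible
# at this path on every executor, not just on the driver.
IMAGE_DIR = "/dbfs/mnt/umsi-data-science/si618wn2017/images/"


def title_to_image_path(title, image_dir=IMAGE_DIR):
    """Hypothetical helper: map an episode title to its expected image path."""
    if title is None:
        return None
    name = re.sub(r"[^a-z0-9\s]", "", title.lower())  # drop punctuation
    name = re.sub(r"\s", "_", name)                   # whitespace -> underscore
    return image_dir + name + ".png"


def path_exists(path):
    """Plain-Python check; inside a UDF it receives a str (or None) per row."""
    return path is not None and os.path.exists(path)


def keep_rows_with_images(df, title_col="TITLE"):
    """Add an 'image' column and keep only rows whose image file exists."""
    # Imported lazily so the pure-Python helpers work without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType, StringType

    to_path = udf(title_to_image_path, StringType())
    exists = udf(path_exists, BooleanType())
    return (
        df.withColumn("image", to_path(col(title_col)))
          .filter(exists(col("image")))
    )
```

Calling keep_rows_with_images(spark_df), where spark_df is a Spark DataFrame with a TITLE column, would then play the role of the final pandas filter on bob_ross.image. The check runs on the executors, so it only gives meaningful results if each worker sees the same files.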