Removing inconsistent spaces in columns

User

Reply #1 · edited 2024-09-29 21:22:49

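As described in the linked question "How to change tab delimited in to comma delimited in pandas", you can set the separator to None, which lets pandas detect it, or to the specific character used in your text. For example: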

pd.read_csv(filename, sep=None, engine="python")  # only the python engine can detect the separator

or

file = pd.read_csv(filename, sep="\t")

Feel free to check the documentation as well, since it may give you some further hints: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
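Neither call above targets runs of spaces directly. One option this reply does not mention is a regular-expression separator. The following is only a minimal sketch, assuming the file is whitespace-delimited with no header row; the sample text and the column names are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file: the columns are
# separated by a varying number of spaces.
raw = "one two   three   four   five\nsix    seven    eight nine ten\n"

df = pd.read_csv(
    io.StringIO(raw),   # replace with the real file path
    sep=r"\s+",         # treat any run of whitespace as one delimiter
    header=None,        # assumption: the file has no header row
    names=["col1", "col2", "col3", "col4", "col5"],
)
print(df)
```

This only works when the values themselves contain no spaces; otherwise a fixed-width reader is likely a better fit.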

User

Reply #2 · edited 2024-09-29 21:22:49

In Python, we can use a regular-expression split to break the data apart on the inconsistent spaces:

import re
re.split("\\s+",'a b   c')
['a', 'b', 'c']
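One caveat with this approach: if a line starts or ends with whitespace, re.split returns an empty string as the first or last field. A small sketch of how that might be handled, assuming each line is one record; the helper name split_columns and the sample lines are made up for illustration.

```python
import re


def split_columns(line):
    """Split one line on runs of whitespace, ignoring leading/trailing blanks."""
    # strip() first: without it, re.split yields an empty first or last
    # field when the line begins or ends with whitespace.
    return re.split(r"\s+", line.strip())


# Hypothetical sample lines; the second one has padding on both ends.
sample_lines = [
    "one two   three   four   five",
    "  six    seven    eight nine ten  ",
]
rows = [split_columns(line) for line in sample_lines]
print(rows)
```

For plain whitespace, line.split() with no arguments gives the same fields without a regular expression.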

In PySpark:

#sample data
$ cat i.txt
one two   three   four   five
six    seven    eight nine ten

cols=["col1","col2","col3","col4","col5"]
spark.sparkContext.textFile("<file_path>/i.txt").map(lambda x:re.split("\\s+",x)).toDF(cols).show()

#creating dataframe on the file with inconsistent spaces.
#+----+-----+-----+----+----+
#|col1| col2| col3|col4|col5|
#+----+-----+-----+----+----+
#| one|  two|three|four|five|
#| six|seven|eight|nine| ten|
#+----+-----+-----+----+----+

User

Reply #3 · edited 2024-09-29 21:22:49

This file format is called a fixed-width file. pandas has a function dedicated to reading such files: pandas.read_fwf

By default, pandas infers the width of each column. If the inference gives wrong results, look into the optional colspecs parameter.
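To illustrate the colspecs parameter, here is a hedged sketch that assumes a file whose columns start at fixed character offsets. The sample data, the offsets, and the column names are all hypothetical and must be adapted to the real file.

```python
import io

import pandas as pd

# Hypothetical fixed-width sample: columns begin at offsets 0, 5 and 11.
lines = [
    "one  two   three",
    "six  seven eight",
]
raw = "\n".join(lines) + "\n"

df = pd.read_fwf(
    io.StringIO(raw),                      # replace with the real file path
    colspecs=[(0, 5), (5, 11), (11, 16)],  # half-open [start, end) character ranges
    header=None,                           # assumption: no header row
    names=["col1", "col2", "col3"],
)
print(df)
```

Each colspecs entry is a half-open interval, so (0, 5) covers characters 0 through 4. The padding spaces inside each slice are stripped from the parsed values.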

You can then convert the resulting pandas.DataFrame to a PySpark DataFrame with:

spark.createDataFrame(pandas_df)

as documented by PySpark.
