<p>In <strong><code>Python</code></strong>, we can use the regular-expression <strong><code>split</code></strong> function (<code>re.split</code>) to split the data on inconsistent whitespace:</p>
<pre><code>import re

# \s+ matches one or more whitespace characters, so a run of spaces acts as a single separator
re.split(r"\s+", 'a b c')
# ['a', 'b', 'c']
</code></pre>
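<p>One caveat worth checking on real data: if a line starts or ends with whitespace, <code>re.split</code> returns empty strings at the edges. A minimal sketch, where the sample <code>line</code> is made up for illustration:</p>

```python
import re

# Made-up sample line with leading/trailing whitespace and mixed separators
line = "  a   b\tc  "

# re.split keeps an empty string wherever the input starts or ends with a separator
print(re.split(r"\s+", line))          # ['', 'a', 'b', 'c', '']

# Stripping first removes those empty edge fields
print(re.split(r"\s+", line.strip()))  # ['a', 'b', 'c']

# str.split() with no arguments also splits on whitespace runs and ignores the edges
print(line.split())                    # ['a', 'b', 'c']
```

<p>If the separator is always plain whitespace, <code>str.split()</code> is probably the simpler choice; <code>re.split</code> is more useful when the pattern needs to be customised.</p>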
<p><strong><code>In PySpark:</code></strong></p>
<pre><code>#sample data
$ cat i.txt
one two three four five
six seven eight nine ten
</code></pre>
<hr/>
<pre><code>import re

cols = ["col1", "col2", "col3", "col4", "col5"]

# Create a DataFrame from the file with inconsistent spaces:
# read each line as text, split it on runs of whitespace, then name the columns.
spark.sparkContext.textFile("<file_path>/i.txt").map(lambda x: re.split("\\s+", x)).toDF(cols).show()
#+----+-----+-----+----+----+
#|col1| col2| col3|col4|col5|
#+----+-----+-----+----+----+
#| one|  two|three|four|five|
#| six|seven|eight|nine| ten|
#+----+-----+-----+----+----+
</code></pre>
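<p><code>toDF(cols)</code> expects every row to have as many fields as there are column names, so a blank or malformed line can break the job. The sketch below is one possible guard, not part of the original answer: <code>parse_line</code> is a hypothetical helper that strips the line, splits it on whitespace, and returns <code>None</code> when the field count is wrong. The Spark usage is left as a comment because it assumes an existing <code>spark</code> session.</p>

```python
import re

def parse_line(line, n_cols):
    """Hypothetical helper: split on whitespace runs; return None if the field count is wrong."""
    fields = re.split(r"\s+", line.strip())
    return fields if len(fields) == n_cols else None

cols = ["col1", "col2", "col3", "col4", "col5"]

# A well-formed line with messy spacing parses into five fields
print(parse_line("  one   two three\tfour five ", len(cols)))

# A short line is rejected
print(parse_line("six seven", len(cols)))

# Possible use with the RDD above (assumes an existing `spark` session):
# rdd = spark.sparkContext.textFile("<file_path>/i.txt")
# rdd.map(lambda x: parse_line(x, len(cols))) \
#    .filter(lambda r: r is not None) \
#    .toDF(cols).show()
```

<p>Dropping bad rows silently is a design choice; depending on the data it may be better to count or log them instead.</p>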