<p>Here I follow a general approach in which you do not have to hard-code the column spans. To get a DataFrame back, you can use <code>pd.read_csv</code> together with <code>StringIO</code>. Please adjust <code>path</code> to your file location. The code below is extended from yours to keep it easy to follow; there are more efficient ways to write the same logic.</p>
<pre><code>import re
import pandas as pd
from io import StringIO

path = "/home/clik/clik/demo.txt"
EndStr = " "
FilterStr = "=================="  # separator line under the header
FindStr = "empcode Emnname"

def match(sp1, sp2):
    # Score how well two (start, end) spans line up: positive for
    # overlapping spans, negative (and decreasing with distance) otherwise.
    disjunct = max(sp1[0] - sp2[1], sp2[0] - sp1[1])
    if disjunct >= 0:
        # disjoint spans: penalise by the distance between their centres
        return -abs((sp1[0] + sp1[1]) / 2.0 - (sp2[0] + sp2[1]) / 2.0)
    # overlapping spans: fraction of the smaller extent covered by the overlap
    return float(disjunct) / min(sp1[0] - sp2[1], sp2[0] - sp1[1])

def PrepareList():
    with open(path) as f:
        out = []
        for line in f:
            if line.rstrip().startswith(FindStr):
                tmp = []
                # the header line defines the column spans
                col_spans = [m.span() for m in re.finditer(r"\S+", line)]
                tmp.append(re.sub(r"\s+", ",", line.strip()))
                for line in f:
                    if line.rstrip().startswith(EndStr):
                        out.append(tmp)
                        break
                    row = [None] * len(col_spans)
                    for m in re.finditer(r"\S+", line):
                        # assign each token to the best-matching header column
                        colmatches = [match(m.span(), cspan) for cspan in col_spans]
                        max_index = max(enumerate(colmatches), key=lambda e: e[1])[0]
                        row[max_index] = m.group() if row[max_index] is None else (row[max_index] + ' ' + m.group())
                    tmp.append(','.join('NA' if e is None else e for e in row))
                # for a pandas DataFrame:
                # return pd.read_csv(StringIO('\n'.join(tmp)))
                # for a list of tuples:
                return [tuple(r.split(',')) for r in tmp]
                # for a list of CSV strings:
                # return tmp

LstEmp = PrepareList()
</code></pre>
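<p>To see how the span matching assigns tokens to columns, here is a minimal, self-contained sketch (the header and data line are made-up examples, not from your file): each token of a data line goes to the header column whose span scores highest, so a multi-word value such as <code>John Doe</code> stays in one column.</p>

```python
import re

def match(sp1, sp2):
    # positive score for overlapping spans, negative for disjoint ones
    disjunct = max(sp1[0] - sp2[1], sp2[0] - sp1[1])
    if disjunct >= 0:
        return -abs((sp1[0] + sp1[1]) / 2.0 - (sp2[0] + sp2[1]) / 2.0)
    return float(disjunct) / min(sp1[0] - sp2[1], sp2[0] - sp1[1])

header = "empcode Emnname   dept"
data   = "E001    John Doe  HR"

# column spans come from the header line
col_spans = [m.span() for m in re.finditer(r"\S+", header)]
row = [None] * len(col_spans)
for m in re.finditer(r"\S+", data):
    scores = [match(m.span(), span) for span in col_spans]
    best = max(range(len(scores)), key=scores.__getitem__)
    row[best] = m.group() if row[best] is None else row[best] + ' ' + m.group()

print(row)  # "John" and "Doe" both overlap the Emnname span, so they merge
```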
<p>To convert the list of tuples into a PySpark DataFrame, there is a tutorial here: <a href="http://bigdataplaybook.blogspot.in/2017/01/create-dataframe-from-list-of-tuples.html" rel="nofollow noreferrer">http://bigdataplaybook.blogspot.in/2017/01/create-dataframe-from-list-of-tuples.html</a></p>