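<p>If you want to parse your own input format, you have to make some assumptions. For this code example I assume that the "h1" marker only appears between complete sets of three lines. If it may also appear in the middle of a set, the code needs to be slightly different.</p>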
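<p>The idea:</p>
<ul>
<li><p>Write a generator function that loops over the text and yields each complete record as a dict.</p></li>
<li><p>Collect all of them.</p></li>
<li><p>Since you tagged the question "pandas", move the result into a pandas DataFrame.</p></li>
</ul>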
<p>Here is a working example.</p>
<pre><code>import pandas as pd

example_text = """NameA
WorkplaceA
And a abstractA
NameB
WorkplaceB
And a abstractB
<h1>
NameC
WorkplaceC
And a abstractC"""


def next_name(mystr):
    """Yield one dict per three-line record, skipping '<h1>' separator lines."""
    lines = iter(mystr.split('\n'))
    while True:
        n = {'NameCol': None,
             'WorkplaceCol': None,
             'AbstractCol': None}
        try:
            n['NameCol'] = next(lines)
            if n['NameCol'] == '<h1>':
                continue  # separator between records: start a new record
            n['WorkplaceCol'] = next(lines)
            if n['WorkplaceCol'] == '<h1>':
                continue  # separator inside a record: the partial record is dropped
            n['AbstractCol'] = next(lines)
            if n['AbstractCol'] == '<h1>':
                continue
            yield n
        except StopIteration:
            # Input exhausted; an incomplete trailing record is dropped
            break


df = pd.DataFrame(next_name(example_text),
                  columns=['NameCol', 'WorkplaceCol', 'AbstractCol'])
print(df)
</code></pre>
<p>The DataFrame prints as</p>
<pre><code>  NameCol WorkplaceCol      AbstractCol
0   NameA   WorkplaceA  And a abstractA
1   NameB   WorkplaceB  And a abstractB
2   NameC   WorkplaceC  And a abstractC
</code></pre>
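<p>If the "h1" marker may also show up in the middle of a record, one possible variation is to discard every marker line first and only then cut the remaining lines into groups of three. The following is a minimal sketch under that assumption; the function name <code>iter_records</code> and its <code>separator</code> and <code>size</code> parameters are made up for this illustration and are not part of the code above.</p>

```python
def iter_records(text, separator='<h1>', size=3):
    """Yield tuples of `size` lines, ignoring separator lines wherever they occur.

    Hypothetical helper: assumes every record has exactly `size` lines once
    the separator lines are removed.
    """
    kept = [line for line in text.split('\n') if line != separator]
    # Walk the kept lines in fixed-size chunks; a short trailing chunk is dropped.
    for start in range(0, len(kept) - size + 1, size):
        yield tuple(kept[start:start + size])


# Sample input with the marker in the middle of the first record
sample = """NameA
WorkplaceA
<h1>
And a abstractA
NameB
WorkplaceB
And a abstractB"""

records = list(iter_records(sample))
print(records)
```

<p>The resulting tuples can then be handed to pandas in the same way as before, for example <code>pd.DataFrame(list(iter_records(example_text)), columns=['NameCol', 'WorkplaceCol', 'AbstractCol'])</code>.</p>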
<p>If you need to print the DataFrame the way your example shows it, here is some sample code.</p>
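<pre><code># Header row: column names, each followed by a tab
print(''.join(f'{x}\t' for x in df.columns))
print()
# iterrows() yields (index, row) pairs; only the row values are printed
for _, row in df.iterrows():
    print(''.join(f'{x}\t' for x in row))
</code></pre>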
<p>Output</p>
<pre><code>NameCol WorkplaceCol AbstractCol
NameA WorkplaceA And a abstractA
NameB WorkplaceB And a abstractB
NameC WorkplaceC And a abstractC
</code></pre>
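<p>Note: I am using Python 3.6. The f-strings in the print calls require 3.6 or later, so on an older version you will need to change the print statements.</p>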
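<p>For interpreters older than 3.6, the same layout can be built without f-strings. This is only a sketch: <code>format_rows</code> is a hypothetical helper, the sample rows are hard-coded so that the snippet runs without pandas, and unlike the loop above it does not emit a trailing tab at the end of each line.</p>

```python
def format_rows(columns, rows):
    """Return a tab-separated header, a blank line, then one line per row."""
    lines = ['\t'.join(columns), '']
    lines.extend('\t'.join(str(value) for value in row) for row in rows)
    return '\n'.join(lines)


# Hard-coded stand-ins for df.columns and the DataFrame rows
columns = ['NameCol', 'WorkplaceCol', 'AbstractCol']
rows = [
    ('NameA', 'WorkplaceA', 'And a abstractA'),
    ('NameB', 'WorkplaceB', 'And a abstractB'),
    ('NameC', 'WorkplaceC', 'And a abstractC'),
]

# print with a single parenthesized argument behaves the same on Python 2 and 3
print(format_rows(columns, rows))
```

<p>With the DataFrame from above, the call would be <code>format_rows(df.columns, df.itertuples(index=False))</code>.</p>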
<p>By comparison, here is how it can be done with pandas alone (using <code>example_text</code> from the code above):</p>
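<pre><code># One text line per row, in a single column labelled 0
df = pd.DataFrame(example_text.split('\n'))
# Drop the separator lines and renumber the remaining rows from 0
df = df[df[0] != '<h1>'].reset_index(drop=True)
# Every three consecutive lines belong to the same record
df['row'] = df.index // 3
# Collect the lines of each record into a list
result = df.groupby('row')[0].agg(list).values
print('\t'.join(["NameCol", "WorkplaceCol", "AbstractCol"]))
print('')
print('\n'.join(['\t'.join(x) for x in result]))
</code></pre>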
<p>This produces the same output.</p>
<pre><code>NameCol WorkplaceCol AbstractCol
NameA WorkplaceA And a abstractA
NameB WorkplaceB And a abstractB
NameC WorkplaceC And a abstractC
</code></pre>