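<p>Note that <code>pd.read_csv()</code> raises an error when it reads a csv with a variable number of columns, unless you supply the column names up front. This should do it:</p>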
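<pre><code>import pandas as pd
import numpy as np

# Naming all six columns up front lets read_csv accept rows with fewer fields
df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|')
# Blank out every value greater than 2 so it is left out of the sum
df[df[['Val1','Val2','Val3','Val4']]>2] = np.nan
# Row-wise sum of the value columns; NaN entries are skipped
df['Final'] = df.iloc[:,2:].sum(axis=1)
df = df[['Index','Date','Final']]
</code></pre>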
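<p>This leaves <code>df</code> with just the <code>Index</code>, <code>Date</code> and <code>Final</code> columns.</p>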
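<p>To see why the column names matter, here is a small sketch on in-memory data. The sample rows are made up for illustration and are not taken from the question's file; <code>raw</code> and <code>ragged_ok</code> are names chosen for this example only.</p>

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample; rows carry between one and four values.
raw = (
    "323|2013-06-03 00:00:00|0.5\n"
    "323|2013-06-03 01:00:00|1.0|0.5|3.0\n"
    "323|2013-06-03 02:00:00|1.5|2.5|0.5|1.0\n"
)

# Without names, the parser sizes the table from the first line
# and fails as soon as it meets a longer row.
try:
    pd.read_csv(io.StringIO(raw), sep="|", header=None)
    ragged_ok = True
except pd.errors.ParserError:
    ragged_ok = False

# With all six names supplied, short rows are padded with NaN instead.
names = ["Index", "Date", "Val1", "Val2", "Val3", "Val4"]
df = pd.read_csv(io.StringIO(raw), sep="|", names=names)
print(ragged_ok)
print(df)
```

<p>The list of names has to cover the widest row in the file, so it helps to know the maximum number of fields beforehand.</p>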
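<p>Here is a more concise approach (it is very similar to @Scott Boston's answer below, but avoids creating a separate DataFrame). Setting the first two columns of the csv as the DataFrame's index leaves only float values in the rest of the DataFrame, so it can be filtered with a single condition:</p>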
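<pre><code># Index and Date go into the index, so only the float value columns remain as data
df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|').set_index(['Index','Date'])
# Keep values in (0, 2]; everything else becomes NaN and is skipped by sum
df['Final'] = df[(df>0) & (df<=2)].sum(axis=1)
# Write Index, Date and Final without a header row
df.reset_index()[['Index','Date','Final']].to_csv('output.csv', index=False, header=False)
</code></pre>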
<p>The resulting <code>output.csv</code> contains:</p>
<pre><code>323,2013-06-03 00:00:00,0.0
323,2013-06-03 01:00:00,1.0
323,2013-06-03 02:00:00,1.5
323,2013-06-03 03:00:00,1.5
323,2013-06-03 04:00:00,0.0
323,2013-06-03 05:00:00,0.5
323,2013-06-03 06:00:00,0.0
323,2013-06-03 07:00:00,3.5
323,2013-06-03 08:00:00,0.5
</code></pre>
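<p>Rows with a <code>Final</code> of 0.0 are most likely rows where no value passed the filter. A minimal sketch of that behaviour, using made-up values rather than the question's data (the frame contents and the names <code>mask</code>, <code>kept</code> and <code>final</code> are assumptions for this example):</p>

```python
import numpy as np
import pandas as pd

# Hypothetical values, indexed the same way as in the answer.
idx = pd.MultiIndex.from_tuples(
    [
        (323, "2013-06-03 00:00:00"),
        (323, "2013-06-03 01:00:00"),
        (323, "2013-06-03 02:00:00"),
    ],
    names=["Index", "Date"],
)
df = pd.DataFrame(
    {
        "Val1": [3.0, 1.0, 1.5],
        "Val2": [np.nan, 0.5, 2.5],
        "Val3": [np.nan, -1.0, 0.5],
    },
    index=idx,
)

# Boolean mask: True only where 0 < value <= 2 (NaN compares as False).
mask = (df > 0) & (df <= 2)

# Indexing with the mask keeps the shape and replaces rejected cells with NaN.
kept = df[mask]

# sum skips NaN by default, so a row with nothing left sums to 0.0.
final = kept.sum(axis=1)
print(final)
```

<p>If an empty row should yield NaN instead of 0.0, passing <code>min_count=1</code> to <code>sum</code> is one way to get that.</p>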