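<p>I'm going into a bit of depth for you here because I assume you're doing this in the name of science, and if I can help someone trying to understand climate change, that's a good cause.</p>
<p>After looking at the data, I noticed that the problem comes from the data being stored in a denormalized structure. Off the top of my head, there are two ways you could solve this. I'll show rewriting the file into another structure to load into pandas or Dask, since that's probably the easiest way to think about it (but definitely not the most efficient way, for those who will inevitably roast me in the comments).</p>
<p>Think of it as two separate tables with a one-to-many relationship: one table of typhoons, and another table of the data recorded for each typhoon.</p>
<p>A decent, but not really efficient, way to handle this is to rewrite it into a better nested structure, like JSON, and then load the data from that. Note the two different types of columns.</p>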
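<p>As a rough sketch of the target shape, each typhoon becomes one dict whose <code>data</code> list holds its records. The field names and values below are invented for illustration and are not the file's real column layout:</p>

```python
# Hypothetical illustration of the nested, one-to-many shape.
# Keys and values are placeholders, not the real column layout.
collection = [
    {
        "name": "TYPHOON_A",          # one row from the "typhoon" table
        "data": [                     # many rows from the "typhoon data" table
            {"time": "T1", "pressure": "994"},
            {"time": "T2", "pressure": "990"},
        ],
    },
    {
        "name": "TYPHOON_B",
        "data": [{"time": "T1", "pressure": "1002"}],
    },
]

# Count the child records that belong to each parent
record_counts = {t["name"]: len(t["data"]) for t in collection}
print(record_counts)
```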
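<p><strong>Step 1: map out the data</strong></p>
<p>There are two tables in one table here. Each typhoon appears as a row like this:<br/>
<code>66666 9119 150 0045 9119 0 6 MIRREILE 19920701</code></p>
<p>And the records for that typhoon follow that row (think of each one as a separate row):<br/>
<code>20080100 002 3 178 1107 994 035 00000 0000 30600 0200 </code></p>
<p>Load the file and read it in as raw lines. Using the <a href="https://www.tutorialspoint.com/python3/file_readlines.htm" rel="nofollow noreferrer">.readlines() method</a>, we can read each line of the file in as an item in a list.</p>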
<pre class="lang-py prettyprint-override"><code># load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()
</code></pre>
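<p>To see what <code>.readlines()</code> hands back without needing the real file, here is a small sketch that uses an in-memory stand-in (the content is made up). Note that each item keeps its trailing newline, which is why the fields get stripped later on:</p>

```python
import io

# Stand-in for the real file; the content is made up for this sketch
fake_file = io.StringIO("66666 header line\n20080100 record line\n")

lines = fake_file.readlines()
# Each item is one line, and the trailing newline is kept
print(lines)

# Iterating the file object directly yields the same lines one at a time,
# without holding the whole file in memory
fake_file.seek(0)
streamed = [line for line in fake_file]
```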
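<p>Now that we've read it in, we need some logic to separate one kind of line from the other. It looks like every typhoon header line starts with "66666", so let's key off of that. Given that we're looking at each line in a horribly inefficient loop, we can write some if/else logic to check for it:</p>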
<pre class="lang-py prettyprint-override"><code>if row[:5] == '66666':
    pass  # do stuff: this line starts a new typhoon
else:
    pass  # do other stuff: this line is a record for the current typhoon
</code></pre>
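<p>As a quick check of that test against the two sample rows shown earlier, here is a minimal sketch. The helper name <code>is_typhoon_header</code> is made up for this example:</p>

```python
def is_typhoon_header(row: str) -> bool:
    """Return True when the row starts with the '66666' header marker."""
    return row.startswith("66666")

# The two sample rows shown earlier in this answer
header = "66666 9119 150 0045 9119 0 6 MIRREILE 19920701"
record = "20080100 002 3 178 1107 994 035 00000 0000 30600 0200 "

for row in (header, record):
    kind = "typhoon header" if is_typhoon_header(row) else "data record"
    print(kind, "->", row[:8])
```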
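<p>This is a pretty reliable way to tell the two kinds of line apart, and it will guide the splitting logic. Next, we need a loop that runs this check against every line:</p>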
<pre class="lang-py prettyprint-override"><code>from typing import Dict, List

# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: List[Dict]) -> List[Dict]:
    if row[:5] == '66666':
        pass  # do stuff
    else:
        pass  # do other stuff
    return collection

# read through lines list from the .readlines(), looping sequentially
for line in lines:
    write_typhoon(line, collection)
</code></pre>
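<p>Lastly, we need some logic inside the if/else branches of the write_typhoon() function to actually extract the data. I didn't want to do a lot of thinking here, so I went with the simplest thing I could: defining the fixed-width format (fwf) metadata myself. Because "yolo":</p>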
<pre class="lang-py prettyprint-override"><code>from typing import Dict, List

def write_typhoon(row: str, collection: List[Dict]) -> List[Dict]:
    if row[:5] == '66666':
        # header row: start a new typhoon with an empty list of records
        typhoon = {
            "AA":row[:5],
            "BB":row[6:11],
            "CC":row[12:15],
            "DD":row[16:20],
            "EE":row[21:25],
            "FF":row[26:27],
            "GG":row[28:29],
            "HH":row[30:50],
            "II":row[51:],
            "data":[]
        }
        # clean that whitespace
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        # data row: belongs to the most recently added typhoon
        sub_data = {
            "A":row[:9],
            "B":row[9:12],
            "C":row[13:14],
            "D":row[15:18],
            "E":row[19:23],
            "F":row[24:32],
            "G":row[33:40],
            "H":row[41:42],
            "I":row[42:46],
            "J":row[47:51],
            "K":row[52:53],
            "L":row[54:57],
            "M":row[58:70],
            "P":row[71:]
        }
        # clean that whitespace
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection
</code></pre>
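<p>If hard-coding the slices inline feels too repetitive, one alternative is to keep the offsets in a small table and slice in a loop. This is only a sketch: <code>HEADER_SPEC</code> and <code>slice_fields</code> are names invented here, the spec covers just the first three header fields from the function above, and the sample row is made up to fit those offsets:</p>

```python
from typing import Dict, Tuple

# Hypothetical spec: field name -> (start, stop) character offsets.
# The offsets mirror the first few header slices used above.
HEADER_SPEC: Dict[str, Tuple[int, int]] = {
    "AA": (0, 5),
    "BB": (6, 11),
    "CC": (12, 15),
}

def slice_fields(row: str, spec: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Cut a fixed-width row into stripped string fields according to spec."""
    return {name: row[start:stop].strip() for name, (start, stop) in spec.items()}

# Made-up row laid out to match HEADER_SPEC
sample = "66666  9119 150"
fields = slice_fields(sample, HEADER_SPEC)
print(fields)
```

<p>The upside of this layout is that adjusting a column boundary means editing one tuple rather than hunting through the function body.</p>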
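<p>Alright, that took me longer than I'm willing to admit, I won't lie. It gave me PTSD flashbacks of writing COBOL programs.</p>
<p>Anyway, now we have a nice nested data structure made of native Python types. The fun can begin!</p>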
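<p>Because the structure is just lists, dicts, and strings, it can be serialized to JSON with the standard library if you want to keep it around. A minimal sketch, using a tiny made-up stand-in for the parsed collection:</p>

```python
import json

# Tiny stand-in for the parsed collection; values are made up
collection = [
    {"AA": "66666", "HH": "EXAMPLE", "data": [{"A": "20080100", "B": "002"}]},
]

# Serialize to a JSON string (json.dump(collection, f) would write to a file)
text = json.dumps(collection, indent=2)

# Loading it back gives the same nested structure
restored = json.loads(text)
print(restored == collection)
```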
<p><strong>Step 2: load it into a usable format</strong></p>
<p>To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if it's too big). Here's what I came up with for that:</p>
<pre class="lang-py prettyprint-override"><code>import pandas as pd
df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
</code></pre>
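<p>To make the reshaping concrete, here is a plain-Python sketch of roughly what that call produces for this data: one flat row per record, with the parent typhoon's fields repeated on each row. It's an illustration only, not how pandas implements it, and the <code>flatten</code> helper and sample values are made up:</p>

```python
from typing import Dict, List

def flatten(collection: List[Dict], meta: List[str]) -> List[Dict]:
    """Emit one flat dict per child record, copying the parent's meta fields."""
    rows = []
    for typhoon in collection:
        for record in typhoon["data"]:
            row = dict(record)                                # child fields first
            row.update({key: typhoon[key] for key in meta})   # then parent fields
            rows.append(row)
    return rows

# Made-up stand-in for the parsed collection
collection = [
    {"AA": "66666", "HH": "EXAMPLE", "data": [{"A": "t1"}, {"A": "t2"}]},
]
rows = flatten(collection, meta=["AA", "HH"])
print(rows)
```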
<p>A good reference can be found in the answers to this <a href="https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe">question</a> (particularly the second one, not the selected one).</p>
<p><strong>Now putting it all together:</strong></p>
<pre class="lang-py prettyprint-override"><code>from typing import Dict, List
import pandas as pd

# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()

# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: List[Dict]) -> List[Dict]:
    if row[:5] == '66666':
        # header row: start a new typhoon with an empty list of records
        typhoon = {
            "AA":row[:5],
            "BB":row[6:11],
            "CC":row[12:15],
            "DD":row[16:20],
            "EE":row[21:25],
            "FF":row[26:27],
            "GG":row[28:29],
            "HH":row[30:50],
            "II":row[51:],
            "data":[]
        }
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        # data row: belongs to the most recently added typhoon
        sub_data = {
            "A":row[:9],
            "B":row[9:12],
            "C":row[13:14],
            "D":row[15:18],
            "E":row[19:23],
            "F":row[24:32],
            "G":row[33:40],
            "H":row[41:42],
            "I":row[42:46],
            "J":row[47:51],
            "K":row[52:53],
            "L":row[54:57],
            "M":row[58:70],
            "P":row[71:]
        }
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection

# read through file sequentially
for line in lines:
    write_typhoon(line, collection)

# load to pandas df using json_normalize
df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
print(df.head(20)) # let's see what we've got!
</code></pre>