Arranging data in columns using awk

Posted 2024-10-03 02:39:13


I have 300 .out output files that I need to extract data from. Typically, the data is stored in them like this:

PROPERTY 1:   1234
lines 
of 
unimportant text
PROPERTY 2: 1334
lines 
of 
unimportant text
PROPERTY 3: 1237
.
.
.
PROPERTY N: 7592

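I have 300 such files.

I want to extract the data from these files and arrange it in neat columns: one column holding all data points for property 1, one column for property 2, ..., one column for property N. The end goal is to process the data further with Python and pandas.

I am using awk to extract the data.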

I have two approaches, but each has a problem. Approach one:

awk '/PROPERTY 1/{p1=$NF; } /PROPERTY 2/{p2=$NF} /PROPERTY 3/... {pn=$NF; print p1, p2, p3,...}' *.out

This approach has two problems:

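I can extract the individual data points and store them in a file, but it makes for a very long program. Also, if the positions of property 1 and property 2 are swapped, this code gives the wrong output, i.e. property 1 from outputfile1.out shows up on line 2 instead of line 1. How can I make it robust against that?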

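My second approach is to simply write the properties to separate files and join them together with Python. Is there a way to take a column from file 1 and paste it next to the columns in file 2 using awk?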

Sample input files:

first.out:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
PROPERTY 1:    1234

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit
PROPERTY 2:    9800

At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.

PROPERTY 4:   823586

On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain.

PROPERTY 3:   328497
.
.
.

second.out:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
PROPERTY 1:    1

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit
PROPERTY 2:    2

At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.

PROPERTY 3:   3

On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain.

PROPERTY 4:   4
.
.
.

Every file will contain all of the properties.

Expected output file: data.txt

1234  9800  823586  328497 ...
1  2  3  4 
.
.
.  

I am trying to optimize my code, and awk seems to be fast. Any suggestions would be appreciated.


2 Answers

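Using GNU awk for ENDFILE, and assuming you have a specific subset of property tags that you want printed and that not every tag necessarily appears in every file (your posted example isn't clear on that, nor on whether all properties start with PROPERTY, etc.):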

$ cat tst.awk
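BEGIN {
    # The tags to print, in the desired column order.
    numTags = split("PROPERTY 1,PROPERTY 2,PROPERTY 3,PROPERTY 4",tags,/,/)
}
{
    # Use everything before the first colon as the tag and
    # remember the last field on the line as its value.
    tag = $0
    sub(/:.*/,"",tag)
    f[tag] = $NF
}
ENDFILE {
    # After each input file, print one row in tag order, then reset.
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = f[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
    delete f
}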

$ awk -f tst.awk first second
1234 9800 328497 823586
1 2 3 4
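
Since the end goal is Python, the same idea can be sketched there too: remember the last value seen for each tag, then emit the values in a fixed tag order so the order inside the file no longer matters. This is only a rough equivalent, not a translation of the awk script; the function name `extract_row` and the `TAGS` list are made up for illustration, and unlike the awk version it skips lines that contain no colon.

```python
def extract_row(lines, tags):
    """Return the values for `tags`, in that order, from one file's lines."""
    found = {}
    for line in lines:
        tag, sep, rest = line.partition(":")
        if sep and rest.split():
            # Similar to f[tag] = $NF in the awk script: keep the last field.
            found[tag.strip()] = rest.split()[-1]
    # A tag that never appeared yields an empty string.
    return [found.get(tag, "") for tag in tags]


# Hypothetical tag list; adjust to the properties you actually need.
TAGS = ["PROPERTY 1", "PROPERTY 2", "PROPERTY 3", "PROPERTY 4"]

# PROPERTY 4 deliberately comes before PROPERTY 3, as in the first sample file.
sample = """Lorem ipsum
PROPERTY 1:    1234
PROPERTY 2:    9800
PROPERTY 4:   823586
PROPERTY 3:   328497
"""
row = extract_row(sample.splitlines(), TAGS)
print(" ".join(row))  # prints: 1234 9800 328497 823586
```

Because the row is built from `TAGS` rather than from the order of appearance, swapping two properties in a file does not change which column a value lands in.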

I would parse it line by line:

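import re

# Matches lines such as "PROPERTY 3:   328497". Group 1 is the property
# number, group 2 the value without surrounding whitespace. The pattern
# does not require a trailing newline, so a final unterminated line matches.
RE_PROPERTY = re.compile(r"^PROPERTY\s*([0-9]+)\s*:\s*(.*?)\s*$")

columns = {}

with open("data.out", "r") as f:
    for line in f:  # iterate lazily instead of reading the whole file
        m = RE_PROPERTY.match(line)
        if m:
            key = f"PROPERTY {m.group(1)}"
            # Collect one list of values per property name.
            columns.setdefault(key, []).append(m.group(2))

print(columns)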
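
The snippet above reads a single file. One possible way to extend it to many files and produce the one-row-per-file layout from the question is sketched below. The helper names (`read_properties`, `build_table`, `write_table`) and the `NaN` placeholder for a missing property are assumptions made for this example, and it assumes the values themselves contain no spaces.

```python
import re

RE_PROPERTY = re.compile(r"^PROPERTY\s*([0-9]+)\s*:\s*(.*?)\s*$")


def read_properties(path):
    """Map property number -> value string for a single file."""
    props = {}
    with open(path, "r") as f:
        for line in f:
            m = RE_PROPERTY.match(line)
            if m:
                props[int(m.group(1))] = m.group(2)
    return props


def build_table(paths):
    """Return (sorted property numbers, one row of values per file)."""
    per_file = [read_properties(p) for p in paths]
    # Columns are every property number seen in any file, in numeric order.
    numbers = sorted(set().union(*per_file))
    # "NaN" is an assumed placeholder for a property missing from a file.
    rows = [[props.get(n, "NaN") for n in numbers] for props in per_file]
    return numbers, rows


def write_table(paths, out_path="data.txt"):
    """Write one space-separated row per input file, without a header."""
    _, rows = build_table(paths)
    with open(out_path, "w") as out:
        for row in rows:
            out.write(" ".join(row) + "\n")
```

Calling `write_table(sorted(glob.glob("*.out")))` (after `import glob`) in the directory holding the files would write `data.txt`. A whitespace-separated file like this can then be loaded with `pandas.read_csv("data.txt", sep=r"\s+", header=None)`.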
