Count line occurrences and divide by the total number of lines (Unix/Python)

3 answers

User

Answer 1 · edited 2024-09-19 23:32:39

Here is a pure AWK solution:

<test.in awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR}}'

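It uses AWK arrays and the special variable NR, which tracks the number of lines read so far.

Let's go through the code in detail. The first block

{a[$0]++}

is executed once for each line of the input. Here $0 holds the entire line, and it is used as the index into the array a, so this block simply counts the occurrences of each line.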

The second block

END {for (i in a) {print i, "\t", a[i]/NR}}

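is executed at the end of the input. At that point, a contains the number of occurrences of every line in the input, indexed by the line itself. By looping over it we can therefore print a table of lines and their relative frequencies (each count is divided by the total number of lines, NR).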
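For readers more comfortable with Python, the same two-phase logic (count per line, then divide by the number of lines read) can be sketched roughly as follows. This is only an illustration, not part of the AWK answer: the function name `line_frequencies` and the sample data are made up for the example.

```python
def line_frequencies(lines):
    """Return {line: relative frequency} for an iterable of lines."""
    counts = {}   # plays the role of the AWK array a
    total = 0     # plays the role of NR
    for line in lines:
        key = line.rstrip("\n")              # AWK's $0 excludes the newline
        counts[key] = counts.get(key, 0) + 1  # a[$0]++
        total += 1
    # Equivalent of the END block: a[i] / NR for every distinct line
    return {key: n / total for key, n in counts.items()}


# Hypothetical sample input, standing in for the contents of test.in
sample = ["foo\n", "bar\n", "foo\n", "foo\n"]
freqs = line_frequencies(sample)
for key, freq in freqs.items():
    print(key, "\t", freq)
```

Passing an open file object instead of the list should work the same way, since a file iterates line by line.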

User
Answer 2 · edited 2024-09-19 23:32:39

from collections import Counter

with open('data.txt') as infile:
    # Counter will treat infile as an iterator and exhaust it
    counter = Counter(infile)

# Don't know if you need sorting, but this will sort in descending order
counts = ((line.strip(), n) for line, n in counter.most_common())

# Convert to proportional amounts
total = sum(counter.values())
probs = [(line, n / total) for line, n in counts]

# Separate the line and its proportion with a tab so they do not run together
print("\n".join("{}\t{}".format(*p) for p in probs))

This has several advantages: it iterates over the lines of the file instead of loading the whole file at once, it makes use of the existing Counter functionality, it can sort, and it is clear what is going on.
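One detail worth keeping in mind when a file object is passed straight to Counter: the keys keep their line endings, so a final line without a trailing newline is counted separately from otherwise identical lines. The sketch below illustrates this with an in-memory file; the sample text is invented for the example, and stripping the line endings before counting is just one possible way to handle it.

```python
import io
from collections import Counter

# Hypothetical in-memory file whose last line has no trailing newline
infile = io.StringIO("foo\nbar\nfoo")

# Counting the raw lines: "foo\n" and "foo" are different keys
raw = Counter(infile)

# Counting after stripping line endings: both "foo" lines share one key
infile.seek(0)
stripped = Counter(line.rstrip("\r\n") for line in infile)

print(raw)
print(stripped)
```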

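User
Answer 3 · edited 2024-09-19 23:32:39

Note that uniq only counts adjacent repeated lines, so it must be preceded by sort for it to take all lines of the file into account. Compared with sort | uniq -c, the following code using collections.Counter is more efficient, because it does not need to sort the whole input; it only sorts the distinct lines for display: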

from collections import Counter

with open('test.in') as inf:
    counts = sorted(Counter(line.strip('\r\n') for line in inf).items())
    total_lines = float(sum(i[1] for i in counts))
    for line, freq in counts:
        print("{}\t{:.4f}".format(line, freq / total_lines))

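For the input given in your description, this script outputs

english<tab>walawala<tab>0.1667
foo bar<tab>laa war<tab>0.5000
hello world<tab>walo lorl<tab>0.3333
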
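However, if you only need to merge consecutive lines, as uniq -c does, note that any solution using Counter will give the output shown in the question, whereas your uniq -c approach will not. The output of uniq -c would be:

1 english<tab>walawala
2 foo bar<tab>laa war
2 hello world<tab>walo lorl
1 foo bar<tab>laa war

and not

1 english<tab>walawala
3 foo bar<tab>laa war
2 hello world<tab>walo lorl

If that is the behavior you want, you can use itertools.groupby:
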
from itertools import groupby

with open('foo.txt') as inf:
    grouper = groupby(line.strip('\r\n') for line in inf)
    items = [ (k, sum(1 for j in i)) for (k, i) in grouper ]
    total_lines = float(sum(i[1] for i in items))
    for line, freq in items:
        print("{}\t{:.4f}".format(line, freq / total_lines))

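The difference is that, given a test.in with the contents you specified, the uniq pipeline will not produce the output you gave in your example. Instead you would get:

english<tab>walawala<tab>0.1667
foo bar<tab>laa war<tab>0.3333
hello world<tab>walo lorl<tab>0.3333
foo bar<tab>laa war<tab>0.1667

Since this is not what your example says, it seems uniq cannot solve the problem without sort. In that case you need to fall back on my first example, which in Python should be faster than the Unix command line.

By the way, these work the same in all Python versions > 2.6.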
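To make the contrast between the two counting strategies in this answer concrete, here is a small side-by-side sketch. The line order in `lines` is an assumption, reconstructed from the uniq -c output quoted above; the variable names are illustrative only.

```python
from collections import Counter
from itertools import groupby

# Assumed input order, reconstructed from the uniq -c output in this answer
lines = [
    "english\twalawala",
    "foo bar\tlaa war",
    "foo bar\tlaa war",
    "hello world\twalo lorl",
    "hello world\twalo lorl",
    "foo bar\tlaa war",
]

# Counter tallies every occurrence, no matter where it appears in the input
global_counts = Counter(lines)

# groupby only merges runs of adjacent equal lines, like uniq -c without sort
adjacent_runs = [(key, sum(1 for _ in group)) for key, group in groupby(lines)]

print(global_counts["foo bar\tlaa war"])
print(adjacent_runs)
```

With Counter the three "foo bar" lines end up under one key, while groupby reports them as two separate runs because another line sits between them.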
