How to count symbols/bytes and make a histogram

Posted 2024-10-02 16:30:00


I have a file with a lot of text in it. I read it like this:

file = open('labb9Text.txt', "r")

for lines in file:
    txt = str(lines)
    byteArr = bytearray(txt, "utf-8")

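Now I want to write a function makeHisto(byteArr) that returns a histogram (a list of length 256) indicating how many times each number/bit pattern (0-255) occurs in byteArr. I'm new to Python and don't know where to start. Any suggestions? Thanks.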


2 Answers

You can use [Python 3.Docs]: class collections.Counter([iterable-or-mapping]) on the file's contents:

>>> import collections
>>>
>>> file_name = r"C:\Windows\comsetup.log"
>>>
>>> with open(file_name, "rb") as fin:
...     text = fin.read()
...
>>> len(text)
771
>>>
>>> text
b'COM+[12:31:53]: ********************************************************************************\r\nCOM+[12:31:53]: Setup started - [DATE:12,24,2019 TIME: 12:31 pm]\r\nCOM+[12:31:53]: ********************************************************************************\r\nCOM+[12:31:53]: Start CComMig::Discover\r\nCOM+[12:31:53]: Return XML stream: <migXml xmlns=""><rules context="system"><include><objectSet></objectSet></include></rules></migXml>\r\nCOM+[12:31:53]: End CComMig::Discover - Return 0x00000000\r\nCOM+[12:31:56]: ********************************************************************************\r\nCOM+[12:31:56]: Setup (COMMIG) finished - [DATE:12,24,2019 TIME: 12:31 pm]\r\nCOM+[12:31:56]: ********************************************************************************\r\n'
>>>
>>> hist = collections.Counter(text)
>>>
>>> hist
Counter({42: 320, 58: 38, 32: 32, 49: 26, 101: 19, 50: 17, 51: 17, 77: 16, 116: 16, 67: 14, 91: 11, 93: 11, 48: 11, 109: 11, 79: 10, 115: 10, 105: 10, 43: 9, 53: 9, 13: 9, 10: 9, 114: 9, 117: 8, 110: 8, 60: 8, 62: 8, 111: 7, 99: 7, 108: 7, 83: 5, 100: 5, 69: 5, 112: 4, 68: 4, 84: 4, 44: 4, 103: 4, 34: 4, 47: 4, 97: 3, 45: 3, 73: 3, 88: 3, 120: 3, 54: 3, 65: 2, 52: 2, 57: 2, 118: 2, 82: 2, 61: 2, 98: 2, 106: 2, 76: 1, 121: 1, 40: 1, 71: 1, 41: 1, 102: 1, 104: 1})
>>>
>>> chr(42).encode()  # For testing purposes only
b'*'
>>>
>>> text.count(b"*")
320

hist is a mapping in which each key is a byte value ([0..255]) encountered in the text, and the corresponding value is its number of occurrences.
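The question asks for a list of length 256 rather than a mapping. A minimal sketch of one way to get there, assuming the input is a bytes or bytearray object; the name make_histo and the sample data are illustrative and do not come from the original answer:

```python
import collections


def make_histo(byte_arr):
    """Return a list of 256 counts, one per possible byte value (0..255)."""
    # Iterating over bytes/bytearray yields ints in the range 0..255.
    counts = collections.Counter(byte_arr)
    # A Counter returns 0 for keys it has never seen, so absent bytes map to 0.
    return [counts[value] for value in range(256)]


# Hypothetical sample input, used only for illustration.
data = bytearray("hello", "utf-8")
hist = make_histo(data)
print(len(hist))
print(hist[ord("l")])
```

Index i of the returned list holds the number of times byte value i occurs, so the list always has 256 entries even when most bytes never appear.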

Try this:

import requests
from io import StringIO

import seaborn as sns  # for data visualization
sns.set()

# To just take a file from https://norvig.com/big.txt
fin = StringIO(requests.get('https://norvig.com/big.txt').content.decode('utf8'))

num_symbols, num_bytes = [], []

for line in fin:
    # Get the size of the line in bytes once encoded as UTF-8.
    # (sys.getsizeof would report the in-memory size of the str object instead.)
    num_bytes.append(len(line.encode('utf8')))
    # Get the number of characters in the line.
    num_symbols.append(len(line))

# Plot the graph (histplot replaces the deprecated distplot).
sns.histplot(num_symbols)

# Plot the other graph.
sns.histplot(num_bytes)

Plotting them together will most likely be more informative. Try:

ax = sns.histplot(num_symbols, label="chars")
sns.histplot(num_bytes, label="bytes", ax=ax)
ax.legend()  # show the labels
