Importing an unstructured log file into pandas

Posted 2024-09-29 23:28:22


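Can anyone here tell me how to import an unstructured file into pandas?

By unstructured I mean:

  • A log file with variable-length lines like the following: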
2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Default opID=823a15d0] Accepted password for user root from 127.0.0.1
2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Vimsvc opID=823a15d0] [Auth]: User root
2021-01-26T09:40:01.193Z info hostd[2101947] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=823a15d0] Event 24138 : User root@127.0.0.1 logged in as pyvmomi
2021-01-26T09:40:01.268Z info hostd[2101940] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=823a15de user=root] Event 24139 : User root@127.0.0.1 logged out (login time: Tuesday, 26 January, 2021 09:40:01 AM, number of API invocations: 0, user agent: pyvmomi)

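I have tried several approaches and searched Google, but everyone seems to be importing well-structured CSV files, and I could not find any reference on importing log files. (I am not a programmer; I just want to write this small program with pandas.)

*Several attempts, such as: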

# giving a range for column names, but this is not adequate: if I want to search through the logs for errors later, I'd have to use all 54 columns?!
 
pd.read_csv("mylog",sep='\s+',header=None,error_bad_lines=False, engine="python",quoting=csv.QUOTE_NONE,names=range(55))

# or putting everything into index :D 
pd.read_csv("mylog",sep='\t', lineterminator='\n', index_col=0)
*oh yeah, want to use timeframe as INDEX column* 

pd.read_csv("mylog", sep = None, iterator = True)

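The idea is to:

  • use the timestamp as the index
  • keep the remaining entries in a second column (or a second and third column), so that searching for strings/errors is easy

Thanks in advance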


1 answer
User
#1 · Posted 2024-09-29 23:28:22

My suggestion is to parse the file first, then clean up its contents, and finally build a DataFrame from the result:

import re
import pandas as pd


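with open("mylog.txt") as f:
    content = f.read()

# data pattern: timestamp, level, process[pid], [originator block], message
p = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (\w+) ([\w\[\]]+) (\[[^\]]+]) (.+)")

# p.match() returns None for lines that do not fit the pattern, so skip those
matches = (p.match(line) for line in content.splitlines())
data = [m.groups() for m in matches if m]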

# this is my guess, you can change the labels according to your problem
columns = ["date", "level", "node", "origin", "message"]

df = pd.DataFrame(data=data, columns=columns)

print(df)

which outputs:

                       date  ...                                            message
0  2021-01-26T09:40:01.192Z  ...     Accepted password for user root from 127.0.0.1
1  2021-01-26T09:40:01.192Z  ...                                  [Auth]: User root
2  2021-01-26T09:40:01.193Z  ...  Event 24138 : User root@127.0.0.1 logged in as...
3  2021-01-26T09:40:01.268Z  ...  Event 24139 : User root@127.0.0.1 logged out (...
[4 rows x 5 columns]
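
As an alternative sketch (not part of the original answer), the same pattern can probably be applied inside pandas with `Series.str.extract`. With named groups, the group names become the column names, and lines that do not match come back as rows of NaN. The variable names `lines`, `pattern`, `extracted` and `parsed` are illustrative, and the last sample line is made up to show the non-matching case.

```python
import pandas as pd

# Two lines copied from the question, plus one made-up malformed line
# to show what happens when the pattern does not match.
lines = [
    "2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Default opID=823a15d0] Accepted password for user root from 127.0.0.1",
    "2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Vimsvc opID=823a15d0] [Auth]: User root",
    "this line does not follow the format",
]

# Same pattern as in the answer, written with named groups so that
# str.extract uses the group names as column names.
pattern = (
    r"^(?P<date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) "
    r"(?P<level>\w+) "
    r"(?P<node>[\w\[\]]+) "
    r"(?P<origin>\[[^\]]+\]) "
    r"(?P<message>.+)$"
)

# One row per input line; non-matching lines become all-NaN rows.
extracted = pd.Series(lines).str.extract(pattern)

# Keep only the rows where the pattern matched.
parsed = extracted.dropna(how="all")
print(parsed)
```

This keeps everything in pandas, at the cost of silently turning bad lines into NaN rows, so it is worth counting the dropped rows before discarding them.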

Once you have the DataFrame, you can do everything else from there, such as setting the date as the index:

>>> df.set_index("date")
                         level  ...                                            message
date                            ...                                                   
2021-01-26T09:40:01.192Z  info  ...     Accepted password for user root from 127.0.0.1
2021-01-26T09:40:01.192Z  info  ...                                  [Auth]: User root
2021-01-26T09:40:01.193Z  info  ...  Event 24138 : User root@127.0.0.1 logged in as...
2021-01-26T09:40:01.268Z  info  ...  Event 24139 : User root@127.0.0.1 logged out (...
[4 rows x 4 columns]
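
Note that the index above still holds plain strings. Since the question asks for a time-based index and easy error searching, here is a minimal sketch of one way to get there, assuming the `date` column always holds ISO 8601 timestamps ending in `Z`. The small DataFrame is a stand-in for the parsed one (values taken from the question's log), and the names `window` and `logins` are illustrative.

```python
import pandas as pd

# Stand-in for the parsed DataFrame; values come from the question's log.
df = pd.DataFrame(
    {
        "date": [
            "2021-01-26T09:40:01.192Z",
            "2021-01-26T09:40:01.193Z",
            "2021-01-26T09:40:01.268Z",
        ],
        "level": ["info", "info", "info"],
        "message": [
            "Accepted password for user root from 127.0.0.1",
            "Event 24138 : User root@127.0.0.1 logged in as pyvmomi",
            "Event 24139 : User root@127.0.0.1 logged out (login time: Tuesday, 26 January, 2021 09:40:01 AM, number of API invocations: 0, user agent: pyvmomi)",
        ],
    }
)

# Parse the strings into timezone-aware timestamps (the trailing Z means UTC),
# then use them as a sorted DatetimeIndex.
df["date"] = pd.to_datetime(df["date"], utc=True)
df = df.set_index("date").sort_index()

# Select a time window; label slicing includes both endpoints.
start = pd.Timestamp("2021-01-26 09:40:01.192", tz="UTC")
end = pd.Timestamp("2021-01-26 09:40:01.200", tz="UTC")
window = df.loc[start:end]

# Literal substring search in the message column.
logins = df[df["message"].str.contains("logged in", regex=False)]

print(window)
print(logins)
```

With a real `DatetimeIndex`, time-window selection and resampling become available, while `str.contains` covers the string/error search on the message column.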

Edit:

To check which lines do not match this regular expression, you can do the following:

data = [(i, p.match(line)) for (i, line) in enumerate(content.splitlines())]

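Then, once you have the list of tuples of the form (<number>, <Match_or_None>), you can check which lines the regular expression failed to match and update the regular expression accordingly.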
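
A minimal sketch of that check, assuming the same pattern `p` as above. The `content` string here holds two lines from the question plus one hypothetical line that does not fit the format, and `unmatched` is an illustrative name.

```python
import re

p = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (\w+) ([\w\[\]]+) (\[[^\]]+]) (.+)")

# Two lines from the question, plus a hypothetical line that breaks the format.
content = "\n".join(
    [
        "2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Default opID=823a15d0] Accepted password for user root from 127.0.0.1",
        "2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Vimsvc opID=823a15d0] [Auth]: User root",
        "--> hypothetical continuation line",
    ]
)

lines = content.splitlines()
data = [(i, p.match(line)) for (i, line) in enumerate(lines)]

# Collect the number and text of every line the pattern did not recognize.
unmatched = [(i, lines[i]) for (i, m) in data if m is None]

for i, line in unmatched:
    print(f"line {i}: {line!r}")
```

The printed lines show exactly which inputs the pattern needs to be extended for, or which ones can safely be skipped.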
