Python MySQL脚本速度慢,需要一些advi吗

2024-09-28 19:09:56 发布

您现在位置:Python中文网/ 问答频道 /正文

我是新到这个网站在张贴,我发现了很多答案在这里。你知道吗

我不是最好的python或mysql程序员,但我想创建一个非常简单的脚本, 把一些信息放在数据库里,我可以在那里查找和排序。你知道吗

我正在尝试创建类似DNS请求记录器的东西。 Windows 2003(当前)服务器正在将所有DNS请求记录到文件中。 每天我把这个(新的)日志文件拉到Ubuntu机器上,把它处理成MySQL数据库, 使用sh和python。你知道吗

这里是我想要一些建议的地方,日志文件可能会变大,大约70k行, 使用下面的代码,这可能需要20分钟的处理。有没有什么方法可以加快它的速度(我是不是用了糟糕的编码)或者它是“正常”的。你知道吗

我试着让我的sql命令尽可能具体,只返回所需的命令, 每5000次迭代只使用一次提交

主要轮廓: 检查数据库中是否存在记录,如果不存在,则返回ID。 检查当时是否已经有一个记录的条目(HH:MM,注意到几乎同时有多个查询),如果没有创建条目,否则加1计数。你知道吗

就这样。。你知道吗

提前谢谢愿意看的人。你知道吗

尼克。你知道吗

示例条目:

20141204 23:00:50 172.000.000.000 A .www.google.nl
20141204 23:00:53 172.000.000.000 AAAA .www.google.nl

代码:

#! /usr/bin/python

import os, sys
from datetime import datetime
import MySQLdb

# variables
dnslog = "dns-edited.log"
sqldb = None
sqlcur = None

def leesBestand(bestand):
    with open(bestand,"r") as entries:
        runcount = 0
        for entry in entries:
            # 0:<Datum> 1:<Tijd> 2:<IP Requester> 3:<Record Type> 4:<Domain name>
            words = entry.split()

            datum = setDatum(words[0])
            tijd = setTijd(words[1])

            domainId = checkEntry(words[4][1:],words[3])
            requestId = bestaadRequest(datum,tijd,words[2],domainId)

            if runcount > 5000 :
                dbCommit()
                runcount = 0
            runcount += 1

def dbExecute(sqlstring):
    try :
        global sqlcur
        sqlcur.execute(sqlstring)
    except MySQLdb.Error,e:
        global sqldb
        sqldb.rollback()
        print "[Error-Execute] : " + str(e)

def dbCommit():
    global sqldb
    try :
        print "[Debug] SQL Commit"
        sqldb.commit()
    except MySQLdb.Error,e:
        sqldb.rollback()
        print "[Error-Commit] : " + str(e)

def checkEntry(domain,domaintype):
    global sqlcur
    sqlstring = """SELECT dnslog_domain_id FROM dnslog_domains WHERE `dnslog_domain_name` = '%s' AND 
        `dnslog_domain_type` = '%s'""" % (domain,domaintype)
    dbExecute(sqlstring)

    row = sqlcur.fetchone() 
    if row :
        return row[0]
    else :
        sqlvalues = prepareInsertDomain(domain,domaintype)
        sqlinsert = """INSERT INTO dnslog_domains VALUES ('','%s','%s','%s','%s','%s','%s','%s','%s')
            """ % (sqlvalues[0],sqlvalues[1],sqlvalues[2],sqlvalues[3],sqlvalues[4],sqlvalues[5],sqlvalues[6],sqlvalues[7])
        dbExecute(sqlinsert)
        return sqlcur.lastrowid

def bestaadRequest(datum,tijd,requester,domainId):
    global sqlcur
    sqlstring = """SELECT dnslog_request_id, dnslog_request_count FROM dnslog_requests WHERE `dnslog_domain_id` = '%s' AND
        `dnslog_request_requester` = '%s' AND `dnslog_request_date` = '%s' AND `dnslog_request_time` = '%s' 
        """ % (domainId,requester,datum,tijd)
    dbExecute(sqlstring)

    row = sqlcur.fetchone() 
    if row :
        teller = int(row[1]) + 1
        sqlinsert = """UPDATE `dnslog_requests` SET `dnslog_request_count` = '%s' WHERE `dnslog_request_id` = '%s' """ % (teller, row[0])
        dbExecute(sqlinsert)
        return row[0]
    else :
        sqlinsert = """INSERT INTO dnslog_requests VALUES ('','%s','%s','%s','%s','%s') """ % (domainId,requester,datum,tijd,"1")
        dbExecute(sqlinsert)
        return sqlcur.lastrowid

# Main starting function
if __name__ ==  '__main__':
    dbConnect()

    leesBestand(dnslog)

    dbCommit()

    dbClose()

数据库布局

mysql> describe dnslog_requests;
+--------------------------+-------------+------+-----+---------+----------------+
| Field                    | Type        | Null | Key | Default | Extra          |
+--------------------------+-------------+------+-----+---------+----------------+
| dnslog_request_id        | int(11)     | NO   | PRI | NULL    | auto_increment |
| dnslog_domain_id         | int(11)     | YES  |     | NULL    |                |
| dnslog_request_requester | varchar(15) | NO   |     | NULL    |                |
| dnslog_request_date      | date        | NO   |     | NULL    |                |
| dnslog_request_time      | time        | NO   |     | NULL    |                |
| dnslog_request_count     | int(11)     | NO   |     | NULL    |                |
+--------------------------+-------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)

mysql> describe dnslog_domains;
+---------------------+--------------+------+-----+---------+----------------+
| Field               | Type         | Null | Key | Default | Extra          |
+---------------------+--------------+------+-----+---------+----------------+
| dnslog_domain_id    | int(11)      | NO   | PRI | NULL    | auto_increment |
| dnslog_domain_name  | varchar(250) | NO   |     | NULL    |                |
| dnslog_domain_type  | varchar(15)  | NO   |     | NULL    |                |
| dnslog_domain_part1 | varchar(100) | NO   |     | NULL    |                |
| dnslog_domain_part2 | varchar(100) | NO   |     | NULL    |                |
| dnslog_domain_part3 | varchar(100) | NO   |     | NULL    |                |
| dnslog_domain_part4 | varchar(100) | NO   |     | NULL    |                |
| dnslog_domain_part5 | varchar(100) | NO   |     | NULL    |                |
| dnslog_domain_part6 | varchar(100) | NO   |     | NULL    |                |
+---------------------+--------------+------+-----+---------+----------------+

Tags: noidrequestdomainnullrowwordsvarchar
2条回答

我编辑了代码并提出了一些建议, 此时,处理时间缩短为3分钟 小文件和10分钟的大文件。你知道吗

不是最好的速度,但比以前更好 (特别是因为时间随着较大的DB而增加,如图1所示)

新脚本:

#! /usr/bin/python
import os, sys
from datetime import datetime
import MySQLdb

print "[Start program] " + datetime.now().strftime('%H:%M:%S')

dnslog = "dns-edited.log"
domainInsertValues = []
requestInsertValues = []
domainInsertQuery = """INSERT INTO dnslog_domains (dnslog_domain_name, dnslog_domain_type, dnslog_domain_part1, 
        dnslog_domain_part2, dnslog_domain_part3, dnslog_domain_part4, dnslog_domain_part5, 
        dnslog_domain_part6) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"""
requestInsertQuery = """INSERT INTO dnslog_requests (dnslog_domain_id, dnslog_request_requester, dnslog_request_date,
        dnslog_request_time, dnslog_request_count) VALUES (%s,%s,%s,%s,%s)"""

sqldb = None
sqlcur = None

def leesBestand(bestand,typeRun):
    with open(bestand,"r") as entries:
        for entry in entries:
            # 0:<Datum> 1:<Tijd> 2:<IP Requester> 3:<Record Type> 4:<Domain name>
            words = entry.split()
            datum = setDatum(words[0])
            tijd = setTijd(words[1])
            if typeRun == "domain" :
                checkEntry(words[4][1:],words[3])
            elif typeRun == "request" :
                domainId = checkDomainId(words[4][1:],words[3])
                bestaadRequest(datum,tijd,words[2],domainId)

def checkEntry(domain,domaintype):
    global sqlcur
    global domainInsertValues
    sqlstring = """SELECT dnslog_domain_id FROM dnslog_domains WHERE `dnslog_domain_name` = '%s' AND 
        `dnslog_domain_type` = '%s'""" % (domain,domaintype)
    dbExecute(sqlstring)
    row = sqlcur.fetchone() 
    if not row :
        add = False
        match1 = [s for s in domainInsertValues if domain in s]
        if not match1 :
            add = True
        else :
            match2 = [s for s in match1 if domaintype in s]
            if not match2 :
                add = True 
        if add:
            sqlvalues = prepareInsertDomain(domain,domaintype)
            sqlstring = (sqlvalues[0],sqlvalues[1],sqlvalues[2],sqlvalues[3],sqlvalues[4],sqlvalues[5],sqlvalues[6],sqlvalues[7])
            domainInsertValues.append(sqlstring)

def checkDomainId(domain,domaintype):
    global sqlcur
    sqlstring = """SELECT dnslog_domain_id FROM dnslog_domains WHERE `dnslog_domain_name` = '%s' AND 
        `dnslog_domain_type` = '%s'""" % (domain,domaintype)
    dbExecute(sqlstring)
    row = sqlcur.fetchone() 
    if not row :
        return 27885
    else :
        return row[0]

def bestaadRequest(datum,tijd,requester,domainId):
    global requestInsertValues

    sqldata = (int(domainId),requester,str(datum),tijd,1)
    match1 = [i for i,s in enumerate(requestInsertValues) if s[0] == int(domainId) and s[1] == requester and s[2] == str(datum) and s[3] == tijd]
    if not match1 : 
        requestInsertValues.append(sqldata)
    else :
        requestInsertValues[i] = (int(domainId),requester,str(datum),tijd, s[4] + 1)

# Main starting function
if __name__ ==  '__main__':
    dbConnect()
    print "[Start] Lees Bestand - Domain " + datetime.now().strftime('%H:%M:%S')    
    leesBestand(dnslog,"domain")
    print "[Start] Execute Many - Domain " + datetime.now().strftime('%H:%M:%S')
    dbExecuteMany(domainInsertQuery,domainInsertValues)
    print "[Start] DB Commit " + datetime.now().strftime('%H:%M:%S')
    dbCommit()
    print "[Start] Lees bestand - Request " + datetime.now().strftime('%H:%M:%S')
    leesBestand(dnslog,"request")
    print "[Start] Execute Many - Request " + datetime.now().strftime('%H:%M:%S')
    dbExecuteMany(requestInsertQuery,requestInsertValues)
    print "[Start] DB Commit " + datetime.now().strftime('%H:%M:%S')
    dbCommit()
    dbClose()
    print "[Stop program] " + datetime.now().strftime('%H:%M:%S')

行计数:

user@host:~/dnslog$ wc -l backups/*
   69913 backups/dns-2014-12-04.log
   57220 backups/dns-2014-12-05.log
   27883 backups/dns-2014-12-06.log
   30679 backups/dns-2014-12-07.log
   62546 backups/dns-2014-12-08.log
   50267 backups/dns-2014-12-09.log
   55166 backups/dns-2014-12-10.log
   57169 backups/dns-2014-12-11.log
   46335 backups/dns-2014-12-12.log
   18632 backups/dns-2014-12-13.log
   18655 backups/dns-2014-12-14.log
  494465 total

SQL表:

mysql> select count(dnslog_domain_id) from dnslog_domains;
+            -+
| count(dnslog_domain_id) |
+            -+
|                   35243 |
+            -+
1 row in set (0.02 sec)

mysql> select count(dnslog_request_id) from dnslog_requests;
+             +
| count(dnslog_request_id) |
+             +
|                   342005 |
+             +
1 row in set (0.16 sec)

mysql> select sum(dnslog_request_count) from dnslog_requests;
+             -+
| sum(dnslog_request_count) |
+             -+
|                    494459 |
+             -+
1 row in set (0.12 sec)

对于那些对执行时间感兴趣的人:

user@host:~/dnslog$ time ./script.py # 69913
[Start program] 14:13:45
[Start] Lees Bestand - Domain 14:13:45
[Start] Execute Many - Domain 14:15:45
[Start] DB Commit 14:15:46
[Start] Lees bestand - Request 14:15:46
[Start] Execute Many - Request 14:29:02
[Start] DB Commit 14:29:15
[Stop program] 14:29:15

real    14m37.079s
user    10m51.780s
sys     0m10.043s


user@host:~/dnslog$ time ./script.py # 57220
[Start program] 14:30:36
[Start] Lees Bestand - Domain 14:30:36
[Start] Execute Many - Domain 14:31:42
[Start] DB Commit 14:31:43
[Start] Lees bestand - Request 14:31:44
[Start] Execute Many - Request 14:59:18
[Start] DB Commit 14:59:28
[Stop program] 14:59:28

real    28m51.763s
user    7m41.154s
sys     0m8.981s


user@host:~/dnslog$ time ./script-v0.4.py # 27883
[Start program] 15:02:26
[Start] Lees Bestand - Domain 15:02:26
[Start] Execute Many - Domain 15:02:36
[Start] DB Commit 15:02:36
[Start] Lees bestand - Request 15:02:36
[Start] Execute Many - Request 15:04:17
[Start] DB Commit 15:04:19
[Stop program] 15:04:19

real    1m53.154s
user    1m35.524s
sys     0m2.226s


user@host:~/dnslog$ time ./script-v0.4.py # 30679
[Start program] 15:04:39
[Start] Lees Bestand - Domain 15:04:39
[Start] Execute Many - Domain 15:04:50
[Start] DB Commit 15:04:50
[Start] Lees bestand - Request 15:04:50
[Start] Execute Many - Request 15:06:42
[Start] DB Commit 15:06:45
[Stop program] 15:06:45

real    2m6.326s
user    1m45.733s
sys     0m2.472s


user@host:~/dnslog$ time ./script-v0.4.py # 62546
[Start program] 15:07:07
[Start] Lees Bestand - Domain 15:07:07
[Start] Execute Many - Domain 15:08:14
[Start] DB Commit 15:08:15
[Start] Lees bestand - Request 15:08:15
[Start] Execute Many - Request 15:16:28
[Start] DB Commit 15:16:38
[Stop program] 15:16:38

real    9m31.162s
user    7m7.657s
sys     0m5.947s


user@host:~/dnslog$ time ./script-v0.4.py # 50267
[Start program] 15:19:02
[Start] Lees Bestand - Domain 15:19:03
[Start] Execute Many - Domain 15:20:09
[Start] DB Commit 15:20:10
[Start] Lees bestand - Request 15:20:10
[Start] Execute Many - Request 15:27:48
[Start] DB Commit 15:27:57
[Stop program] 15:27:57

real    8m54.934s
user    5m54.422s
sys     0m5.197s


user@host:~/dnslog$ time ./script-v0.4.py # 55166
[Start program] 15:48:37
[Start] Lees Bestand - Domain 15:48:37
[Start] Execute Many - Domain 15:51:21
[Start] DB Commit 15:51:21
[Start] Lees bestand - Request 15:51:21
[Start] Execute Many - Request 16:01:14
[Start] DB Commit 16:01:24
[Stop program] 16:01:24

real    12m47.229s
user    7m12.642s
sys     0m6.413s


user@host:~/dnslog$ sudo ./script.sh # 57169 - Oops forgot copy correct script...
[sudo] password for monitor:
Start:  + 10:08:03
Sed / Grep / Awk magic:  + 10:08:09
[Start program] 10:08:10
[Start] Lees Bestand - Domain 10:08:10
[Start] Execute Many - Domain 10:10:53
[Start] DB Commit 10:10:55
[Start] Lees bestand - Request 10:10:55
[Start] Execute Many - Request 11:47:27
[Start] DB Commit 11:47:37
[Stop program] 11:47:37


user@host:~/dnslog$ cp script-v0.4.py script.py


user@host:~/dnslog$ sudo ./script.sh # 46335
[sudo] password for monitor:
Start:  + 11:52:31
Sed / Grep / Awk magic:  + 11:52:36
[Start program] 11:52:36
[Start] Lees Bestand - Domain 11:52:36
[Start] Execute Many - Domain 11:54:56
[Start] DB Commit 11:54:57
[Start] Lees bestand - Request 11:54:57
[Start] Execute Many - Request 12:02:39
[Start] DB Commit 12:02:47
[Stop program] 12:02:47


user@host:~/dnslog$ sudo ./script.sh # 18632
Start:  + 12:06:23
Sed / Grep / Awk magic:  + 12:06:28
[Start program] 12:06:28
[Start] Lees Bestand - Domain 12:06:28
[Start] Execute Many - Domain 12:06:40
[Start] DB Commit 12:06:40
[Start] Lees bestand - Request 12:06:41
[Start] Execute Many - Request 12:07:52
[Start] DB Commit 12:07:55
[Stop program] 12:07:55


user@host:~/dnslog$ sudo ./script.sh # 18655
Start:  + 12:15:36
Sed / Grep / Awk magic:  + 12:15:41
[Start program] 12:15:41
[Start] Lees Bestand - Domain 12:15:41
[Start] Execute Many - Domain 12:15:54
[Start] DB Commit 12:15:54
[Start] Lees bestand - Request 12:15:54
[Start] Execute Many - Request 12:17:06
[Start] DB Commit 12:17:08
[Stop program] 12:17:08

数据库操作总是需要时间的,最好的方法是在windows上保存一个复制DNS请求文件,您可以从中继续处理。你知道吗

但是,每5-10分钟就向后端的ubuntu系统发送一个提交请求。你知道吗

如果我误解了这个问题,请告诉我。你知道吗

相关问题 更多 >