Poor bulk insert performance using Python 3 and SQLite

<table>
<tr><th>Configuration</th><th>Action</th><th>Time</th><th>Notes</th></tr>
<tr><td>50,000 - journal=delete, synchronous=1, page_size=65535, cache_size=8192</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:18.011823</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>50,000 - journal=delete, synchronous=1, page_size=65535, cache_size=default</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:25.692283</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>100,000 - journal=delete, synchronous=1, page_size=65535</td><td></td><td>0:07:13.402985</td><td></td></tr>
<tr><td>100,000 - journal=delete, synchronous=1, page_size=65535, cache_size=4096</td><td></td><td>0:04:47.624909</td><td></td></tr>
<tr><td>100,000 - journal=delete, synchronous=1, page_size=65535, cache_size=8192</td><td></td><td>0:03:32.473927</td><td></td></tr>
<tr><td>100,000 - journal=delete, synchronous=1, page_size=65535, cache_size=8192</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:17.927050</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>100,000 - journal=delete, synchronous=1, page_size=65535, cache_size=default</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:21.804679</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>100,000 - journal=delete, synchronous=1, page_size=65535, cache_size=default</td><td>REMOVE UNIQUE FROM URL &amp; ID</td><td>0:00:14.062386</td><td>Size reduced to 134MB from 350MB</td></tr>
<tr><td>100,000 - journal=delete, synchronous=1, page_size=65535, cache_size=default</td><td>REMOVE UNIQUE FROM URL &amp; DELETE ID</td><td>0:00:11.961004</td><td>Size reduced to 134MB from 350MB</td></tr>
</table>

2 Answers

User

Answer 1 · edited 2024-06-26 01:39:08

The UNIQUE constraint on the "url" column creates an implicit index on url. That explains the size increase.
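To see the implicit index for yourself, something like the following minimal sketch should work. The table definition here is an assumption (the thread never shows the real schema); only the `blacklist` table and `url` column names come from the code in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the real table definition is not shown in the thread.
conn.execute("CREATE TABLE blacklist (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")

# Ask SQLite which indexes exist on the table. None was created explicitly,
# so anything listed here was created by SQLite to enforce UNIQUE.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'blacklist'"
    )
]
print(index_names)
conn.close()
```

The automatically created index is stored in the database file alongside the table, which is where the extra space goes.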

I don't think you can populate the table and then add the UNIQUE constraint afterwards.
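That said, while SQLite's ALTER TABLE cannot add a constraint to an existing table, a unique index created after loading should enforce the same rule. A rough sketch, assuming a hypothetical schema without UNIQUE and a made-up index name `idx_blacklist_url`; duplicates have to be removed first, otherwise CREATE UNIQUE INDEX fails:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema without UNIQUE, so inserts do not maintain an index.
conn.execute("CREATE TABLE blacklist (url TEXT)")

urls = ["http://a.example", "http://b.example", "http://a.example"]

with conn:  # commits on success, rolls back on error
    conn.executemany("INSERT INTO blacklist(url) VALUES (?)", ((u,) for u in urls))
    # Drop duplicate rows, keeping the first occurrence of each url.
    conn.execute(
        "DELETE FROM blacklist WHERE rowid NOT IN "
        "(SELECT MIN(rowid) FROM blacklist GROUP BY url)"
    )
    # Build the index once, after the bulk load (index name is hypothetical).
    conn.execute("CREATE UNIQUE INDEX idx_blacklist_url ON blacklist(url)")

row_count = conn.execute("SELECT COUNT(*) FROM blacklist").fetchone()[0]

# From here on, duplicates are rejected just as with a UNIQUE column.
try:
    conn.execute("INSERT INTO blacklist(url) VALUES (?)", ("http://a.example",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Whether this is faster overall than inserting into an already indexed table depends on the data, so it is worth timing both.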

Your bottleneck is surely the CPU. Try the following:

Install toolz: pip install toolz

Use this method:

import sqlite3
from datetime import datetime

from toolz import partition_all

def add_blacklist_url(self, urls):
    # Method of the database wrapper class; self.cursor and self.db belong to it.
    # The :url placeholder expects each item in urls to be a mapping such as {'url': ...}.
    start_time = datetime.now()
    records = 0
    for batch in partition_all(100000, urls):
        try:
            start_commit = datetime.now()
            self.cursor.executemany('''INSERT OR IGNORE INTO blacklist(url) VALUES(:url)''', batch)
            end_commit = datetime.now() - start_commit
            records += len(batch)
            print('add_blacklist_url:: total time for INSERT OR IGNORE INTO blacklist {} entries = {}'.format(len(batch), end_commit))
        except sqlite3.Error as e:
            print("add_blacklist_url:: Database error: %s" % e)
        except Exception as e:
            print("add_blacklist_url:: Exception in _query: %s" % e)
    # Commit once, after all batches have been handed to SQLite.
    self.db.commit()
    time_elapsed = datetime.now() - start_time
    print('add_blacklist_url:: total time for {} entries = {}'.format(records, time_elapsed))

The code is untested.
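If you would rather not install toolz, the batching can be sketched with the standard library alone. The `batches` helper below is a hypothetical stand-in for `partition_all`, and the schema and sample URLs are made up for illustration:

```python
import sqlite3
from itertools import islice

def batches(iterable, size):
    """Yield lists of at most `size` items (stand-in for toolz.partition_all)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blacklist (url TEXT UNIQUE)")  # hypothetical schema

# The :url placeholder needs one mapping per row. 1000 rows, 250 distinct urls.
urls = [{"url": "http://example.com/%d" % (i % 250)} for i in range(1000)]

for batch in batches(urls, 300):
    conn.executemany("INSERT OR IGNORE INTO blacklist(url) VALUES(:url)", batch)
conn.commit()  # single commit after all batches

stored = conn.execute("SELECT COUNT(*) FROM blacklist").fetchone()[0]
print(stored)
```

INSERT OR IGNORE silently skips the repeated URLs, so only the distinct ones end up in the table.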

User

Answer 2 · edited 2024-06-26 01:39:08

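By default, SQLite runs in autocommit mode, which is why BEGIN TRANSACTION can be omitted. Here, however, we want all the inserts to happen in a single transaction, and the only way to do that is to start one with BEGIN TRANSACTION, so that every statement that follows runs inside it.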
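With Python's sqlite3 module, one way to take control of this is to open the connection with `isolation_level=None`, so the module does not manage transactions itself, and then issue BEGIN and COMMIT by hand. A minimal sketch, with a hypothetical schema and sample data:

```python
import sqlite3

# isolation_level=None: the module opens no transactions on its own,
# so the explicit BEGIN/COMMIT below define the transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE blacklist (url TEXT)")  # hypothetical schema

rows = [("http://example.com/%d" % i,) for i in range(1000)]

conn.execute("BEGIN")
try:
    # All 1000 inserts run inside the one transaction started above.
    conn.executemany("INSERT INTO blacklist(url) VALUES (?)", rows)
    conn.execute("COMMIT")
except sqlite3.Error:
    conn.execute("ROLLBACK")
    raise

inserted = conn.execute("SELECT COUNT(*) FROM blacklist").fetchone()[0]
```

With a file-based database this means one journal write and sync for the whole batch rather than one per statement.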

The executemany method is just a loop over execute that runs outside Python, and it calls the SQLite prepare-statement function only once.

The following is a really bad way to remove the last N items from a list:

    templist = []
    i = 0
    while i < self.bulk_insert_entries and len(urls) > 0:
        templist.append(urls.pop())
        i += 1

It is better to do this:

    templist = urls[-self.bulk_insert_entries:]
    del urls[-self.bulk_insert_entries:]

Both the slice and the del of a slice work even on an empty list.

Both may have the same complexity, but 100K calls to append and pop cost a lot more than letting Python do the work outside the interpreter.
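The two approaches can be compared side by side with a small sketch. The function names are made up for illustration, and `n` is assumed to be at least 1 (with `n = 0`, `urls[-0:]` would select the whole list):

```python
def take_last_loop(urls, n):
    """Pop up to n items off the end, one interpreter-level call at a time."""
    templist = []
    i = 0
    while i < n and len(urls) > 0:
        templist.append(urls.pop())
        i += 1
    return templist

def take_last_slice(urls, n):
    """Take up to n items off the end with one slice and one del (n >= 1)."""
    templist = urls[-n:]
    del urls[-n:]
    return templist

a = list(range(10))
b = list(range(10))
from_loop = take_last_loop(a, 3)    # [9, 8, 7]: pop() reverses the order
from_slice = take_last_slice(b, 3)  # [7, 8, 9]: the slice keeps the order
empty_result = take_last_slice([], 3)  # [] and no exception
```

Note that the loop returns the items in reverse order while the slice keeps the original order, which makes no difference for inserts into an unordered table.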
