Sharing objects between threads yields NoneType

Posted 2024-09-30 01:32:02


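Background:

I am developing a web crawler that spawns 7 threads, each of which queries a unique URL for an XML file. When a query receives its response, it converts that response into an XML tree, as follows: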

conn = http.client.HTTPSConnection(host = uHost, port = uPort)
conn.request('GET', url = '/some/url/file.xml')
resp = conn.getresponse()
tree = xml.etree.ElementTree.parse(resp)

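When each thread is started, it is given a queue.Queue() as an argument so that it can put tree into it; that way __main__ is the only thread that writes files. Continuing from above:

__main__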

def receive(q):
    while True:
        try:
            uTree = q.get()
            uTree.write('/some/path/file.xml')
        except queue.Empty:
            pass

Spawned thread

conn = http.client.HTTPSConnection(host = uHost, port = uPort)
conn.request('GET', url = '/some/url/file.xml')
resp = conn.getresponse()
tree = xml.etree.ElementTree.parse(resp)
q.put_nowait(tree)

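However, I started getting AttributeError: 'NoneType' object has no attribute 'write' when calling uTree.write(). Quickly swapping uTree.write() for print(type(uTree)) shows that the object sometimes arrives intact and sometimes turns into NoneType: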

<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'NoneType'>
<class 'NoneType'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>

Question:

Why do objects passed from a threading.Thread() into a queue.Queue() [which lives in __main__] inconsistently turn into NoneType?

How can I fix this?

Full code (if needed):

main.py

import queue
from crawl import Query  # custom module
import threading

def crawler(query):
    while True:
        try:
            query.connect()
            break
        except:
            pass

def receive(q):
    while True:
        try:
            uQuery = q.get()
            uTree = uQuery.tree
            uTree.write('/some/path/file.xml')
        except queue.Empty:
            pass

urls = ['/url1.xml', '/url2.xml', ...]

q = queue.Queue()

queries = [Query(url, q) for url in urls]
threads = [threading.Thread(target = crawler, args = (query,)) for query in queries]

for t in threads:
    t.start()

receive(q)

crawl.py

import http.client
import xml.etree.ElementTree as ET

class Query:
    def __init__(self, url, q):
        self.url = url
        self.queue = q
        self.tree = None

    def connect(self):
        conn = http.client.HTTPConnection(host = 'something.com', port = '80')
        conn.request('GET', url = self.url)
        resp = conn.getresponse()
        self.tree = ET.parse(resp)
        self.queue.put_nowait(self)
        conn.close()

1 answer
User
#1 · posted 2024-09-30 01:32:02

(I would have posted this as a comment, but I do not seem to have enough reputation.)

This does not solve your problem, but it may give you some hints.

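I know that debugging threading issues is hard, but I suggest simplifying your example. It includes parsing XML with ElementTree and making an HTTP connection, both of which seem unrelated to the problem.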
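A stripped-down version could look like the sketch below. It keeps only the threading and queue parts; Payload, work and run are hypothetical names standing in for Query, connect and the main script, and the string result replaces the parsed tree. If this version never produces None, the cause is more likely in the parts that were removed.

```python
import queue
import threading


class Payload:
    """Hypothetical stand-in for Query: holds a result and a shared queue."""

    def __init__(self, name, q):
        self.name = name
        self.queue = q
        self.result = None  # filled in by work()

    def work(self):
        # Stands in for the HTTP request plus ET.parse(resp).
        self.result = f"result for {self.name}"
        self.queue.put(self)


def run(names):
    """Start one thread per name and collect exactly one item per thread."""
    q = queue.Queue()
    payloads = [Payload(name, q) for name in names]
    threads = [threading.Thread(target=p.work) for p in payloads]
    for t in threads:
        t.start()
    received = [q.get() for _ in names]  # blocks until each producer has put
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    for item in run(["/url1.xml", "/url2.xml"]):
        print(item.name, type(item.result))
```

Collecting a fixed number of items also lets the main thread finish, unlike a `while True` loop around `q.get()`, which blocks forever once the queue is drained.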

To track down the problem, you can also gain insight by logging what you are putting into the queue.
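One way to do that, as a rough sketch: route every put through a small helper that logs the item and refuses None. The helper name checked_put and the logger name are assumptions made for illustration; only the standard logging and queue modules are used.

```python
import logging
import queue

# %(threadName)s shows which thread produced each item.
logging.basicConfig(level=logging.DEBUG, format="%(threadName)s %(message)s")
log = logging.getLogger("crawler")  # hypothetical logger name


def checked_put(q, item):
    """Log the item about to be queued and fail loudly if it is None."""
    log.debug("queueing %r (type=%s)", item, type(item).__name__)
    if item is None:
        raise ValueError("refusing to queue None")
    q.put(item)
```

Calling checked_put(q, tree) in place of q.put_nowait(tree) would make a None raise in the producing thread, at the point where it is created, instead of surfacing later in the consumer.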

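I would advise being extra careful when putting complex objects (such as a parsed tree) into a queue. You then need to make sure that the object itself is thread-safe.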
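One way to sidestep that concern, sketched here under the assumption that the consumer only needs the document's contents, is to queue an immutable snapshot such as the serialized bytes instead of the live tree. The names producer and collect are hypothetical, and io.StringIO stands in for the HTTP response. This is not a confirmed fix for the NoneType error.

```python
import io
import queue
import threading
import xml.etree.ElementTree as ET


def producer(xml_text, q):
    """Parse in the worker thread, then queue only immutable bytes."""
    tree = ET.parse(io.StringIO(xml_text))  # stands in for ET.parse(resp)
    q.put(ET.tostring(tree.getroot()))      # bytes cannot change after the put


def collect(documents):
    """Run one producer per document and return the queued byte strings."""
    q = queue.Queue()
    threads = [threading.Thread(target=producer, args=(doc, q))
               for doc in documents]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All producers have finished, so the queue holds one item per document.
    return [q.get_nowait() for _ in documents]
```

The main thread can then write each bytes object to a file directly, so no object is shared between threads once it has been queued.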

In case you are not aware of it, I suggest looking at https://scrapy.org/, which makes implementing a crawler much easier.
