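Background:
I am developing a web crawler that spawns 7 threads, each of which queries a unique URL for an XML file. When a query receives its response, it parses that response into an XML tree, like so: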
conn = http.client.HTTPSConnection(host = uHost, port = uPort)
conn.request('GET', url = '/some/url/file.xml')
resp = conn.getresponse()
tree = xml.etree.ElementTree.parse(resp)
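When each thread is started, it is given a queue.Queue() as an argument so that it can put tree into it. That way __main__ is the only thread that writes files. Continuing from above: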
__main__
def receive(q):
    while True:
        try:
            uTree = q.get()
            uTree.write('/some/path/file.xml')
        except queue.Empty:
            pass
spawned thread
conn = http.client.HTTPSConnection(host = uHost, port = uPort)
conn.request('GET', url = '/some/url/file.xml')
resp = conn.getresponse()
tree = xml.etree.ElementTree.parse(resp)
q.put_nowait(tree)
However, I started receiving AttributeError: 'NoneType' object has no attribute 'write' when calling uTree.write(). A quick change from uTree.write() to print(type(uTree)) showed that the object sometimes stays intact and sometimes turns into NoneType:
<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'NoneType'>
<class 'NoneType'>
<class 'xml.etree.ElementTree.ElementTree'>
<class 'xml.etree.ElementTree.ElementTree'>
Questions:
Why do objects passed from a threading.Thread() into a queue.Queue() [residing in __main__] inconsistently change to NoneType?
How can I fix this?
Full code (if needed):
main.py
import queue
import crawl  # custom module
import threading

def crawler(query):
    # Retry until the query connects and parses successfully.
    while True:
        try:
            query.connect()
            break
        except:
            pass

def receive(q):
    while True:
        try:
            uQuery = q.get()
            uTree = uQuery.tree
            uTree.write('/some/path/file.xml')
        except queue.Empty:
            pass

urls = ['/url1.xml', '/url2.xml', ...]
q = queue.Queue()
queries = [crawl.Query(url, q) for url in urls]
threads = [threading.Thread(target = crawler, args = (query,)) for query in queries]
for t in threads:
    t.start()
receive(q)
crawl.py
import http.client
import xml.etree.ElementTree as ET

class Query:
    def __init__(self, url, q):
        self.url = url
        self.queue = q
        self.tree = None

    def connect(self):
        conn = http.client.HTTPConnection(host = 'something.com', port = '80')
        conn.request('GET', url = self.url)
        resp = conn.getresponse()
        self.tree = ET.parse(resp)
        self.queue.put_nowait(self)
        conn.close()
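(I would have posted this as a comment, but I don't seem to have the reputation for it.)
This does not solve your problem, but it may give you some hints.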
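I know that debugging threading problems is hard, but I would suggest simplifying your example. That includes taking out the XML parsing with ElementTree and the HTTP connections, since both seem unrelated to the problem.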
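As a rough illustration of what such a simplified example might look like, here is a minimal sketch that keeps only the threads and the queue. The names worker and run_demo and the tuple payload are made up for this sketch and are not from the question. If a stripped-down version like this never yields None, the problem most likely sits in the parts that were removed.

```python
import queue
import threading

def worker(q, n):
    # Stand-in for a crawler thread: put a plain payload on the queue
    # instead of a parsed XML tree.
    q.put(("payload", n))

def run_demo(n_threads=7):
    q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, n)) for n in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All producers have finished, so the queue can be drained without blocking.
    items = []
    while not q.empty():
        items.append(q.get_nowait())
    return items

if __name__ == "__main__":
    for item in run_demo():
        print(type(item), item)
```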
To track down the problem, you can also gain some insight by logging what you are putting into the queue.
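One possible way to do that logging is sketched below, using the standard logging module. The helpers logged_put and logged_get are hypothetical wrappers invented for this sketch. They record the type and id of every object on its way in and out, so a None can be traced back to the thread that enqueued it.

```python
import logging
import queue

# threadName in the format shows which thread produced each record.
logging.basicConfig(level=logging.DEBUG, format="%(threadName)s %(message)s")
log = logging.getLogger("crawler")

def logged_put(q, item):
    # Record the type and identity of everything that enters the queue.
    log.debug("put type=%s id=%s", type(item).__name__, id(item))
    if item is None:
        log.warning("about to enqueue None")
    q.put(item)

def logged_get(q):
    # Blocks until an item is available, then records what came out.
    item = q.get()
    log.debug("got type=%s id=%s", type(item).__name__, id(item))
    return item
```

In the question's code, the put_nowait call in the worker and the q.get() call in receive would be the places to swap these in.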
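I would be extra careful when putting complex objects (such as a parsed tree) into a queue. You then need to make sure that the object itself is thread-safe.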
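One way to sidestep the thread-safety question is to hand over an immutable snapshot rather than a shared mutable object. The sketch below is only an illustration under that assumption, not a diagnosis of the original bug: each worker serializes its tree to bytes before enqueueing it, and a sentinel (the made-up name DONE) tells the consumer when a worker has finished. The names producer and consume are likewise invented here. Note also that q.get() without a timeout blocks until an item arrives and never raises queue.Empty.

```python
import queue
import threading
import xml.etree.ElementTree as ET

DONE = object()  # hypothetical sentinel: one is enqueued per finished producer

def producer(q, name, xml_text):
    # Parse inside the worker, then pass on an immutable bytes snapshot
    # so no mutable tree object is shared between threads.
    root = ET.fromstring(xml_text)
    q.put((name, ET.tostring(root, encoding="utf-8")))
    q.put(DONE)

def consume(q, n_producers):
    results = {}
    finished = 0
    while finished < n_producers:
        item = q.get()  # blocks until an item is available
        if item is DONE:
            finished += 1
            continue
        name, payload = item
        results[name] = payload  # the main thread could write payload to a file here
    return results
```

Because the payload is bytes, the main thread can write it to disk in binary mode without touching any object that a worker still holds.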
In case you don't know it yet, I suggest taking a look at https://scrapy.org/, which makes implementing a crawler much easier.