I'm trying to write a Python web crawler and make it multithreaded. The main problem I'm running into is using ThreadPoolExecutor to run my code concurrently.
    def crawl(self, url):
        for link in self.get_links(url):
            if link in self.visited:
                continue
            print("Scraping URL: {}".format(link))
            # if not visited, add to the visited set (O(1) time)
            self.visited.add(link)
            info = self.extract_info(link)
        return "word"
For now, my crawl function just returns a string.
I have a start function that kicks off the pool, which has a maximum of 2 workers:
    def start(self):
        job = self.pool.submit(self.crawl(self.startingUrl))
        job.add_done_callback(self.appendText)
The problem shows up in the appendText function, where I try to turn the future object back into a string so I can write it to a file:
    def appendText(self, res):
        print("HELLO!")
        print("res = ", res.result())
        with open("Crawled.txt", "w") as file:
            des = "Description: {}".format(res.result())
            key = "Keywords:{}".format(res.result())
            file.write(des)
            file.write(key)
I end up with a TypeError, and I've been looking for a way to convert the future object to a string:
    HELLO!
    Traceback (most recent call last):
      File "crawler/crawler.py", line 78, in <module>
        crawler.start()
      File "crawler/crawler.py", line 73, in start
        job.add_done_callback(self.appendText)
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 403, in add_done_callback
        fn(self)
      File "crawler/crawler.py", line 53, in appendText
        print("res = ", res.result())
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 425, in result
        return self.__get_result()
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
        raise self._exception
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
        result = self.fn(*self.args, **self.kwargs)
    TypeError: 'str' object is not callable
Where am I going wrong here? Thanks, everyone!
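crawl returns a string. With the code as written, you call crawl yourself and then hand the string it returns to submit. The pool then tries to execute that string as if it were a function, which causes the TypeError.

You want to pass the uncalled function to submit, along with its arguments, and let the pool call crawl for you: the first argument to submit (the target) is the function you want it to call, and the args after it are the arguments you want it to call the function with.

You can also get roughly the same effect by wrapping the call in a lambda, which delays execution until the pool runs it. Prefer the first approach, though, since the lambda adds some overhead. I'm including it here for reference.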