Please help me with this. I am trying to set up an asynchronous call between a Node.js script and a Python script. The Python script is a crawler that takes 30+ seconds to produce the full result, so to work around a request-timeout error I want to send intermediate output (I mean the URLs) to the connected Node.js script as they are found.
You can see the error in the attached image.
In my Python script I process the home-page URL by visiting every href present, then searching each page's HTML content for a given keyword; if I find the keyword, I return that URL. The problem I am facing is that the Python script returns all the URLs as one string at the end, whereas I want to send each one as it is found, since I believe that would resolve the request-timeout error on Heroku.
For example, in the Python script below you will see the print statement `if(search_keyword(u, search_key)): print(u)`; it prints the URL whenever the searched keyword is found, and Python then sends it to the spawn process in app.js.
```python
# For loop iterating over the input URLs
for url in input_urls:
    flag = 1
    # print("----------------PROCESSING NEW URL---------------URL [ "+url+" ]")
    try:
        content = selenium_calling(url)
        soup = BeautifulSoup(content, 'html.parser')
        search_string = re.sub(r"\s+", " ", soup.body.text)
        if search_keyword(url, search_key):
            if full_search == 0:
                continue
        anchor_tags_list = soup.find_all('a')
        anchor_tags_list = list(set(anchor_tags_list))
        if len(anchor_tags_list) != 0:
            base = url.rstrip("/")
            domain_name = base.split('/')[2]
            correctUrlList = []
            for link in anchor_tags_list:
                if link.has_attr('href'):
                    correctURL = ""
                    if domain_name in link.get('href'):
                        if re.search(regex, link.get('href')):
                            correctURL = link.get('href')
                            correctUrlList.append(correctURL)
                    elif not link.get('href').startswith(('www', '/www')) and link.get('href').startswith(('./', '/')):
                        urladder = link.get('href')
                        correctURL = base + urladder.lstrip(".")
                        if re.search(regex, correctURL):
                            correctUrlList.append(correctURL)
                    elif 'http' not in link.get('href') and not link.get('href').startswith(('www', '/www')):
                        urladder = link.get('href')
                        correctURL = base + "/" + urladder.lstrip(".")
                        if re.search(regex, correctURL):
                            correctUrlList.append(correctURL)
            correctUrlList = list(set(correctUrlList))
            correctUrlList = list(filter(None, correctUrlList))
            if url not in correctUrlList:
                correctUrlList.append(url)
        else:
            error_urls.append(url)
            continue
        if len(correctUrlList) < 1:
            error_urls.append(url)
            continue
        for u in correctUrlList:
            try:
                if search_keyword(u, search_key):
                    print(u)
                    found_results.append(u)
                    if full_search == 0:
                        break
                else:
                    notfound_results.append(u)
            except Exception as err:
                continue
    except Exception as err:
        error_urls.append(url)
```
The code below is the Node.js script app.js, in which you will see that I use console.log to capture the intermediate Python output, but that is not what happens: in the array outarr, all returned items are stored at index 0 instead of in separate indexes.
```javascript
// Send request to python script
var spawn = require('child_process').spawn;
var process = spawn('python', ["./webextraction.py", csvData, keywords, req.body.full_search]);
var outarr = [];

process.stdout.on('data', function(data){
    outarr.push(data.toString().trim());
    console.log(outarr[0]);
});
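A separate point worth noting: each `'data'` event can deliver several lines at once (or a partial line), so pushing whole chunks puts many URLs into one array slot. A minimal sketch of splitting chunks on newlines so each URL lands at its own index (the `makeLineCollector` helper and the sample chunks are hypothetical, not part of the original script):

```javascript
// Accumulate chunks and split on newlines, so each URL gets
// its own index even when one chunk holds many lines or a
// line is split across two chunks.
function makeLineCollector(outarr) {
    let buffer = '';
    return function onData(chunk) {
        buffer += chunk.toString();
        const lines = buffer.split('\n');
        buffer = lines.pop(); // keep the trailing partial line for later
        for (const line of lines) {
            if (line.trim() !== '') outarr.push(line.trim());
        }
    };
}

// Simulated stdout chunks; with spawn this would be
// proc.stdout.on('data', makeLineCollector(outarr));
const outarr = [];
const onData = makeLineCollector(outarr);
onData('https://a.example/page1\nhttps://b.exa');
onData('mple/page2\n');
console.log(outarr); // one URL per index
```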
Please help.