How to return intermediate output from a Python script and store it in an array inside a Node.js spawn process

Posted 2024-09-30 12:11:32


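Please help me with this. I am trying to set up an asynchronous call between a Node.js script and a Python script. The Python script is a crawler that takes more than 30 seconds to produce its complete result, so to avoid the timeout error I want to send intermediate output (that is, the URLs) to the connected Node.js script as they are found.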

The image below shows the error I get:

[Image]

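In my Python script, I process each home page URL by visiting every href present on it and searching the HTML content of each linked page for the given keyword. If the keyword is found, I return that URL. The problem is that the Python script returns all the URLs at once as a single string, whereas I want to send each one as soon as it is found, because I think that would fix the request timeout error on Heroku.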

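For example:

In the Python script below you will see the print statement if(search_keyword(u, search_key)): print(u). When the searched keyword is found, it prints the URL, and that output should then reach the spawn process in app.js.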

# Iterate over the input URLs
for url in input_urls:
    flag = 1
    # print("----------------PROCESSING NEW URL---------------URL [ "+url+" ]")
    try:
        content = selenium_calling(url)
        soup = BeautifulSoup(content, 'html.parser')
        search_string = re.sub(r"\s+", " ", soup.body.text)
        if search_keyword(url, search_key):
            if full_search == 0:
                continue
        anchor_tags_list = soup.find_all('a')
        anchor_tags_list = list(set(anchor_tags_list))
        if len(anchor_tags_list) != 0:
            base = url
            base = base.rstrip("/")
            domain_name = base.split('/')[2]
            correctUrlList = []
            for link in anchor_tags_list:
                if link.has_attr('href'):
                    href = link.get('href')
                    correctURL = ""
                    if domain_name in href:
                        # Absolute link on the same domain
                        if re.search(regex, href):
                            correctURL = href
                            correctUrlList.append(correctURL)
                    elif not href.startswith(('www', '/www')) and href.startswith(('./', '/')):
                        # Relative link starting with "./" or "/"
                        correctURL = base + href.lstrip(".")
                        if re.search(regex, correctURL):
                            correctUrlList.append(correctURL)
                    elif 'http' not in href and not href.startswith(('www', '/www')):
                        # Other relative link
                        correctURL = base + "/" + href.lstrip(".")
                        if re.search(regex, correctURL):
                            correctUrlList.append(correctURL)
            # Drop duplicates and empty entries
            correctUrlList = list(set(correctUrlList))
            correctUrlList = list(filter(None, correctUrlList))

            if url not in correctUrlList:
                correctUrlList.append(url)
        else:
            error_urls.append(url)
            continue
        if len(correctUrlList) < 1:
            error_urls.append(url)
            continue
        for u in correctUrlList:
            try:
                if search_keyword(u, search_key):
                    print(u)  # intermediate result meant for the Node.js parent
                    found_results.append(u)
                    if full_search == 0:
                        break
                else:
                    notfound_results.append(u)
            except Exception:
                pass  # skip links that fail to load or parse
        continue
    except Exception:
        error_urls.append(url)
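
One likely reason the URLs arrive all at once: when Python's stdout is a pipe rather than a terminal, it is block-buffered, so the output of print(u) can sit in the buffer until it fills or the script exits. The following is a minimal sketch, not the script's actual code, of writing one URL per line and flushing immediately. The helper name emit_result and the example.com URLs are assumptions made for illustration.

```python
import sys


def emit_result(url, stream=None):
    """Write one URL per line and flush so the parent process sees it right away.

    emit_result is a hypothetical helper, not part of the original script.
    """
    stream = sys.stdout if stream is None else stream
    # The trailing newline acts as the record delimiter the reader can split on.
    print(url, file=stream, flush=True)


if __name__ == "__main__":
    # Placeholder URLs standing in for the crawler's matches.
    for found in ["https://example.com/a", "https://example.com/b"]:
        emit_result(found)
```

In the crawler loop this would stand in for the bare print(u) call. Starting the interpreter with the -u option, or setting the PYTHONUNBUFFERED environment variable, should have a similar effect without changing the script.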

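The code below is the Node.js script app.js. In it I use console.log to read the intermediate Python output, but that is not what happens: in the outarr array, all returned items end up stored at index 0 instead of at separate indexes.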

// Send request to python script
var spawn = require('child_process').spawn;
var process = spawn('python', ["./webextraction.py", csvData, keywords, req.body.full_search])

var outarr = []

process.stdout.on('data', function(data){

    outarr.push(data.toString().trim())

    console.log(outarr[0])

});
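
A stdout 'data' event delivers an arbitrary chunk of bytes, not one line, so a single chunk may contain several URLs or only part of one, which would explain everything landing at index 0. The sketch below shows the usual reassembly logic: keep the unfinished tail, split on newlines, and store each complete line separately. It is written in Python only to illustrate the logic; the class name LineAssembler is hypothetical, and the same steps would need to be ported into the 'data' handler in app.js.

```python
class LineAssembler:
    """Turn arbitrary stdout chunks into complete lines (hypothetical helper)."""

    def __init__(self):
        self._pending = ""  # text received after the last newline seen so far
        self.lines = []     # one complete, non-empty line per element

    def feed(self, chunk):
        """Add a chunk of text and return the lines it completed."""
        self._pending += chunk
        # Everything before the final newline is complete; the rest waits
        # for the next chunk.
        *complete, self._pending = self._pending.split("\n")
        new_lines = [line.strip() for line in complete if line.strip()]
        self.lines.extend(new_lines)
        return new_lines


if __name__ == "__main__":
    assembler = LineAssembler()
    # Two chunks that split a URL in the middle, as a pipe might deliver them.
    for chunk in ["https://example.com/1\nhttps://exam", "ple.com/2\n"]:
        for line in assembler.feed(chunk):
            print(line)
```

This only works if the producer ends every URL with a newline, which print does by default.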

Please help.

