Can't scrape a tag with BeautifulSoup because logging in with requests no longer works

Posted 2024-09-29 22:34:05

To scrape the prices you have to be logged in. This used to work, but they have since changed something on the site. The code below still works for the URLs, auctions, titles, and results; only the prices no longer come back. Whenever the value in the results list indicates "bid goes on" or "sold", the vehicle should have a price value.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
from multiprocessing import Pool 
from multiprocessing import cpu_count
from IPython.core.interactiveshell import InteractiveShell

# Display all output
InteractiveShell.ast_node_interactivity = "all"
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 100
pd.options.display.max_columns = None

# Scrape all vehicles per auction
data_list = [{"searchScope": "SC0520", #value options unique per auction (SC0520 = Indy 2020)
    "searchMake": "Plymouth",
    "searchModel": "Cuda",
    "searchYearStart": "1970",
    "searchYearEnd": "1971",
    "submit": ""},{"searchScope": "SC0520",
    "searchMake": "Dodge",
    "searchModel": "Challenger",
    "searchYearStart": "1970",
    "searchYearEnd": "1971",
    "submit": ""}]

headers = {
    "Referer": "https://www.mecum.com",
}

login = {"email": "arjenvgeffen@gmail.com",
        "password": "appeltaart13"}

# Get all the newest challenger and cuda lots with the function below
urls = []
title = []
auction = []
results = []
price = []

def newest_vehicles(url):
    with requests.Session() as req:
        r = req.post("https://www.mecum.com/includes/login-action.cfm", data=login)
        for data in data_list:
            for item in range(1, 2): #scrapes one page 
                r = req.post(url.format(item), data=data, headers=headers)
                soup = BeautifulSoup(r.content, 'html.parser')
                target = soup.select("div.lot")
                for tar in target:
                    urls.append(tar.a.get('href'))
                    title.append(tar.select_one("a.lot-title").text)
                    price.append(tar.span.text if tar.span and tar.span.text else np.NaN)  # guard against a missing span
                    auction.append(tar.select_one("div.lot-number").text.strip())
                    results2 = tar.select("div[class*=lot-image-container]")
                    for result2 in results2:
                        results.append(' '.join(result2['class']))

newest_vehicles("https://www.mecum.com/search/page/{}/")

# There should be 27 unique URLs
len(urls) #27
len(set(urls)) #27

urls[:2]
title[:2]
results[:2]
auction[:2]
price[:2]
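
Since only the login-gated prices are missing, a first step is to confirm whether the POST to the login endpoint still succeeds at all. A minimal check might look like the sketch below; note that the printed status code, redirect history, and cookies are only hints, since it is an assumption that this endpoint signals a successful login through a redirect or a session cookie rather than, say, a JSON body.

def check_login():
    # Post the credentials and return the response so it can be inspected.
    # How mecum.com signals success is an assumption here: a redirect or a
    # session cookie is a good sign; a 200 that serves the login form again is not.
    with requests.Session() as req:
        r = req.post("https://www.mecum.com/includes/login-action.cfm",
                     data=login, headers=headers)
        print(r.status_code)           # 200 alone does not prove the login worked
        print(r.history)               # redirects issued by the login endpoint
        print(req.cookies.get_dict())  # look for a session/auth cookie
        return r

r_login = check_login()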

Bonus question (you win nothing except pride and, most likely, an accepted answer :)). If I already have a list of URLs, how can I use those URLs as the input of a function that scrapes the price for each one? The example below roughly does this per URL for the estimate (using the final_urls list). I would like a similar function that scrapes the price of each URL, but that will need some extra code to log in first (a rough sketch of what that could look like follows after the estimate example below). You can scrape the price like this: price = soup.find("span", class_="lot-price").text

final_urls = ['https://www.mecum.com/lots/SC0520-414334/1970-plymouth-cuda/',
 'https://www.mecum.com/lots/SC0520-414676/1970-plymouth-aar-cuda/',
 'https://www.mecum.com/lots/SC0520-414677/1971-plymouth-cuda-convertible/',
 'https://www.mecum.com/lots/SC0520-414678/1971-plymouth-cuda-convertible/',
 'https://www.mecum.com/lots/SC0520-414733/1971-plymouth-cuda/']

def scrape_estimate(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        est = soup.find(class_=["lot-estimate"])
        if est:
            # Strip dollar signs, commas, quotes and whitespace from the raw text
            estimate = re.sub(r"[$,\n\t' ]", "", est.contents[0])
        else:
            estimate = np.NaN
        return estimate

# Example to check just one URL 
scrape_estimate('https://www.mecum.com/lots/FL0120-397356/1971-plymouth-hemi-cuda-sox-and-martin-pro-stock/')

# Scrapes all the URLs in the final_urls list
p = Pool(cpu_count())
results_estimate = p.map(scrape_estimate, final_urls)
p.close()
p.join()

results_estimate
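
For the bonus question, one possible shape is the scrape_estimate function above with a login step added per worker. The sketch below makes several assumptions: the span class "lot-price" is taken from the question text, the login endpoint is the one used at the top, and each worker opens its own session because a requests.Session cannot be shared across multiprocessing workers (and, as in the original example, the module-level login and headers dicts are assumed to reach the workers via fork).

def scrape_price(url):
    # Log in inside the worker: sessions are not shared across processes,
    # so each call authenticates before fetching the lot page.
    with requests.Session() as req:
        req.post("https://www.mecum.com/includes/login-action.cfm",
                 data=login, headers=headers)
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        tag = soup.find("span", class_="lot-price")  # class name assumed from the question
        if tag and tag.text:
            return re.sub(r"[$,\n\t' ]", "", tag.text)
        return np.NaN

p = Pool(cpu_count())
results_price = p.map(scrape_price, final_urls)
p.close()
p.join()

Logging in once per URL is wasteful; if that matters, the login could instead be done once per worker through Pool's initializer argument.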

I really hope someone can help me figure this out. Thanks!

