Python web scraping with gazpacho: returning the URL

Posted 2024-09-22 14:22:16


How can I use gazpacho to return the URL text of each item?

from gazpacho import get, Soup


page = 0
id = 0

try:
    while True:
        page += 1
        url = f'https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/regiao-serrana/petropolis/imoveis?o={page}'

        html = get(url)
        soup = Soup(html)

        offers = soup.find('div', {'class': 'fnmrjs-1 gIEtsI'}, strict=True)


        for item in offers:
            id += 1
            title = item.find('h2', {'class': 'sc-1mbetcw-0 eJfLou sc-ifAKCX jyXVpA'}, strict=True).text
            price = item.find('div', {'class': 'aoie8y-0 hRScWw'}, strict=True).text
            location = item.find('span', {'class': 'sc-7l84qu-1 ciykCV sc-ifAKCX dpURtf'}, strict=True).text
            # link = ????
            print(str(id), "-", title, price, location, link)

except KeyboardInterrupt:
    print('Interrupted using CTRL + C')

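Also, I don't think adding 1 to the page number is a good way to walk through all the pages: once I reach a page number that does not exist, the site goes back to the first page and the loop starts over. If you have any ideas for handling this, I would appreciate it.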


1 answer
User
#1 · Posted 2024-09-22 14:22:16

To get the follow links, you might want to search for all li tags and extract the anchor from each one.

For example:

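from gazpacho import get, Soup

# Plain string: there is nothing to interpolate, so no f-prefix is needed.
url = 'https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/regiao-serrana/petropolis/imoveis?o=1'
# Each listing is an <li>; partial=True matches on part of the class attribute.
offers = Soup(get(url)).find(
    'li',
    {"class": "sc-1fcmfeb-2 juiJqh"},
    partial=True,
)
# Keep the href of the anchor in every listing that has one.
follow_links = [
    o.find("a").attrs.get("href") for o in offers if o.find("a")
]
print("\n".join(follow_links))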

Output:

https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/casa-845800716
https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/casa-para-locacao-15-min-a-pe-do-centro-historico-845797561
https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/apartamento-de-quarto-e-sala-no-centro-791808170
https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/comercio-e-industria/self-storage-guarda-moveis-estoque-e-documentos-548419653
https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/apartamento-841458075
https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/apartamento-02-quartos-sendo-01-suite-mosela-petropolis-rj-805359360
https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/cobertura-duplex-com-2-vagas-proximo-ao-centro-petropolis-rj-818116827
https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/condominio-pedra-do-retiro-759441464
and much more ...
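
One caveat with the snippet above: in its default mode, gazpacho's find() returns None when nothing matches, a single Soup object when there is exactly one match, and a list when there are several. Iterating over the result directly can therefore fail on a page with zero or one listing. A minimal sketch of a guard follows; as_list is a hypothetical helper name, not part of gazpacho.

```python
def as_list(result):
    """Normalize a find()-style result to a list.

    Hypothetical helper (not part of gazpacho): accepts None, a single
    object, or a list, and always returns a list.
    """
    if result is None:
        return []
    if isinstance(result, list):
        return result
    return [result]
```

It could be wrapped around the call, for example offers = as_list(Soup(get(url)).find('li', {"class": "sc-1fcmfeb-2 juiJqh"}, partial=True)), so the list comprehension always receives a list.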

Edit:

To avoid looping over the same pages again, get the total number of available pages and use it in range().
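
As a rough illustration of the page-count idea, the sketch below pulls the last integer out of a pager label. The label format ("Página 1 de 100") and the use of "." as a thousands separator are assumptions about the site, and parse_total_pages is a hypothetical helper name.

```python
import re


def parse_total_pages(label):
    """Return the last integer found in a pager label, or None if there is none.

    Assumes the total page count is the last number in the text and that
    "." is only used as a thousands separator (e.g. "1.234").
    """
    numbers = re.findall(r"\d+", label.replace(".", ""))
    return int(numbers[-1]) if numbers else None


# Assumed label format; the real text on the site may differ.
total_pages = parse_total_pages("Página 1 de 100")
pages_to_visit = list(range(1, total_pages + 1))
```

Returning None instead of raising lets the caller decide what to do when the pager text is missing, for example falling back to a single page.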

Putting it all together, you might end up with something like this:

from gazpacho import get, Soup

base_url = 'https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/regiao-serrana/petropolis'
s = Soup(get(f"{base_url}/imoveis?o=1"))

# The pager text contains "Página"; its last token is taken as the page count.
total_pages = int(
    "".join(
        i.strip().split()[-1] for i
        in s.find("p", {"color": "dark"})
        if "Página" in i.text
    )
)

price_attrs = {"class": "sc-ifAKCX eoKYee"}

for page in range(1, total_pages + 1):
    s = Soup(get(f"{base_url}/imoveis?o={page}"))
    offers = s.find('li', {"class": "sc-1fcmfeb-2 juiJqh"}, partial=True)
    # Keep only listings with both a link and a title, so the three lists
    # below stay aligned when they are zipped together.
    offers = [o for o in offers if o.find("a") and o.find("h2")]

    follow_links = [o.find("a").attrs.get("href") for o in offers]
    descriptions = [o.find("h2", {"color": "dark"})[0].text for o in offers]
    prices = [
        o.find("p", price_attrs).text if o.find("p", price_attrs) else "N/A"
        for o in offers
    ]

    search_results = zip(descriptions, follow_links, prices)
    # Separator as wide as the longest URL on the page (0 if there are none).
    bar = '-' * max((len(link) for link in follow_links), default=0)

    print(
        "\n".join(
            f"{desc}\nPrice: {price}\nURL: {link}\n{bar}" for
            desc, link, price in search_results
        )
    )

Output:

Casa de condomínio à venda com 5 dormitórios em Corrêas, Petrópolis cod:2610
Price: R$ 2.000.000
URL: https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/casa-de-condominio-a-venda-com-5-dormitorios-em-correas-petropolis-cod-2610-821071980
                                                                             
Apartamento todo Reformado no Centro de Petrópolis
Price: N/A
URL: https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/apartamento-todo-reformado-no-centro-de-petropolis-845056149
                                                                             
Casa
Price: N/A
URL: https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/casa-845800716
                                                                             
Casa para locação (15 min. a pé do centro histórico)
Price: R$ 749.990
URL: https://rj.olx.com.br/serra-angra-dos-reis-e-regiao/imoveis/casa-para-locacao-15-min-a-pe-do-centro-historico-845797561
                                                                             
and more ...
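
If the page count cannot be read reliably, another option is to stop as soon as a page brings no links that have not been seen before, which is what happens when the site falls back to the first page. This is only a sketch of that idea, not part of the answer's code: crawl_until_repeat, fetch_links and fake_fetch are hypothetical names, and the fetch function is passed in so the loop can be tried without touching the network.

```python
def crawl_until_repeat(fetch_links, max_pages=1000):
    """Yield (page, new_links) until a page contributes no unseen links.

    fetch_links is any callable that takes a page number and returns a
    list of URLs (hypothetical; in practice it would wrap get() and find()).
    max_pages is a safety cap in case the site never repeats.
    """
    seen = set()
    for page in range(1, max_pages + 1):
        # dict.fromkeys drops duplicates within a page but keeps the order.
        links = list(dict.fromkeys(fetch_links(page)))
        new_links = [link for link in links if link not in seen]
        if not new_links:
            break  # empty page, or the site wrapped around to known results
        seen.update(new_links)
        yield page, new_links


def fake_fetch(page):
    """Stand-in for a real fetch: unknown pages wrap around to page 1."""
    pages = {1: ["a", "b"], 2: ["c"]}
    return pages.get(page, pages[1])


results = list(crawl_until_repeat(fake_fetch))
```

With the stand-in data, page 3 returns the same links as page 1, so the loop ends after page 2. The trade-off is one extra request at the end compared with reading the total up front.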
