Trying to scrape the Apply Now and Learn More URLs, but cannot get them using Beautiful Soup and Python

Posted 2024-09-30 05:21:38


I am scraping this link: https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds

to get the Apply Now and Learn More URLs, using this code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re


AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']



html_1 = urlopen(AMEXurl[0])
soup_1 = BeautifulSoup(html_1,'lxml')
address = soup_1.find('div',attrs={"class" : identity[0]})


for x in address.find_all('a',id = 'html-link'):
    print(x)

The output I get contains links that do not work:

<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&amp;intlink=in-amex-cardshop-allcards-apply-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Learn More</span></div></a>
<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&amp;intlink=in-amex-cardshop-allcards-apply-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Learn More</span></div></a>

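Below is an image of the HTML code from which I am trying to get the "Apply Now" and "Learn More" URLs: [image: html code for the links]

This is the part of the page I want to get the URLs from:

[image: apply now and learn more links]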

I would like to know whether any change to the code would let me get the "Apply Now" and "Learn More" URLs for all 7 cards.


2 Answers

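You can modify this to use your list and syntax, but it gets the links I believe you want. Note that find does not return what you need here, whereas find_all with href=True, taking the first match, does.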

import re
import requests
from bs4 import BeautifulSoup

nurl = 'https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds'
npage = requests.get(nurl)
nsoup = BeautifulSoup(npage.text, "html.parser")

# Match anchors by their visible text and keep only those that carry an href;
# the [0:1] slice keeps just the first match.
for link in nsoup.find_all('a', string=re.compile('Apply Now'), href=True)[0:1]:
    print(link.get('href'))
for link in nsoup.find_all('a', string=re.compile('Learn'), href=True)[0:1]:
    print('https://www.americanexpress.com/in/' + link.get('href'))

Output

https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&intlink=in-amex-cardshop-allcards-apply-AmericanExpressPlatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA
https://www.americanexpress.com/in/charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressPlatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA
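
If you want every distinct link present in the HTML rather than only the first match, something along these lines might work. This is a minimal sketch, not tested against the live site: SAMPLE_HTML is a shortened stand-in for the markup shown in the question, and collect_links is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the carousel markup in the question (assumption:
# the real page repeats each anchor, as the question's output suggests).
SAMPLE_HTML = """
<div class="filmstrip_container">
<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/" id="html-link"><div><span>Learn More</span></div></a>
<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/" id="html-link"><div><span>Learn More</span></div></a>
</div>
"""


def collect_links(html, label):
    """Return the distinct hrefs of anchors whose text matches label, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(label)
    hrefs = []
    # string= is checked against each anchor's .string, which resolves through
    # the single nested <div><span> chain to the visible text.
    for anchor in soup.find_all("a", string=pattern, href=True):
        href = anchor["href"]
        if href not in hrefs:  # skip the repeated carousel copies
            hrefs.append(href)
    return hrefs


apply_links = collect_links(SAMPLE_HTML, "Apply Now")
learn_links = collect_links(SAMPLE_HTML, "Learn")
print(apply_links)
print(learn_links)
```

This only finds links that are actually in the downloaded HTML, so it will not return cards that the page loads later.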

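The URLs you are looking for are not all stored in the HTML. A further request is needed, which returns the information as JSON. That request also requires a session ID. For example: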

from bs4 import BeautifulSoup
import requests
import json
    
url = 'https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

for script in soup.find_all('script'):
    if script.contents and "intlUserSessionId" in script.contents[0]:
        json_raw = script.contents[0][script.contents[0].find('{'):]
        json_data = json.loads(json_raw)
        session_id = json_data["pageData"]["pageValues"]["intlUserSessionId"]

url2 = 'https://acquisition-1.americanexpress.com/api/acquisition/digital/v1/shop/us/cardshop-api/api/v1/intl/content/compare-cards/in/default'
r2 = requests.get(url2, params={'sessionId': session_id})
json_data = r2.json()

for entry in json_data:
    cta_group = entry["ctaGroup"][0]
    click_url = cta_group['clickUrl']
    print(f"{cta_group['text']} - {click_url}")

    learn_more = entry['learnMore']['ctaGroup'][0]
    print(f"{learn_more['text']} - {learn_more['clickUrl']}")

This gives you the following links:

Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:membershiprewards_credit&feePay=P1
Learn more - credit-cards/membership-rewards-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:travel_platinum&feePay=T1
Learn more - credit-cards/platinum-travel-credit-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:gold_charge&feePay=G4&intlink=mainapplynow
Learn more - charge-cards/gold-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_reserve&feePay=LV&intlink=mainapplynow
Learn more - credit-cards/platinum-reserve-credit-card/
Learn more - credit-cards/jet-airways-platinum-credit-card/
Learn more - credit-cards/jet-airways-platinum-credit-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge
Learn more - charge-cards/platinum-card/
Learn more - credit-cards/payback-card/
Learn more - credit-cards/payback-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:smart_earn&feepay=ES1
Learn more - credit-cards/smart-earn-credit-card/
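
One fragile spot in the code above is slicing the script text from the first '{' and passing the rest to json.loads, which fails if anything follows the object (a trailing semicolon, for instance). A possible alternative, sketched here with only the standard library, is json.JSONDecoder.raw_decode, which stops at the end of the first object. SAMPLE_SCRIPT, its variable name and the session value are made up for illustration; only the key path comes from the code above.

```python
import json


def first_json_object(text):
    """Decode the first JSON object in text, ignoring anything before or after it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in text")
    # raw_decode returns (object, end_index) and tolerates trailing characters.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Hypothetical script body; the real page's script may be shaped differently.
SAMPLE_SCRIPT = (
    'window.hypotheticalState = '
    '{"pageData": {"pageValues": {"intlUserSessionId": "abc-123"}}};'
)

data = first_json_object(SAMPLE_SCRIPT)
session_id = data["pageData"]["pageValues"]["intlUserSessionId"]
print(session_id)
```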

The Learn more URLs returned by the API are relative, so the site's base URL needs to be prepended to them.
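
A minimal sketch of that step, assuming the base is https://www.americanexpress.com/in/ (the prefix used in the other answer) and that each entry has the ctaGroup / learnMore shape the code above reads. SAMPLE_ENTRIES is a hand-written stand-in for the API response, and absolute_links is a hypothetical helper name. urljoin leaves already-absolute URLs untouched, so both kinds of link can go through the same call.

```python
from urllib.parse import urljoin

# Assumed base URL; the trailing slash matters for urljoin.
BASE_URL = "https://www.americanexpress.com/in/"

# Hand-written stand-in for one entry of the JSON response (assumption).
SAMPLE_ENTRIES = [
    {
        "ctaGroup": [
            {
                "text": "Apply Now",
                "clickUrl": "https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge",
            }
        ],
        "learnMore": {
            "ctaGroup": [
                {"text": "Learn more", "clickUrl": "charge-cards/platinum-card/"}
            ]
        },
    }
]


def absolute_links(entries, base_url=BASE_URL):
    """Return (text, absolute_url) pairs for each entry's two call-to-action links."""
    links = []
    for entry in entries:
        main_cta = entry["ctaGroup"][0]
        learn_more = entry["learnMore"]["ctaGroup"][0]
        for cta in (main_cta, learn_more):
            # Relative URLs are resolved against base_url; absolute ones pass through.
            links.append((cta["text"], urljoin(base_url, cta["clickUrl"])))
    return links


results = absolute_links(SAMPLE_ENTRIES)
for text, url in results:
    print(f"{text} - {url}")
```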
