Dynamic web scraping with Python when the URL does not change

Posted 2024-04-25 12:12:44

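I have gone through several earlier questions on Stack Overflow, but none of them fully solves my problem.

I am trying to scrape a coin auction website. I can scrape the first page of the dynamic site, but not the remaining pages.

I followed the steps described in "How to scrape multiple pages with an unchanging URL - Python 3".

On Todywalla Auctions there is no form data like the one in that example.

When the page changes, the browser requests the URL https://www.todywallaauctions.com/Results.aspx/getSearchResult, but the URL carries no page index.

Which URL should I use to reach the second page?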


1 answer
User
#1 · Posted 2024-04-25 12:12:44

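This page uses JavaScript: it loads XML from the server through the getSearchResult URL, generates HTML from it, and swaps that HTML into the page. So you do end up on the second page, but that page has no URL of its own and is never served as complete HTML.

To get the data, send a POST request to the getSearchResult URL with the page number in the JSON data as 'pageTop', for example: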

{'pageSize':'15','pageTop':'1','whereCondition':'; @MotherCategory = Coins & Paper Money'}
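
A small helper can make the changing part of that payload explicit. This is only a sketch: build_payload and its parameter names are hypothetical, and the key names and category string are simply copied from the example above. Nothing is sent over the network.

```python
import json


def build_payload(page, page_size=15, category='Coins & Paper Money'):
    """Return the JSON-serializable body for one results page (hypothetical helper)."""
    return {
        'pageSize': str(page_size),
        'pageTop': str(page),
        'whereCondition': '; @MotherCategory = ' + category,
    }


# Show the body that would be POSTed for page 2 (no request is made here).
body = json.dumps(build_payload(2))
print(body)
```

Passing the dictionary to requests.post(url, json=...) serializes it in the same way.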

The response is JSON in which the field "d" holds all the data as XML. You can then search that XML with BeautifulSoup or lxml.
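
To make the shape of that response concrete, here is a sketch that unpacks a made-up response offline. The sample string is an assumption: only the "d" field and the ShortDesc tag come from this answer, the dtLotData spelling is guessed from the lowercase name used in the code, and the NewDataSet root element is invented. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup or lxml, so tag names stay case-sensitive.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response body; the real server output may differ.
sample_response = json.dumps({
    'd': '<NewDataSet>'
         '<dtLotData><ShortDesc>George VI, Rupee 1, 1944</ShortDesc></dtLotData>'
         '<dtLotData><ShortDesc>Gold Token, 20g, India Post logo</ShortDesc></dtLotData>'
         '</NewDataSet>'
})

data = json.loads(sample_response)   # what r.json() would return
root = ET.fromstring(data['d'])      # the XML arrives as one string in "d"

# ElementTree keeps the original case, so search for 'ShortDesc' here.
descriptions = [el.text.strip() for el in root.iter('ShortDesc')]
print(descriptions)
```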

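By the way: the XML uses mixed-case tag names such as ShortDesc, but the code has to search for the lowercase form shortdesc, because the HTML parsers that BeautifulSoup uses lowercase all tag names.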
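
The lowercasing can be seen without any network access. The sketch below feeds a made-up fragment to Python's built-in html.parser, which BeautifulSoup's 'html.parser' backend builds on. TagCollector is a hypothetical helper name and the fragment is an assumption, not real server output.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed('<dtLotData><ShortDesc>Rupees 100, 1960</ShortDesc></dtLotData>')
collector.close()
print(collector.tags)
```

The mixed-case names in the input come back in lowercase, which is why find_all('dtlotdata') matches and find_all('dtLotData') would not.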

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.todywallaauctions.com/Results.aspx/getSearchResult'

# JSON body sent with every request; only 'pageTop' changes between pages
payload = {
    'pageSize': '15',
    'pageTop': '1',
    'whereCondition': '; @MotherCategory = Coins & Paper Money'
}

for page in range(1, 4):
    print(' -', page, ' -')

    payload['pageTop'] = str(page)
    r = requests.post(url, json=payload)
    #print(r.status_code)

    data = r.json()
    #print(data.keys())

    # field 'd' holds the XML as a single string
    text = data['d']
    #print(text[:500])

    # 'html.parser' lowercases tag names, so search for 'dtlotdata' and 'shortdesc'
    soup = BS(text, 'html.parser')
    for item in soup.find_all('dtlotdata'):
        description = item.find('shortdesc')
        print('>', description.get_text(strip=True))

Result:

 - 1  -
> Rupees 100, 1960, signed P. C. Bhattacharya
> Rupees 2, set of 36 notes with different dates and signatures of all varieties
> George VI, Rupees 5, 2nd issue, 1947, signed C. D. Deshmukh
> Burma, George VI, Rupees 5, 1945, signed C. D. Deshmukh
> George VI, Rupees 5, 1st Issue, 1938, signed J. B. Taylor
> George VI, Rupees 2, 1943, signed J. B. Taylor
> George VI, Rupee 1, 1944
> George V, Rupees 5, 1st issue, 1925
> Embossed Postcard with impression of German East Africa Coins
> Indore State, Silver coat of Arms
> Proof Stamp Ingot, Silver 0.6g, Scinde Dawk, ½ Anna stamp of 1851. Rare.
> Stamp Ingot, Silver 12g, Indo Portuguese, 5 Reis stamp of Maria II. Rare.
> Proof Sterling Silver Ingot, 22g, Aden, Rupees 10 stamp of George VI. Rare.
> Proof Sterling Silver Ingot, 16g, Burma, Rupees 2 stamp of George VI. Rare.
> Proof Sterling Silver Ingot, 18g, Rupees 5 stamp of Queen Victoria. Rare.
 - 2  -
> Proof Sterling Silver Ingot, 22g, Rupees 25 stamp of George V. Rare.
> Silver Token, 10g, Small Savings, Rajasthan Post Office
> Gold Token, 20g, India Post logo
> Indo Portuguese, Large Bronze Medal
> First day cover with hand stamp of Calicut 3.10.75
> An old Stamp Box to keep postage stamps, 1930s
> Copper Badges (4), circa 1950’s, four different
> Indo Portuguese, Large Bronze Medal, 360g
> Mahatma Gandhi, Silver Medallion, 38.73g
> Mahatma Gandhi, Gold Medal, 31.16g
> Mahatma Gandhi, Silver Medallion, 38.73g
> Silver Medallion, 29.16g
> Azad Hind / Tamgah-i-Azadi Medal, Medal
> Vir-i-Hind / Warrior of India an Azad Hind Order, 2nd Class Star Badge
> Sher-i-Hind / Tiger of India an Azad Hind Order
 - 3  -
> Bhavnagar, Star shaped Brass Badge
> Gulmarg Golf Club, Silver Medal
> George VI Coronation Medal, Silver, 83.94g, 12th May 1937
> Silver Jubilee Medal of George V and Queen Mary, Silver, 15.58g
> George V Coronation, Silver Medal, 6.37g, 1911
> George V Coronation Medal, Metal
> Edward VII Coronation Medal 1902, Silver
> Campbell Medical School, Bronze Medal
> Victoria’s Diamond Jubilee Medal, Silver, 83.71g
> Thomason College of Civil Engineering, Roorkee, Silver Prize Medal
> Photographic Society of India, Silver Medal, 78.5g
> Jammu and Kashmir, Bronze Medal
> Hunza-Nagar Badge, Copper, 1891
> Victoria Golden Jubilee, Bronze Medal
> Colonial and Indian Exhibition, London 1886, Copper Medal, 78.95g
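
The loop above stops at a fixed page count. If the total number of pages is unknown, one option is to keep requesting pages until one comes back without items. That stop condition is an assumption, since nothing here confirms how the server answers past the last page. The sketch takes the fetching step as a parameter so it can be tried offline; iter_pages, fetch_items, and fake_fetch are hypothetical names.

```python
def iter_pages(fetch_items, max_pages=1000):
    """Yield (page, items) until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        items = fetch_items(page)
        if not items:          # assumed signal that there are no more pages
            break
        yield page, items


# Offline stand-in for the POST request: three pages of made-up lot names.
fake_site = {1: ['lot A', 'lot B'], 2: ['lot C'], 3: ['lot D']}


def fake_fetch(page):
    return fake_site.get(page, [])


collected = list(iter_pages(fake_fetch))
print(collected)
```

Against the real site, fetch_items would wrap the requests.post call and return the list from soup.find_all('dtlotdata'); max_pages is a safety cap in case the server never returns an empty page.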

EDIT: this code also downloads the images.

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.todywallaauctions.com/Results.aspx/getSearchResult'

payload = {
    'pageSize': '15',
    'pageTop': '1',
    'whereCondition': '; @MotherCategory = Coins & Paper Money'
}

for page in range(1, 2):
    print(' -', page, ' -')

    payload['pageTop'] = str(page)
    r = requests.post(url, json=payload)
    #print(r.status_code)

    data = r.json()
    #print(data.keys())

    text = data['d']
    #print(text[:1500])

    # 'html.parser' lowercases tag names, so all lookups below use lowercase
    soup = BS(text, 'html.parser')
    for item in soup.find_all('dtlotdata'):
        #print(''.join(str(x) for x in item.contents))

        shortdesc = item.find('shortdesc').get_text(strip=True)
        print('> shortdesc:', shortdesc)

        listnumber = item.find('listnumber').get_text(strip=True)
        print('> listnumber:', listnumber)

        lotno = item.find('lotno').get_text(strip=True)
        print('> lotno:', lotno)

        imagecount = item.find('imagecount').get_text(strip=True)
        print('> imagecount:', imagecount)

        number = int(imagecount)
        for x in range(1, number+1):
            # zero-pad: list number and lot number to 4 digits, image index to 2
            filename = '{:0>4s}-{:0>4s}-{:02d}.jpg'.format(listnumber, lotno, x)
            # separate names so the API `url` and response `r` are not overwritten
            image_url = 'https://www.todywallaauctions.com/PhotosThumb/' + filename
            print(image_url)
            image = requests.get(image_url)
            with open(filename, 'wb') as f:
                f.write(image.content)
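
The file-naming rule in that loop can be checked on its own. The sketch below isolates it in a hypothetical helper, image_names. The sample values are made up, and the pattern (list number and lot number padded to four digits, image index to two) is taken from the format string above.

```python
BASE = 'https://www.todywallaauctions.com/PhotosThumb/'


def image_names(listnumber, lotno, imagecount):
    """Return the thumbnail file names for one lot, numbered from 1."""
    return [
        '{}-{}-{:02d}.jpg'.format(listnumber.zfill(4), lotno.zfill(4), x)
        for x in range(1, int(imagecount) + 1)
    ]


names = image_names('12', '7', '2')   # made-up values
print(names)
print(BASE + names[0])
```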
