APIs and web scraping

Posted 2024-10-01 04:46:54


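I am trying to access the contents of the text files on a page. Because each text file has a different URL, I cannot generate the URLs in Python, and I cannot scrape the contents with Pandas either. So I tried using the API instead. When I request a user token, I get the following result: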

{
  "jwt": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjU5MDR9.b9elxkmNj0kmWxDPjal0_mLY9UPg7enoT7Cdg7gN1d0"
}

Now I don't know how to use it to access all the text files on the first page mentioned above. Can someone guide me on how to proceed?


1 answer
Anonymous user
Answer #1 · posted 2024-10-01 04:46:54

This script walks from page 1 to the last page and selects all links that end with .txt:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

base_url = 'https://usda.library.cornell.edu'

url = 'https://usda.library.cornell.edu/concern/publications/c821gj76b?locale=en&page=1#release-items'

# Fetch and parse the first page of the listing.
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

page = 1
while True:
    print('Page no.{}...'.format(page))
    print('-' * 80)

    # Collect every link inside #release-items whose href ends with ".txt".
    txt_urls = [a["href"] for a in soup.select('#release-items a[href$=".txt"]')]
    pprint(txt_urls)

    # Follow the "next" pagination link; stop when it is missing or points to "#".
    m = soup.select_one('a[rel="next"][href]')
    if m and m['href'] != '#':
        soup = BeautifulSoup(requests.get(base_url + m['href']).text, 'html.parser')
        page += 1
    else:
        break

Prints:

Page no.1...
--------------------------------------------------------------------------------
['https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/kd17d5288/ms35tm800/agpr0719.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/r494vw17c/q524jz702/agpr0619.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/bc386t90p/vx021r07n/agpr0519.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/3484zr667/4j03d7561/agpr0419.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/f7623m42k/qf85nk40w/agpr0319.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/7w62fg32b/n009w815n/agpr0219.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/kk91fs55d/z890s0860/agpr0219.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/t435gj88z/8910k0903/agpr0119.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/m613n410w/41687p68x/01-30-19_Report_Reschedule_ASB_Notice_Final.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/st74cv012/0z709086s/agpr1118.txt']
Page no.2...
--------------------------------------------------------------------------------
['https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/5q47rs05x/m900nx65x/agpr1018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/4b29b953w/m900nx64n/agpr0918.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/5h73px257/1c18dh137/AgriPric-08-29-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/t722hb16b/76537257b/AgriPric-07-30-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/pz50gx32d/qb98mg88k/AgriPric-06-28-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/vd66w115f/p2676w80r/AgriPric-05-30-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/9c67wp20r/bc386k622/AgriPric-04-27-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/r494vm201/h128ng14d/AgriPric-03-28-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/z316q273n/37720f04c/AgriPric-02-27-2018_correction.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/5d86p1433/zp38wd92f/AgriPric-01-30-2018.txt']

...and so on.
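
If you would rather collect the links than print them, the same loop can be wrapped in a generator. The following is only a sketch under a few assumptions: the name `iter_txt_urls` and the `fetch` parameter are made up for illustration, and it assumes the listing keeps the same `#release-items` and `rel="next"` markup that the script above relies on.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_txt_urls(start_url, fetch):
    """Yield every .txt link in #release-items, following rel="next" links.

    `fetch` (a hypothetical parameter) is any callable that takes a URL and
    returns that page's HTML as a string.
    """
    url = start_url
    seen_pages = set()  # guards against a pagination loop
    while url and url not in seen_pages:
        seen_pages.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")

        # Same selector as the script above; urljoin also handles relative hrefs.
        for a in soup.select('#release-items a[href$=".txt"]'):
            yield urljoin(url, a["href"])

        # Stop when there is no usable "next" link.
        nxt = soup.select_one('a[rel="next"][href]')
        if nxt is None or nxt["href"] == "#":
            break
        url = urljoin(url, nxt["href"])
```

With requests installed, `list(iter_txt_urls(url, lambda u: requests.get(u, timeout=30).text))` should give one flat list of all the links. Passing the fetcher in as an argument also makes it possible to try the function against saved HTML without touching the network.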

You can then download the text files using these links, for example:

txt_data = requests.get('https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/kd17d5288/ms35tm800/agpr0719.txt').text
print(txt_data)

Prints (but you can save it to a file instead of printing it to the screen):

Agricultural Prices

ISSN: 1937-4216

Released July 31, 2019, by the National Agricultural Statistics Service 
(NASS), Agricultural Statistics Board, United States Department of 
Agriculture (USDA).

June Prices Received Index Up 1.0 Percent 

...etc.
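
Saving to files instead of printing could look roughly like the sketch below. The helper names `local_name` and `save_txt_files`, and the `fetch` parameter, are made up for illustration. One detail from the listing above motivates the naming scheme: `agpr0219.txt` appears twice on page 1 under different directories, so using the bare file name would overwrite one of them.

```python
from pathlib import Path
from urllib.parse import urlsplit


def local_name(url):
    """Build a file name from the last two path segments of the URL.

    Two segments keep names distinct when the same file name appears
    under different directories.
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    return "_".join(parts[-2:])


def save_txt_files(urls, out_dir, fetch):
    """Fetch each URL with `fetch` and write the text under `out_dir`.

    `fetch` (a hypothetical parameter) is any callable that takes a URL
    and returns the file's text.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = out_dir / local_name(url)
        target.write_text(fetch(url), encoding="utf-8")
        saved.append(target)
    return saved
```

For example, `save_txt_files(txt_urls, 'reports', lambda u: requests.get(u).text)` would write each report from one page into a `reports` directory (the directory name is arbitrary).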
