Python: requests.get returns the wrong HTML file

Posted 2024-10-01 09:25:29


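I am trying to extract data from https://essentials.swissdox.ch, which is only reachable over a VPN. I generated a URL from my query parameters and tried to fetch the corresponding HTML file. The problem is that, although the link works, Python gives me the HTML file of the https://essentials.swissdox.ch start page. I would really appreciate any help.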

For example, I want the HTML file for the following URL: https://essentials.swissdox.ch/View/log/index.jsp#&search=true&filter_de=la&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22lissabon%22%7D%2C%7B%22name%22%3A%22filter_de%22%2C%22value%22%3A%22de%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%222020-02-04%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%222020-02-04%22%7D%2C%7B%22name%22%3A%22SEARCH_tiall%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D

Instead, I get the HTML file of this page: https://essentials.swissdox.ch/View/log/index.jsp?reset=true

Here is what I have so far:

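import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumed source of UserAgent

# Set keywords for URL
keyword_queries = ['lissabon']
startdate = "2007-01-01"
enddate = "2007-01-01"

# Encode and hit URL
for keyword in keyword_queries:
    html_keyword = urllib.parse.quote_plus(keyword)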
    URL = "https://essentials.swissdox.ch/View/log/index.jsp#&search=true&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22" + html_keyword + "%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%22" + startdate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%22" + enddate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_tiall%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D"
    weburl  = urllib.request.urlopen(URL)

    
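    # Hit the URL (headers must be passed by keyword; the second
    # positional argument of requests.get is params, not headers)
    ua = UserAgent()
    page = requests.get(URL, headers={"User-Agent": ua.random})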
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find('div', class_='documentlist')
    print(page.content)

1 answer
User
#1 · Posted 2024-10-01 09:25:29

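It looks like you are using "#" instead of "?" in the URL. Query parameters normally start after a "?", with each key joined to its value by "=" and the pairs separated by "&".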
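As a rough sketch of that syntax, the snippet below builds a "?"-style URL with `urllib.parse.urlencode` instead of concatenating percent-encoded strings by hand. The helper name `build_search_url` and the reduced `formdata` list are illustrative assumptions, and whether the swissdox server actually accepts these parameters as a query string is not confirmed by the question.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://essentials.swissdox.ch/View/log/index.jsp"


def build_search_url(base, keyword, start, end):
    """Hypothetical helper: encode the search fields as a real query string."""
    # Reduced version of the formdata list from the question (assumption:
    # the remaining fields can be added in the same way).
    formdata = [
        {"name": "SEARCH_query", "value": keyword},
        {"name": "SEARCH_pubDate_lower", "value": start},
        {"name": "SEARCH_pubDate_upper", "value": end},
    ]
    params = {
        "search": "true",
        "sortorder": "pubDateTime desc",
        "formdata": json.dumps(formdata),
    }
    # urlencode percent-encodes every key and value, so no invalid
    # characters end up in the URL.
    return base + "?" + urlencode(params)


url = build_search_url(BASE, "lissabon", "2007-01-01", "2007-01-01")
print(url)

# Round trip: the parameters are now in the query part and decode cleanly.
decoded = parse_qs(urlsplit(url).query)
print(decoded["search"], json.loads(decoded["formdata"][0])[0])
```

Letting `urlencode` do the escaping avoids the invalid-character problem that hand-built URLs tend to run into.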

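A "#" starts the fragment, which tells the client to jump to a specific section within the page. The fragment is handled on the client side and is not sent to the server, so the request effectively goes to https://essentials.swissdox.ch/View/log/index.jsp, which is the response you got. Simply changing "#" to "?" in the original URL seems to raise an error about invalid characters, so make sure you use valid characters when percent-encoding the query parameters.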
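A minimal way to see the difference, using only the standard library, is to split both forms of the URL and compare the parts. The shortened URLs and the helper name `describe` are illustrative assumptions, not the full URL from the question.

```python
from urllib.parse import urlsplit, urldefrag


def describe(url):
    """Return the path, query and fragment parts of a URL."""
    parts = urlsplit(url)
    return {"path": parts.path, "query": parts.query, "fragment": parts.fragment}


with_hash = "https://essentials.swissdox.ch/View/log/index.jsp#&search=true&sortorder=pubDateTime%20desc"
with_qmark = "https://essentials.swissdox.ch/View/log/index.jsp?search=true&sortorder=pubDateTime%20desc"

# With "#", everything after it is the fragment and the query is empty.
print(describe(with_hash))

# With "?", the same text is parsed as the query string.
print(describe(with_qmark))

# urldefrag shows the part of the URL that remains once the fragment is removed.
without_fragment, fragment = urldefrag(with_hash)
print(without_fragment)
```

With the "#" form the query part is empty, which is consistent with the server returning the plain index page.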

Wiki - URL Syntax
