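I'm trying to extract data from https://essentials.swissdox.ch, which is only reachable over a VPN. I built a URL from my query parameters and tried to fetch the corresponding HTML. The problem is that although the link works in the browser, Python returns the HTML of the https://essentials.swissdox.ch start page. I'd really appreciate any help.
Instead of the search results, I get the HTML of this page: https://essentials.swissdox.ch/View/log/index.jsp?reset=true
Here is what I have so far: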
import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumed source of UserAgent; the original post omits its imports

# Set keywords for URL
keyword_queries = ['lissabon']
startdate = "2007-01-01"
enddate = "2007-01-01"

# Encode and hit URL
for keyword in keyword_queries:
    html_keyword = urllib.parse.quote_plus(keyword)
    URL = "https://essentials.swissdox.ch/View/log/index.jsp#&search=true&sortorder=pubDateTime%20desc&formdata=%5B%7B%22name%22%3A%22SEARCH_mltid%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_sc%22%2C%22value%22%3A%22swissdox%22%7D%2C%7B%22name%22%3A%22SEARCH_query%22%2C%22value%22%3A%22" + html_keyword + "%22%7D%2C%7B%22name%22%3A%22SEARCH_exact%22%2C%22value%22%3A%22true%22%7D%2C%7B%22name%22%3A%22dateDropdown%22%2C%22value%22%3A%22-1%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_lower%22%2C%22value%22%3A%22" + startdate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_pubDate_upper%22%2C%22value%22%3A%22" + enddate + "%22%7D%2C%7B%22name%22%3A%22SEARCH_tiall%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_source%22%2C%22value%22%3A%22%22%7D%2C%7B%22name%22%3A%22SEARCH_author%22%2C%22value%22%3A%22%22%7D%5D"
    weburl = urllib.request.urlopen(URL)
    # Hit the url; the User-Agent must be passed via headents= keyword's correct name, headers=
    ua = UserAgent()
    page = requests.get(URL, headers={"User-Agent": ua.random})
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find('div', class_='documentlist')
    print(page.content)
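It looks like you are using "#" instead of "?" in the URL. Query parameters normally start with "?", with each key and value joined by "=" and the pairs separated by "&".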
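As a quick illustration of the difference, the sketch below parses two shortened variants of the URL with the standard library. The two URLs are abbreviated for readability and are assumptions made for the example, not the full search URL from the question.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened, illustrative variants of the URL from the question.
with_fragment = ("https://essentials.swissdox.ch/View/log/index.jsp"
                 "#&search=true&sortorder=pubDateTime%20desc")
with_query = ("https://essentials.swissdox.ch/View/log/index.jsp"
              "?search=true&sortorder=pubDateTime%20desc")

frag_parts = urlsplit(with_fragment)
query_parts = urlsplit(with_query)

# With "#", everything after it is the fragment and the query is empty.
print(repr(frag_parts.query))
print(frag_parts.fragment)

# With "?", the same text is parsed as key/value query parameters.
print(parse_qs(query_parts.query))
```

With "#", the parser reports an empty query string, so there are no parameters for the server to act on.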
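A "#" starts the fragment, which means "jump to a specific part of the page". The fragment is handled on the client side and is not sent to the server, so the request only reaches https://essentials.swissdox.ch/View/log/index.jsp, which is the response you got. Changing "#" to "?" appears to raise an error about invalid characters in the original URL, so make sure you use only valid characters when percent-encoding the query parameters.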
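One way to avoid hand-written percent-encoding is to build the form data as a Python structure and let the standard library encode it. This is only a sketch: the field names are copied from the URL in the question, `build_search_url` is a hypothetical helper, and whether the server accepts these values as "?" query parameters is an assumption that the thread does not confirm.

```python
import json
from urllib.parse import quote, unquote, urlsplit

BASE = "https://essentials.swissdox.ch/View/log/index.jsp"


def build_search_url(keyword, startdate, enddate):
    """Hypothetical helper: encode the search form as query parameters."""
    # Field names taken from the URL in the question.
    form = [
        {"name": "SEARCH_mltid", "value": ""},
        {"name": "SEARCH_sc", "value": "swissdox"},
        {"name": "SEARCH_query", "value": keyword},
        {"name": "SEARCH_exact", "value": "true"},
        {"name": "dateDropdown", "value": "-1"},
        {"name": "SEARCH_pubDate_lower", "value": startdate},
        {"name": "SEARCH_pubDate_upper", "value": enddate},
        {"name": "SEARCH_tiall", "value": ""},
        {"name": "SEARCH_source", "value": ""},
        {"name": "SEARCH_author", "value": ""},
    ]
    # Compact JSON, then percent-encode every reserved character.
    formdata = quote(json.dumps(form, separators=(",", ":")), safe="")
    sortorder = quote("pubDateTime desc", safe="")
    # Assumption: the server reads these from the query string ("?").
    return f"{BASE}?search=true&sortorder={sortorder}&formdata={formdata}"


url = build_search_url("lissabon", "2007-01-01", "2007-01-01")
print(url)
```

The resulting string can be passed to `requests.get` in place of the hand-assembled URL. If the site only reads the search state from the fragment in the browser via JavaScript, a plain HTTP request will still not return the result list.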
Wiki - URL Syntax