Scraping data with BeautifulSoup in Python

from bs4 import BeautifulSoup
import os
import urllib.request

file1 = open(os.path.expanduser(r"~/Desktop/Skytrax Reviews1.csv"), "wb")
file1.write(b"Reviewer" + b"\n")
WebSites = ["http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"]

# looping through each site until it hits a break. I will create a loop. It is not ready yet
for theurl in WebSites:
    thepage = urllib.request.urlopen(theurl)
    print(thepage)
    soup = BeautifulSoup(thepage,'lxml')
    print(soup)  #<-------This is the main problem
    #Maybe it is not correct too but the main problem is at the above lines
    for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}).text:
        print(Reviewer)
        Record1 = Reviewer
        file1.write(bytes(Record1, encoding="ascii", errors='ignore') + b"\n")
file1.close()

2 Answers

Anonymous user

#1 · Edited 2024-09-28 17:02:40

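If you open the site with Chrome's network tools or Firebug, you will see that it uses cookies to validate requests.

You can emulate those cookies by building a dict in Python and sending it along with your request.

In my example I use requests. Also, you should not put .text on the expression you loop over: findAll returns a list of tags, not a single tag, so .text on it raises an AttributeError. Call get_text() on each tag inside the loop instead.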

from bs4 import BeautifulSoup
import requests

# Cookie values copied from a browser session; they are likely session-specific,
# so replace them with the ones your own browser shows.
cookies = {
    'PHPSESSID': '1gd0sknluds2uvumsglth523g5',
    'visid_incap_965359': 'UGNtvJR1TAmP1y+/M85QuJ1s3lgAAAAAQUIPAAAAAAB5IOYuRCw/9mMOpTnRDCJ6',
    'incap_ses_315_965359': 'PRZ8WIgqnhyeicz5PxxfBLFs3lgAAAAAYWoblc6exwqhEeGRPqgA5Q=='
}

url = 'http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100'
response = requests.get(url, cookies=cookies)
soup = BeautifulSoup(response.content, "html.parser")

# findAll returns a list of tags, so read the text from each tag, not from the list.
for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}):
    print(Reviewer.get_text(strip=True))
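
The point about .text can be checked without touching the network. The sketch below parses a small inline snippet; the markup and the reviewer names are made up for illustration and only reuse the class names from the question. It uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the question.
SAMPLE_HTML = """
<div>
  <h3 class="text_sub_header userStatusWrapper"> Alice (Vietnam) </h3>
  <h3 class="text_sub_header userStatusWrapper"> Bob (Australia) </h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = soup.find_all(attrs={"class": "text_sub_header userStatusWrapper"})

# find_all returns a list-like ResultSet, so .text on it raises AttributeError.
try:
    matches.text
    resultset_has_text = True
except AttributeError:
    resultset_has_text = False

# Each element of the ResultSet is a Tag, and a Tag does support get_text().
reviewers = [tag.get_text(strip=True) for tag in matches]

print(resultset_has_text)
print(reviewers)
```

Once the text is collected per tag like this, writing each name to the CSV file works the same way as in the question.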

Anonymous user

#2 · Edited 2024-09-28 17:02:40

The site is not returning the same content you see in your browser. Try:

wget -qO- 'http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100'

Or try changing the User-Agent of the request.
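
Changing the User-Agent with requests could look roughly like the sketch below. The User-Agent string and the helper names build_request and fetch are assumptions made for illustration, and whether this particular site accepts the request on the strength of the header alone has not been verified.

```python
import requests

# Hypothetical browser-like User-Agent; substitute the value your own browser sends.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

URL = "http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"


def build_request(url, user_agent=BROWSER_UA):
    """Prepare a GET request that carries the given User-Agent header."""
    return requests.Request("GET", url, headers={"User-Agent": user_agent}).prepare()


def fetch(url, user_agent=BROWSER_UA):
    """Send the prepared request and return the response body as text."""
    with requests.Session() as session:
        response = session.send(build_request(url, user_agent), timeout=30)
        response.raise_for_status()
        return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch(URL)
```

Keeping the request construction separate from the sending step makes it easy to inspect the headers before anything goes over the wire.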
