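I am trying to scrape some data from this link: http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100 For example, I am trying to extract each reviewer's name with BeautifulSoup, but it does not work. I have used BeautifulSoup on other websites before and it worked perfectly! I don't know what is going on. Can you help me? The code is below: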
from bs4 import BeautifulSoup
import os
import urllib.request
file1 = open(os.path.expanduser(r"~/Desktop/Skytrax Reviews1.csv"), "wb")
file1.write(b"Reviewer" + b"\n")
WebSites = ["http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"]
# looping through each site until it hits a break. I will create a loop. It is not ready yet
for theurl in WebSites:
    thepage = urllib.request.urlopen(theurl)
    print(thepage)
    soup = BeautifulSoup(thepage,'lxml')
    print(soup) #<-------This is the main problem
    #Maybe it is not correct too but the main problem is at the above lines
    for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}).text:
        print(Reviewer)
        Record1 = Reviewer
        file1.write(bytes(Record1, encoding="ascii", errors='ignore') + b"\n")
file1.close()
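If you open the site with Chrome's Network Tools or Firebug, you will see that it uses cookies to validate requests. You can simulate the cookies by creating a dict in Python and sending it along with your request. I used the requests library for this. Also, you should not put .text on the loop expression, because it raises an error: findAll returns a list of tags, and that list has no .text attribute. Read .text from each tag inside the loop instead.

The site is not returning the content you see in the browser. Try changing the User-Agent of the request.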
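A minimal sketch of the suggestions above, not the original answerer's code: it sends a cookie dict and a custom User-Agent with requests, then reads the text from each matched tag instead of from the result list. The cookie name and value and the User-Agent string are placeholders (assumptions). Copy the real ones from your browser's network tools. The helper names fetch_html and extract_reviewers are hypothetical, and the class names are taken from the question, so they may not match the site's current markup.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: placeholder values. Replace them with the cookies and the
# User-Agent that your own browser sends to the site.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
COOKIES = {"example_cookie_name": "example_cookie_value"}


def fetch_html(url, headers=HEADERS, cookies=COOKIES):
    """Download a page, sending browser-like headers and cookies."""
    response = requests.get(url, headers=headers, cookies=cookies, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_reviewers(html):
    """Return the text of every element carrying both classes from the question."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns a list of tags; take the text of each tag,
    # never of the list itself.
    tags = soup.select(".text_sub_header.userStatusWrapper")
    return [tag.get_text(" ", strip=True) for tag in tags]
```

To use it, call extract_reviewers(fetch_html(theurl)) for each URL in WebSites and write each returned string to the CSV file. If the list comes back empty, print the downloaded HTML to check whether the site served a different page from the one shown in the browser.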