Data scraping with BeautifulSoup in Python

Posted 2024-09-28 17:02:40


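I am trying to scrape some data from this link: http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100 For example, I am trying to extract each reviewer's name with BeautifulSoup, but it does not work. I have used BeautifulSoup on other websites before and it worked perfectly, so I don't know what is going wrong here. Can you help me? Here is the code: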

from bs4 import BeautifulSoup
import os
import urllib.request


file1 = open(os.path.expanduser(r"~/Desktop/Skytrax Reviews1.csv"), "wb")

file1.write(b"Reviewer" + b"\n")

WebSites = ["http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"]


# looping through each site until it hits a break. I will create a loop. It is not ready yet
for theurl in WebSites:
    thepage = urllib.request.urlopen(theurl)
    print(thepage)
    soup = BeautifulSoup(thepage,'lxml')
    print(soup)    #<-------This is the main problem 

#Maybe it is not correct too but the main problem is at the above lines
    for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}).text:
        print(Reviewer)

        Record1 = Reviewer
        file1.write(bytes(Record1, encoding="ascii", errors='ignore') + b"\n")


file1.close()

2 Answers

If you open the site with Chrome's network tools or Firebug, you will see that it uses cookies to validate requests.

You can mimic the cookies by creating a dict in Python and sending it along with your request.

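In my example I use requests. Also, you should not put .text on the findAll(...) call in the for statement: findAll returns a list of tags rather than a single tag, so accessing .text on it raises an error. Read the text from each element inside the loop instead.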

from bs4 import BeautifulSoup
import requests

cookies = {
'PHPSESSID':'1gd0sknluds2uvumsglth523g5',
'visid_incap_965359':'UGNtvJR1TAmP1y+/M85QuJ1s3lgAAAAAQUIPAAAAAAB5IOYuRCw/9mMOpTnRDCJ6',
'incap_ses_315_965359':'PRZ8WIgqnhyeicz5PxxfBLFs3lgAAAAAYWoblc6exwqhEeGRPqgA5Q=='
}

response = requests.get('http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100', cookies=cookies)
soup = BeautifulSoup(response.content, "html.parser")
for Reviewer in soup.findAll(attrs={"class": "text_sub_header userStatusWrapper"}):
    print(Reviewer.get_text(strip=True))
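
The cookie values above belong to one browser session, so they will probably stop working at some point. As a rough sketch, and swapping in the standard library's http.cookiejar in place of requests, an opener can store whatever cookies the server sets and resend them on later requests. The helper names build_cookie_opener and fetch_twice are hypothetical, and whether this site accepts cookies obtained this way is an untested assumption.

```python
import http.cookiejar
import urllib.request

URL = "http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"


def build_cookie_opener():
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_twice(opener, url):
    """Hypothetical helper: request url twice so that cookies set by the
    first response are sent along with the second request."""
    with opener.open(url) as first:
        first.read()
    with opener.open(url) as second:
        return second.read()


if __name__ == "__main__":
    opener, jar = build_cookie_opener()
    html = fetch_twice(opener, URL)
    # Show which cookies the server handed out, if any.
    print([cookie.name for cookie in jar])
    print(len(html))
```

The returned bytes can be passed to BeautifulSoup(html, "html.parser") exactly as response.content is in the answer's code.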


The website is not returning the content you see in your browser. Try:

wget -qO- 'http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100'

Or try changing the User-Agent of the request.
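
Changing the User-Agent could look roughly like the sketch below, which reuses urllib.request from the question. The USER_AGENT string and the build_request helper are assumptions made for illustration, and a browser-like header may or may not be enough for this particular site.

```python
import urllib.request

URL = "http://www.airlinequality.com/airline-reviews/vietjetair/?sortby=post_date%3ADesc&pagesize=100"

# Hypothetical browser-like value; substitute the string your own browser sends.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    with urllib.request.urlopen(build_request(URL)) as response:
        html = response.read()
    # Inspect the start of the body to see whether the real page came back.
    print(html[:200])
```

With requests, the same idea is passing headers={"User-Agent": ...} to requests.get.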
