Scraping with Python

Posted 2024-05-08 23:13:30


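I wrote some Python code to scrape data from TripAdvisor (the ratings attached to reviews). The problem is that every time I run the code it gives me a different number of rows, and it never scrapes all of the pages.

The index error I get is: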

Traceback (most recent call last):
  File "C:/Users/thimios/PycharmProjects/TripadvisorScrapping/proxiro.py", line 26, in <module>
    rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
IndexError: list index out of range

The code is as follows:

from bs4 import BeautifulSoup
import os
import urllib.request

file2 = open(os.path.expanduser(r"~/Desktop/TripAdviser Reviews2.csv"), "wb")        
file2.write(b"Organization,Rating" + b"\n")

WebSites = [
"https://www.tripadvisor.com/Hotel_Review-g189400-d198932-Reviews-Hilton_Athens-Athens_Attica.html#REVIEWS"]

Checker ="REVIEWS"

# looping through each site until it hits a break
for theurl in WebSites:
    thepage = urllib.request.urlopen(theurl)
    soup = BeautifulSoup(thepage, "html.parser")
    #print(soup)

    while True:
        # Extract ratings from the text reviews
        altarray = ""
        for i in range(0,10):
            rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
            rating1 = rating.find_all("span")[0]
            rating2 = rating1['class'][1][-2:]
            print(rating2)
            if len(altarray) == 0:
                altarray = [rating2]
            else:
                altarray.append(rating2)

            #print(altarray)
            #print(len(altarray))
            #print(type(altarray))

            # Extract Organization,
            Organization1 = soup.find(attrs={'class': 'heading_name'})
            Organization = Organization1.text.replace('"', ' ').replace('Review of',' ').strip()
            #print(Organization)



            # Loop through each review on the page
            for x in range(0, 10):
                Rating = altarray[x]
                Rating = str(Rating)
                #print(Rating)
                #print(type(Rating))

                Record2 = Organization + "," + Rating
                if Checker == "REVIEWS":
                    file2.write(bytes(Record2, encoding="ascii", errors='ignore') + b"\n")

                link = soup.find_all(attrs={"class": "nav next rndBtn ui_button primary taLnk"})
                #print(link)
                #print(link[0])
                if len(link) == 0:
                    break
                else:
                   soup = BeautifulSoup(urllib.request.urlopen("http://www.tripadvisor.com" + link[0].get('href')),"html.parser")
                   #print(soup)
                   #print(Organization)
                   print(link[0].get('href'))
                   Checker = link[0].get('href')[-7:]
                   #print(Checker)

        file2.close()

I suspect TripAdvisor is not giving me full access to the data. Any ideas?


1 answer
User
#1 · Posted 2024-05-08 23:13:30

You get this error because the code tries to access a list element at an index that does not exist.

I ran your code and it printed:
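To make the failure mode concrete, here is a minimal sketch that needs no network access. The list `ratings` is a hypothetical stand-in for whatever `soup.findAll(...)` returns on a page that happens to contain fewer than 10 matching blocks; the values in it are made up for illustration.

```python
# Hypothetical stand-in for the result of soup.findAll(...) on a page
# with only 7 matching review blocks (values are illustrative).
ratings = ["50", "50", "40", "50", "50", "40", "50"]

collected = []
failed_at = None
try:
    # Same pattern as in the question: a fixed range of 10 indices.
    for i in range(0, 10):
        collected.append(ratings[i])
except IndexError:
    # The first index past the end of the list raises IndexError.
    failed_at = i

print(len(collected), failed_at)
```

Any page that returns fewer than 10 matches will fail this way, which would explain why the number of rows differs between runs.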

50
50
50
50
50
50
40
40
40
50

That said, the way you loop is not the most Pythonic, and it is prone to index errors.

What you can do is replace this:

for i in range(0,10):
    rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]

with:

for rating in soup.findAll("div", {'class': 'rating reviewItemInline'}) :

This will also fix the error.
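As a self-contained sketch of that replacement, the snippet below parses a small inline HTML string instead of the live page. The HTML is invented for illustration: the `rating reviewItemInline` class comes from the question, while the `ui_bubble_rating bubble_NN` span classes are an assumption about the markup, chosen so that `['class'][1][-2:]` yields values like the `50` and `40` printed above.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real page structure may differ.
html = """
<div class="rating reviewItemInline"><span class="ui_bubble_rating bubble_50"></span></div>
<div class="rating reviewItemInline"><span class="ui_bubble_rating bubble_40"></span></div>
<div class="rating reviewItemInline"><span class="ui_bubble_rating bubble_50"></span></div>
"""
soup = BeautifulSoup(html, "html.parser")

ratings = []
# Iterate over whatever was found, however many elements that is.
for rating in soup.find_all("div", {"class": "rating reviewItemInline"}):
    span = rating.find("span")
    if span is None:
        # Skip blocks without a rating span instead of indexing blindly.
        continue
    # The second class name ends in the two-digit rating, as in the question.
    ratings.append(span["class"][1][-2:])

print(ratings)
```

Because the loop only visits elements that exist, a page with 3 reviews or with 10 is handled the same way and no index can go out of range.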
