TripAdvisor scraping: Python script exports rows in several different formats

Posted 2024-09-29 23:24:45


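I am writing this script for an academic research paper. I am an absolute beginner, self-taught, and have cobbled it together.

What I want: a CSV of roughly 560 rows, with one column each for date (mdyyyy), review, rating, and username (username is not in the script yet; it is listed for reference only).

I have it running without errors, but the output is wrong. I get thousands of rows. The script loops and writes the data in several formats: 1) 500-ish rows with month/date and review; 2) 500-ish rows with rating and review; 3) 500-ish rows with name, date, and review all in one column... and so on.

I spent hours trying to fix that, and now I have another problem: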

Traceback (most recent call last):
  line 49, in
    date = " ".join(date[j].text.split(" ")[-2:])
IndexError: list index out of range

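I am running this on Python 3.9.6, if that makes a difference.

I have three questions:

  1. How do I fix the out-of-range error on the date?

  2. Is there an obvious mistake in the script that makes it create thousands of rows in different formats?

  3. How do I add the username? I have tried, but I can't seem to find the right XPath. This is the site I am scraping: https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html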

import csv
import sys
from selenium import webdriver
import time

# default path to file to store data
path_to_file = "D:\Documents\Archaeology\Projects\Patmos\scraped\monastery6.csv"

# default number of scraped pages
num_page = 5

# default tripadvisor website of hotel or things to do (attraction/monument) 
url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"
#url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# if you pass the inputs in the command line
if (len(sys.argv) == 4):
    path_to_file = sys.argv[1]
    num_page = int(sys.argv[2])
    url = sys.argv[3]

# import the webdrive -- NMS 20210705
driver = webdriver.Chrome("C:/Users/nsusm/AppData/Local/Programs/Python/Python39/webdriver/bin/chromedriver.exe")
driver.get("https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html")

# open the file to save the review
csvFile = open(path_to_file, 'a', encoding="utf-8")
csvWriter = csv.writer(csvFile, delimiter=',')
csvWriter.writerow(['title', 'rating', 'review', 'date'])

# change the value inside the range to save more or less reviews
for i in range(0, 48, 1):

    # expand the review 
    time.sleep(2)

# define container (this is the whole box of the Trip Advisor review, excluding the date of the review)
    container = driver.find_elements_by_xpath(".//div[@class='review-container']")
    
# grab also the date of review
    date = driver.find_elements_by_xpath(".//class[@class='prw_reviews_stay_date_hsx']")

    for j in range(len(container)):

        rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
        title = container[j].find_element_by_xpath(".//div[contains(@class, noQuotes)]").text.replace("\n", "  ")
        review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
        date = " ".join(date[j].text.split(" ")[-2:])
                                                  
#write data into csv
        csvWriter.writerow([title, rating, review, date])
        
# change the page            
    driver.find_element_by_xpath('.//a[@class="nav next ui_button primary"]').click()

# quit selenium
driver.quit()
                                                  
# FYI: you need to close all windows for the file to write



1 answer
User
#1 · Posted 2024-09-29 23:24:45

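The date lookup comes back empty, so date[j] has nothing to index. The review date is inside the container, so you can read it there along with everything else: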
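The failure can be reproduced without a browser. The sketch below uses hypothetical stand-ins (`containers`, `dates`, the helper `last_two_words`, and the sample text are illustrations, not part of the original script) to show that indexing an empty result list raises the same IndexError, and that a bounds check avoids it.

```python
from types import SimpleNamespace


def last_two_words(elements, j):
    """Return the last two words of elements[j].text, or "" if j is out of range."""
    if j >= len(elements):
        return ""
    return " ".join(elements[j].text.split(" ")[-2:])


# Hypothetical stand-ins for Selenium results: three containers were found,
# but the date lookup matched nothing, so the parallel list is empty.
containers = ["review 1", "review 2", "review 3"]
dates = []

try:
    dates[0]  # same failure as date[j] in the question
    error = None
except IndexError as exc:
    error = str(exc)

print(error)  # list index out of range

# With a populated list the same expression works as intended.
found = [SimpleNamespace(text="Date of experience: October 2020")]
print(last_two_words(found, 0))  # October 2020

# With the empty list the guarded helper returns "" instead of raising.
print(repr(last_two_words(dates, 0)))  # ''
```

Note also that `date = " ".join(date[j]...)` in the question rebinds `date` from the list to a string, so even a non-empty list would only work for the first pass of the inner loop. Reading the date from each container, as in the lines that follow, removes the parallel list altogether.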

    rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
    person = container[j].find_element_by_class_name('info_text').text.split("\n")[0]#person but not place
    title = container[j].find_element_by_css_selector('span.noQuotes').text.replace("\n", "  ")
    review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
    review_date = container[j].find_element_by_class_name('ratingDate').text[9:]

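Changes: the title selector now targets just the span rather than the whole div. Added code to find the person (the location is on the second line of that element). The date is now found inside the container, with the leading "Reviewed" stripped.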
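The string handling in those lines can be checked on its own, without Selenium. This is a minimal sketch under stated assumptions: the helper names (`parse_rating`, `parse_person`, `parse_review_date`, `write_rows`), the sample strings, and the temporary file path are all hypothetical, and the column order follows the one the question asks for.

```python
import csv
import os
import tempfile


def parse_rating(class_attr):
    """Return the number after 'bubble_', e.g. 'ui_bubble_rating bubble_50' -> '50'."""
    return class_attr.split("_")[3]


def parse_person(info_text):
    """Return the first line of the info_text block (the reviewer, not the place)."""
    return info_text.split("\n")[0]


def parse_review_date(rating_date_text):
    """Strip a leading 'Reviewed ' if present; otherwise return the text unchanged."""
    prefix = "Reviewed "
    if rating_date_text.startswith(prefix):
        return rating_date_text[len(prefix):]
    return rating_date_text


def write_rows(path, rows):
    """Write a header plus rows, replacing any earlier contents of the file."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "review", "rating", "username"])
        writer.writerows(rows)


# Hypothetical scraped strings, standing in for Selenium attribute/text values.
row = [
    parse_review_date("Reviewed October 12, 2020"),
    "Worth the climb",
    parse_rating("ui_bubble_rating bubble_50"),
    parse_person("traveller123\nAthens, Greece"),
]

path = os.path.join(tempfile.mkdtemp(), "reviews.csv")
write_rows(path, [row])
write_rows(path, [row])  # a second run overwrites instead of appending

with open(path, encoding="utf-8", newline="") as handle:
    saved = list(csv.reader(handle))
print(saved)
```

The question's script opens its file in `'a'` mode and writes the header on every run, so each run appends to whatever earlier versions of the script wrote. That may account for some of the mixed-format rows, although nothing in the thread confirms it; opening with `'w'` as above is one way to rule it out.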
