TripAdvisor scraping Python script is exporting multiple different versions of rows

    import csv
    import sys
    import time

    from selenium import webdriver

    # default path to file to store data
    path_to_file = r"D:\Documents\Archaeology\Projects\Patmos\scraped\monastery6.csv"

    # default number of scraped pages
    num_page = 5

    # default tripadvisor website of hotel or things to do (attraction/monument)
    url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

    # if you pass the inputs in the command line
    if (len(sys.argv) == 4):
        path_to_file = sys.argv[1]
        num_page = int(sys.argv[2])
        url = sys.argv[3]

    # import the webdriver -- NMS 20210705
    driver = webdriver.Chrome("C:/Users/nsusm/AppData/Local/Programs/Python/Python39/webdriver/bin/chromedriver.exe")
    driver.get("https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html")

    # open the file to save the review
    csvFile = open(path_to_file, 'a', encoding="utf-8")
    csvWriter = csv.writer(csvFile, delimiter=',')
    csvWriter.writerow(['title', 'rating', 'review', 'date'])

    # change the value inside the range to save more or less reviews
    for i in range(0, 48, 1):

        # expand the review
        time.sleep(2)

        # define container (this is the whole box of the Trip Advisor review, excluding the date of the review)
        container = driver.find_elements_by_xpath(".//div[@class='review-container']")

        # grab also the date of review
        date = driver.find_elements_by_xpath(".//class[@class='prw_reviews_stay_date_hsx']")

        for j in range(len(container)):
            rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
            title = container[j].find_element_by_xpath(".//div[contains(@class, noQuotes)]").text.replace("\n", " ")
            review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", " ")
            date = " ".join(date[j].text.split(" ")[-2:])

            # write data into csv
            csvWriter.writerow([title, rating, review, date])

        # change the page
        driver.find_element_by_xpath('.//a[@class="nav next ui_button primary"]').click()

    # quit selenium
    driver.quit()

    # FYI you need to close all windows for the file to write

1 answer

User

#1 · Posted 2024-09-29 23:24:45

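That date locator comes back empty, so `date[j]` has nothing to index. The review date sits inside the container, so you can pull it out along with the other fields: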
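To illustrate the failure mode without a browser, here is a minimal sketch that uses plain Python lists as stand-ins for what the `find_elements_*` calls return. The helper name `pick_date` and the sample date string are hypothetical and not taken from the live page. A locator that matches nothing yields an empty list, so indexing it with `[j]` raises `IndexError`. Note also that the question's loop rebinds `date` from the list to a string, so even a non-empty result would stop working on the second pass.

```python
def pick_date(dates, j):
    """Return the last two words of dates[j], or None when index j is missing."""
    if j >= len(dates):
        # The locator matched nothing, or fewer items than there are containers.
        return None
    return " ".join(dates[j].split(" ")[-2:])


# Hypothetical stand-in: the date locator matched no elements.
empty_result = []
print(pick_date(empty_result, 0))

# Hypothetical stand-in: one date string per review container.
sample_dates = ["Date of experience: October 2020"]
print(pick_date(sample_dates, 0))
```

Guarding the index like this only hides the symptom; looking the date up inside each container keeps every row's fields aligned.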

    rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
    person = container[j].find_element_by_class_name('info_text').text.split("\n")[0]#person but not place
    title = container[j].find_element_by_css_selector('span.noQuotes').text.replace("\n", "  ")
    review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
    review_date = container[j].find_element_by_class_name('ratingDate').text[9:]

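Changes: the title now targets just the span rather than the whole div. Added code to find the person (the location is on the second line of that text). The date is now found inside the container, with the leading "Reviewed" label stripped off.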
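The two string manipulations in that snippet can be checked in isolation. The sketch below assumes the class attribute and date text look like the sample strings shown; those strings and the helper names `rating_from_class` and `strip_reviewed` are illustrative assumptions, not values captured from TripAdvisor.

```python
def rating_from_class(class_attr):
    """Extract the bubble score using the same split as the snippet above."""
    # "ui_bubble_rating bubble_50" splits into
    # ["ui", "bubble", "rating bubble", "50"], so index 3 is the score.
    return class_attr.split("_")[3]


def strip_reviewed(text, prefix="Reviewed "):
    """Drop a leading 'Reviewed ' label if present, else return text unchanged."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


# Assumed sample values for illustration only.
print(rating_from_class("ui_bubble_rating bubble_50"))
print(strip_reviewed("Reviewed October 12, 2020"))
```

Compared with the fixed slice `text[9:]`, checking for the prefix first avoids chopping characters off a date that arrives without the label.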
