From scraping to a CSV file

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import os
import random
import re
from itertools import cycle


def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')  # cleaning the strings from these terms
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext


def scrape(url, filename, number_id):
    """
    This function scrapes a web page looking for text inside its html structure
    and saves it in a .txt file. So it works only for static content; if you need
    text in a dynamic part of the web page (e.g. a banner) look at the other file.
    Pay attention that the retrieved text must be filtered in order to keep only
    the part you need.

    url: url to scrape
    filename: name of file where to store text
    number_id: it is appended to the filename, to distinguish different filenames
    """
    # here there is a list of possible user agents
    user_agent = random.choice(user_agent_list)
    req = Request(url, headers={'User-Agent': user_agent})
    page = urlopen(req).read()

    # parse the html using beautiful soup and store in variable 'soup'
    soup = BeautifulSoup(page, "html.parser")
    row = soup.find_all(class_="row")

    for element in row:
        viaggio = element.find_all(class_="nowrap")
        Partenza = viaggio[0]
        Ritorno = viaggio[1]
        Viaggiatori = viaggio[2]
        Costo = viaggio[3]
        Title = element.find(class_="taglist bold")
        Content = element.find("p")
        Destination = Title.text
        Review = Content.text
        Departure = Partenza.text
        Arrival = Ritorno.text
        Travellers = Viaggiatori.text
        Cost = Costo.text
        TuristiPerCasoList = [Destination, Review, Departure, Arrival, Travellers, Cost]
        print(TuristiPerCasoList)

1 answer

User

#1 · Posted 2024-09-26 18:14:41

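On every iteration you reassign TuristiPerCasoList, so each row overwrites the previous one.
What you actually need is a list of lists of strings: each string is the value of one cell, each inner list holds the values of one row, and the outer list holds all the rows.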

To do this, create the main list once before the loop and append a list representing each row to it:

# instead of
TuristiPerCasoList = [Destination, Review, Departure, Arrival, Travellers, Cost]
# use
TuristiPerCasoList.append([Destination, Review, Departure, Arrival, Travellers, Cost])
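
Since the goal is a CSV file, the append pattern above can be combined with the standard library's csv module. The following is only a minimal sketch: the `scraped` tuples are made-up placeholder values standing in for the text extracted from the page, and `rows_to_csv` and `HEADER` are hypothetical names that do not appear in the question.

```python
import csv
import io

# Hypothetical column names, taken from the variables used in the question.
HEADER = ["Destination", "Review", "Departure", "Arrival", "Travellers", "Cost"]


def rows_to_csv(rows, header):
    """Serialize a header plus a list of row lists into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)   # first line: column names
    writer.writerows(rows)    # one CSV line per inner list
    return buffer.getvalue()


# Placeholder values standing in for the cells scraped from each "row" element.
scraped = [
    ("Rome", "Lovely trip", "01/05", "08/05", "2", "500"),
    ("Paris", "Rainy, but fun", "10/06", "15/06", "4", "900"),
]

TuristiPerCasoList = []  # created once, before the loop
for destination, review, departure, arrival, travellers, cost in scraped:
    # append one inner list per row instead of reassigning the outer list
    TuristiPerCasoList.append(
        [destination, review, departure, arrival, travellers, cost]
    )

csv_text = rows_to_csv(TuristiPerCasoList, HEADER)
print(csv_text)
```

To write to disk instead of a string, the same `csv.writer` calls can be used on a file opened with `open(path, "w", newline="", encoding="utf-8")`; passing `newline=""` is what the csv module documentation recommends so that line endings are not translated twice.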
