My code is shown below:
import requests                # grabs HTML from a remote file
from bs4 import BeautifulSoup  # core scraping module
import csv                     # writes results to a file
import time                    # adds a delay to avoid overloading the host server
from urllib.request import urlopen

# create a file to hold the scraped data
artist_csv_file = open('artist_data.csv', 'w')
csv_writer = csv.writer(artist_csv_file)

# nominate header columns for the CSV
csv_writer.writerow(['date_text', 'artist', 'track', 'url'])

# read in a list of URLs to scrape (testing with 3 of 500+ pages)
contents = []
with open('Rage-links.csv', 'r') as csvf:  # open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        page = urlopen(url[0]).read()
        rage = BeautifulSoup(page, 'html.parser')
        # identify the page section with artist and track names
        playlist = rage.find('div', class_='comp-rich-text article-text clearfix')
        # open a CSV file to collect results and set the file's column headers
        artist_csv_file = open('artist_data.csv', 'w')
        csv_writer = csv.writer(artist_csv_file)
        csv_writer.writerow(['date_text', 'artist', 'track', 'url'])
        # capture the date from the page
        rage_date = rage.find('div', class_='view-comp-publish-date')
        date_text = rage_date.time.text
        # define the artist; the song title is its next sibling
        artist = playlist.find('strong')
        # track = playlist.strong.next_sibling
        # loop through the HTML and write the date, artist and track data
        for artist in playlist.find_all('strong'):
            print(date_text)
            print(artist)
            print(artist.next_sibling)
            print(url)
            time.sleep(.1)  # delay the scrape by 0.1 seconds
            csv_writer.writerow([date_text, artist, artist.next_sibling, url])

# close the CSV file
artist_csv_file.close()
All of the data prints to the screen, but only some of it ends up in my CSV file. I'm a beginner, so any help is appreciated.
I wondered whether I was trying to capture too much data at once, as in this post (Pandas prints to screen correctly but saves only some data to csv), but the explanation there is beyond my skill level.
The line `artist_csv_file = open('artist_data.csv', 'w')` inside your loop overwrites the file on every iteration because of the 'w' mode. Try 'a' (append) instead; then the results will be appended to the end of the file on every loop.
You should also initialize the file before the loop to add the column headers; otherwise it will write a fresh header row on every iteration.
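A minimal sketch of the corrected structure, reusing the file and header names from your question. The scraping itself is stubbed out with hypothetical sample rows so the shape of the fix is clear: open the file once before the loop, write the header once, and only call `writerow()` inside the loop.

```python
import csv

# Stand-in rows; in your script these come from the per-URL scraping loop.
rows = [
    ['21 Aug 2021', 'Artist A', 'Track A', 'http://example.com/1'],
    ['28 Aug 2021', 'Artist B', 'Track B', 'http://example.com/2'],
]

# Open the file once, before any loop, and write the header a single time.
# 'w' is fine here because this open() runs only once.
with open('artist_data.csv', 'w', newline='') as artist_csv_file:
    csv_writer = csv.writer(artist_csv_file)
    csv_writer.writerow(['date_text', 'artist', 'track', 'url'])
    for row in rows:  # this stands in for your URL / playlist loops
        csv_writer.writerow(row)

# Every row survives because the file is never reopened with 'w' mid-loop.
```

Using a `with` block also closes the file for you, so the separate `artist_csv_file.close()` at the end is no longer needed.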
Hope that helps. Have fun.