Python web scraping with BeautifulSoup & CSV

Posted 2024-09-29 23:29:36


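I want to extract the cost-of-living differences between one home city and many other cities. I plan to list the cities I want to compare in a CSV file and use that list to build the web link that takes me to the page with the information I am looking for.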

Here is a sample link: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

Unfortunately, I have run into several challenges. Any help with the following would be greatly appreciated!

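  1. The output only shows the percentage, not whether the city is more expensive or cheaper. For the example above, the current code outputs 48%, 129%, 63%, 43%, 42%, and 42%. To correct this, I added an 'if statement' that adds a '+' sign if the city is more expensive and a '-' sign if it is cheaper. However, this 'if statement' does not work correctly.
  2. When I write the data to the CSV file, each percentage is written on a new line. I can't figure out how to write them all on one line.
  3. (Related to item 2) When I write the data for the example above to the CSV file, it is written in the format listed below. How can I correct the format and write the data in the preferred format listed below (also without the percent sign)?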

Current CSV format (note: the 'if statement' does not work correctly):

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%

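Preferred CSV format (one row per city, signed values, no percent sign):

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48,+129,+63,+43,+42,+42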

Here is my current code:

import requests
import csv
from bs4 import BeautifulSoup

#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'

i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page  = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)

    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])

    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")

    for expatistan_title in expatistan_titles:
            percent_difference = expatistan_title.find("th",class_="percent")
            percent_difference_title = percent_difference.span['class']
            if percent_difference_title == "expensiver":
                WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
            else:
                WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1

3 answers

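csv.writer.writerow() takes a sequence and makes each element a column. Normally you would give it a list of column values, but you are passing a string, so each individual character becomes its own column.

Just build a list, then write that list to the CSV file.

First, open the CSV file once rather than for each individual city; every time you open it in write mode, you clear the file.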

import requests
import csv
from bs4 import BeautifulSoup

HomeCity = 'Phoenix'

# Python 3: open the CSV in text mode with newline="" (Python 2 used "wb" here)
with open("City.txt") as cities, open("Expatistan.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["City", "Food", "Housing", "Clothes",
                     "Transportation", "Personal Care", "Entertainment"])

    for line in cities:
        city = line.strip()
        url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
            HomeCity, city)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)

        titles = soup.select("table.comparison tr.expandable")

        row = [city]
        for title in titles:
            percent_difference = title.find("th", class_="percent")
            changeclass = percent_difference.span['class']
            change = percent_difference.span.string
            if "expensiver" in changeclass:
                change = '+' + change
            else:
                change = '-' + change
            row.append(change)
        writer.writerow(row)
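
One detail the loop above does not cover is a blank line in City.txt, which would produce a URL with an empty city segment. The following is a minimal sketch of one way to guard against that, not part of the original answer: build_urls is a hypothetical helper, io.StringIO stands in for the real file, the city names in the sample are made up, and lowercasing the home city is only an assumption based on the lowercase sample link in the question.

```python
import io

BASE_URL = "http://www.expatistan.com/cost-of-living/comparison/{}/{}"


def build_urls(lines, home_city):
    """Yield (city, url) pairs, skipping blank lines (hypothetical helper)."""
    for line in lines:
        city = line.strip()
        if not city:
            # An empty line would otherwise yield a URL ending in "/"
            continue
        # Lowercasing mirrors the sample link; whether the site needs it is an assumption
        yield city, BASE_URL.format(home_city.lower(), city)


# io.StringIO stands in for open("City.txt"); the contents are illustrative only
sample = io.StringIO("new-york-city\n\nchicago\n")
pairs = list(build_urls(sample, "Phoenix"))
for city, url in pairs:
    print(city, url)
```

The same generator could be fed the real file object in place of the StringIO sample.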

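Answers:

  • Problem 1: the span's class is a list, so you need to check whether expensiver is in that list. In other words, replace:

    if percent_difference_title == "expensiver"

    with:

    if "expensiver" in percent_difference.span['class']

  • Problems 2 and 3: you need to pass a list of column values to writerow(), not a string. And since you want only one record per city, call writerow() outside of the loop (over the trs).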
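
The list-versus-string comparison from Problem 1 can be checked without touching the network. This is only a sketch: span_classes is a hand-written stand-in for what percent_difference.span['class'] returns, assuming the span carries a single class named expensiver.

```python
# Stand-in for percent_difference.span['class']; BeautifulSoup returns
# multi-valued attributes such as class as a list of strings.
span_classes = ["expensiver"]

# Comparing a list to a string is never equal, so the original test always fails
equals_string = (span_classes == "expensiver")

# A membership test asks whether the class name is one of the list's elements
contains_string = ("expensiver" in span_classes)

print(equals_string)
print(contains_string)
```

Because the equality test is always false, the original code always fell through to the else branch and wrote a '-' sign.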

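Other issues:

  • open the csv file for writing before the loop
  • use the with context manager while dealing with files
  • try to follow the PEP 8 style guide

Here is the code with modifications: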

import requests
import csv
from bs4 import BeautifulSoup

BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'

with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            soup = BeautifulSoup(requests.get(url).text)

            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)

For a City.txt containing just one new-york-city line, it produces an Expatistan.csv with the following content:

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%

Make sure you understand what changes I have made. Let me know if you need further help.
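
The output above still carries the percent signs that the question asked to drop. A minimal sketch of one way to strip them follows, assuming the span text looks like "48%". signed_change is a hypothetical helper, not part of the answer, and the class name "cheaper" is made up for illustration; only "expensiver" appears in the source.

```python
def signed_change(text, classes):
    """Return e.g. '+48' for text '48%' when 'expensiver' is among classes."""
    # Drop surrounding whitespace, then any trailing percent sign
    number = text.strip().rstrip("%")
    sign = "+" if "expensiver" in classes else "-"
    return sign + number


# Sample values taken from the question's example output
row = ["new-york-city"] + [signed_change(t, ["expensiver"]) for t in ["48%", "129%"]]

# "cheaper" is a made-up class name standing in for anything other than "expensiver"
cheaper = signed_change(" 12% ", ["cheaper"])

print(row)
print(cheaper)
```

In the loop above, differences.append(signed_change(percent_difference.span.string, percent_difference.span['class'])) would replace the if/else pair.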

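So, first of all, you pass an iterable to the writerow method, and each object in that iterable is written out separated by commas. So if you give it a string, each character gets separated: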

WriteResultsFile.writerow('hello there')

h,e,l,l,o, ,t,h,e,r,e

But

WriteResultsFile.writerow(['hello', 'there'])

hello,there

That is why you get results like

n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
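
The same behaviour can be reproduced in memory with the string the question's code builds. This is a small sketch: to_csv_line is a hypothetical helper, and io.StringIO stands in for the real output file.

```python
import csv
import io


def to_csv_line(row):
    """Write one row with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


# A string is iterated character by character, so every character is a column
as_string = to_csv_line("new-york-city" + "-" + "48%")

# A list is iterated element by element, so every element is a column
as_list = to_csv_line(["new-york-city", "-48%"])

print(as_string, end="")
print(as_list, end="")
```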

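The rest of your problems are errors in your scraping code. First, when I scraped the site, searching for the table with class_="comparison" returned None. So I had to use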

expatistan_table = soup_expatistan.find("table","comparison")

Now, your 'if statement' is broken because

percent_difference.span['class']

returns a list. If we change it to

percent_difference.span['class'][0]

things will work the way you expect.

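Now, your real problem is that inside the innermost loop you are finding the percentage change in price for individual items. You want these as items in one row of price differences, not as individual rows. So I declare an empty list items, append percent_difference.span.string to it, and then write the row outside the innermost loop, like so: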

items = []
for expatistan_title in expatistan_titles:
    percent_difference = expatistan_title.find("th","percent")
    # span["class"] is a list of class names; take the first one
    percent_difference_title = percent_difference.span["class"][0]
    print(percent_difference_title)
    if percent_difference_title == "expensiver":
        items.append('+' + percent_difference.span.string)
    else:
        items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)

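The final error is that you reopen the csv file inside the while loop and overwrite everything, so you end up with only the last city. Accounting for all of these errors (many of which you should have been able to find without help) leaves us with: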

#Prepare CSV writer once, before the loop, so earlier rows are not overwritten.
WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
#Write the header a single time rather than once per city.
WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])

i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page  = requests.get(url).text
    print(url)
    soup_expatistan = BeautifulSoup(page)

    expatistan_table = soup_expatistan.find("table","comparison")
    expatistan_titles = expatistan_table.find_all("tr","expandable")

    #Collect every category's change for this city into one list.
    items = []
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th","percent")
        percent_difference_title = percent_difference.span["class"][0]
        print(percent_difference_title)
        if percent_difference_title == "expensiver":
            items.append('+' + percent_difference.span.string)
        else:
            items.append('-' + percent_difference.span.string)
    #One row per city: the city name followed by its six changes.
    row = [Textfilelistsplit[i]]
    row.extend(items)
    WriteResultsFile.writerow(row)
    i+=1
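
The overwrite problem can be seen in isolation with a throwaway file. The sketch below is illustrative only: the temporary path and the two city names are made up, and no scraping is involved.

```python
import csv
import os
import tempfile

cities = ["new-york-city", "chicago"]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Reopening with "w" inside the loop truncates the file on every pass
for city in cities:
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow([city])
with open(path, newline="") as f:
    reopened_rows = list(csv.reader(f))

# Opening once before the loop keeps every row
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for city in cities:
        writer.writerow([city])
with open(path, newline="") as f:
    single_open_rows = list(csv.reader(f))

print(reopened_rows)
print(single_open_rows)
```

Only the last city survives in the first case, which matches the symptom described above.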
