Unicode problem when writing to a CSV file

Posted 2024-09-28 01:31:04


I need some guidance. I'm using the following code:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")
i = 0
schools = []

for school in reqSoup:
    x = reqSoup.find_all("a", {"class" : "school-name"})
    while i < len(x):
        for name in x:
            y = x[i].get_text()
            i += 1
            schools.append(y)

with open('usnwr_schools.csv', 'wb') as f:
    writer = csv.writer(f)
        for y in schools:
        writer.writerow([y])

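My problem is that the em dashes show up as raw UTF-8 in the resulting CSV file. I've tried several different ways to fix it, but nothing seems to work (including attempting to use regex to get rid of them, and trying a .translate method from several years ago that I found in a StackOverflow question).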

What am I missing? I'd like the CSV output to contain only the text, minus the dashes.

I'm using Python 3.5 and am fairly new to Python.


2 Answers

Learn to embrace Unicode... the world isn't ASCII anymore.

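Assuming you are on Windows and viewing the .CSV with Excel or Notepad, use the following line on Python 3. With just this change (and the post's indentation fixed), you will even see the non-ASCII characters displayed correctly. Notepad and Excel like a UTF-8 BOM signature at the start of the file, and utf-8-sig provides it.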

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
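
To see what utf-8-sig actually adds, here is a minimal sketch that writes the same row with both encodings and compares the raw bytes. The helper write_rows, the temporary file names, and the sample row are assumptions made for illustration; they are not part of the original post.

```python
import csv
import os
import tempfile


def write_rows(path, rows, encoding):
    """Write one-column rows to a CSV file opened in text mode."""
    with open(path, 'w', newline='', encoding=encoding) as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row])


# Hypothetical sample row containing an em dash (U+2014)
rows = ['University of Michigan\N{EM DASH}Ann Arbor']

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, 'with_bom.csv')
    without_bom = os.path.join(tmp, 'without_bom.csv')
    write_rows(with_bom, rows, 'utf-8-sig')
    write_rows(without_bom, rows, 'utf-8')

    # Read both files back as raw bytes to inspect what was written
    with open(with_bom, 'rb') as f:
        bom_bytes = f.read()
    with open(without_bom, 'rb') as f:
        plain_bytes = f.read()

# utf-8-sig prepends the three-byte signature EF BB BF; plain utf-8 does not
print(bom_bytes[:3])
print(plain_bytes[:3])
```

Apart from those three leading bytes the two files are identical, so the choice only matters for programs that rely on the signature to detect UTF-8.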

If you read the file in another Python script, make sure to read it with the following line. The example you showed, b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor', was read in binary mode 'rb'.

with open('usnwr_schools.csv', encoding='utf-8-sig') as f:

If you are on Linux, you can use utf8 instead of utf-8-sig.
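
As a rough illustration of why the binary read looked odd, the bytes quoted above can be decoded by hand. This sketch only uses the byte string from the question; the variable names are made up for the example.

```python
import unicodedata

# Bytes exactly as quoted from the binary-mode ('rb') read
raw = b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor'

# Decoding as UTF-8 turns the escape sequences back into characters
text = raw.decode('utf-8')

# Name every non-ASCII character that was hiding in the bytes
names = [unicodedata.name(c) for c in text if ord(c) > 127]
print(names)

# A file written with utf-8-sig starts with a BOM. Decoding with
# utf-8-sig strips it, while plain utf-8 keeps it as U+FEFF.
with_sig = (b'\xef\xbb\xbf' + raw).decode('utf-8-sig')
without_sig = (b'\xef\xbb\xbf' + raw).decode('utf-8')
print(with_sig == text)
print(repr(without_sig[0]))
```

So the text contains two non-ASCII characters, an em dash and a zero-width space, which is why stripping the dash alone does not leave pure ASCII.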

By the way, you can replace the loop with:

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    # find_all already searches the whole document, so an outer loop over
    # reqSoup would only write the same rows once per top-level node
    for item in reqSoup.find_all("a", {"class" : "school-name"}):
        writer.writerow([item.get_text()])

Reading it back:

with open('usnwr_schools.csv',encoding='utf-8-sig') as f:
    print(f.read())

Output:

Massachusetts Institute of Technology
Stanford University
University of California—​Berkeley
California Institute of Technology
Carnegie Mellon University
University of Michigan—​Ann Arbor
Georgia Institute of Technology
University of Illinois—​Urbana-​Champaign
Purdue University—​West Lafayette
University of Texas—​Austin (Cockrell)
Texas A&M University—​College Station (Look)
Cornell University
University of Southern California (Viterbi)
Columbia University (Fu Foundation)
University of California—​Los Angeles (Samueli)
University of California—​San Diego (Jacobs)
Princeton University
Northwestern University (McCormick)
University of Pennsylvania
Johns Hopkins University (Whiting)
Virginia Tech
University of California—​Santa Barbara
Harvard University
University of Maryland—​College Park (Clark)
University of Washington

If you still want ASCII only, you can do the following:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

# Map en and em dashes to a plain hyphen and delete zero-width spaces
replacements = {ord('\N{EN DASH}'):'-',
                ord('\N{EM DASH}'):'-',
                ord('\N{ZERO WIDTH SPACE}'):None}

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")

# encoding='ascii' raises UnicodeEncodeError if any other non-ASCII
# character is left after the translation
with open('usnwr_schools.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    # find_all already searches the whole document, so no outer loop is needed
    for item in reqSoup.find_all("a", {"class" : "school-name"}):
        y = item.get_text()
        writer.writerow([y.translate(replacements)])

with open('usnwr_schools.csv',encoding='ascii') as f:
    print(f.read())
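
The translation table can be tried without fetching the page. The following is a small sketch on a hand-built sample string (an assumption modelled on the bytes shown in the question), including what happens when a character outside the table shows up.

```python
# Same mapping as above: dashes become '-', zero-width spaces are deleted
replacements = {ord('\N{EN DASH}'): '-',
                ord('\N{EM DASH}'): '-',
                ord('\N{ZERO WIDTH SPACE}'): None}

# Hypothetical sample mirroring one scraped school name
sample = 'University of Michigan\N{EM DASH}\N{ZERO WIDTH SPACE}Ann Arbor'

cleaned = sample.translate(replacements)
print(cleaned)

# The cleaned string now encodes to ASCII without error
cleaned.encode('ascii')

# A character that is not in the table survives translate(), so an
# ASCII-encoded file would still fail on it
failed = False
try:
    'Universit\N{LATIN SMALL LETTER E WITH ACUTE}'.translate(replacements).encode('ascii')
except UnicodeEncodeError:
    failed = True
print(failed)
```

In other words, the table only covers the characters you list, so it works for this page but may need more entries for other data.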

To remove the dashes, try y.replace("—","-").replace("–","-") (the first call replaces an em dash with a hyphen-minus, the second does the same for an en dash).

If you only need ASCII code points, you can use

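import string
# string.printable already contains the ASCII whitespace characters
whitelist = set(string.printable)
def clean(s):
    # Keep only characters that appear in the whitelist
    return "".join(c for c in s if c in whitelist)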

(For plain English text, this gives reasonable results in most cases.)
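
One thing worth checking with the whitelist approach is what happens to the words on either side of a dropped dash. Below is a minimal sketch on an assumed sample string; clean_keep_dash is a hypothetical helper that combines the replace() suggestion above with the whitelist filter.

```python
import string

# Printable ASCII, which already includes whitespace
allowed = set(string.printable)


def clean(s):
    """Drop every character that is not printable ASCII."""
    return "".join(c for c in s if c in allowed)


def clean_keep_dash(s):
    """Hypothetical variant: turn dashes into '-' before filtering."""
    s = s.replace('\N{EM DASH}', '-').replace('\N{EN DASH}', '-')
    return clean(s)


# Hypothetical sample mirroring one scraped school name
sample = 'University of Michigan\N{EM DASH}\N{ZERO WIDTH SPACE}Ann Arbor'

print(clean(sample))
print(clean_keep_dash(sample))
```

The plain filter glues "Michigan" and "Ann" together, so replacing the dashes first is probably the better fit if the names should stay readable.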

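By the way, try opening the file in text mode ('w') instead, because in Python 3 csv.writer takes a text file, not a binary file as it did in Python 2 (you opened it in binary mode ("wb")).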
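
The difference between the two modes can be reproduced without touching the disk. In this sketch the in-memory buffers are stand-ins (an assumption for the example): io.BytesIO plays the role of a file opened with 'wb', and io.StringIO the role of one opened with 'w' and newline=''.

```python
import csv
import io

# A binary buffer behaves like a file opened in 'wb' mode
binary_file = io.BytesIO()
error = None
try:
    csv.writer(binary_file).writerow(['Stanford University'])
except TypeError as exc:
    # csv.writer hands str to write(), which a binary file rejects
    error = str(exc)
print('binary mode failed:', error)

# A text buffer behaves like a file opened in 'w' mode with newline=''
text_file = io.StringIO(newline='')
csv.writer(text_file).writerow(['Stanford University'])
print(repr(text_file.getvalue()))
```

On Python 3 the binary case fails on the first writerow call, while the text case produces the row followed by the csv module's default line terminator.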
