Writing to a CSV file after BeautifulSoup

Published 2024-05-19 17:03:50


I'm using BeautifulSoup to extract some text, and then I want to save those entries to a CSV file. My code is as follows:

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    saveFile = open("some.csv", "a")
    saveFile.write(str(tdTags_string) + ",")
    saveFile.close()

saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()

It mostly does what I want, except that when an entry contains a comma (","), the comma is treated as a delimiter and the single entry is split across two cells (which is not what I want). So I searched online, found suggestions to use the csv module, and changed my code to:

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    with open("some.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow(tdTags_string)  # string passed bare, not in a list

This made things worse: now every single letter or digit takes up its own cell in the CSV file. For example, if the entry is "Cat", then "C" lands in one cell, "a" in the next, "t" in a third, and so on.
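(For reference: when each row is passed to csv.writer as a list, a field that itself contains a comma is quoted rather than split across cells. A minimal sketch, using an in-memory buffer in place of a file:)

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# A field containing a comma is quoted, so it stays in one cell
writer.writerow(["Smith, John", "California"])
print(buf.getvalue().strip())  # -> "Smith, John",California
```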

Edited version:

import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)

    with open("SomeSite.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow([tdTags_string])

Second version:

placeHolder = []

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    placeHolder.append(tdTags_string)

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Updated output:

u'stuff1'
u'stuff2'
u'stuff3'

Sample output:

u'record1'  u'31 Mar 1901'  u'California'

u'record1'  u'31 Mar 1901'  u'California'

record1     31-Mar-01       California

Another edited version of the code (still one problem: it skips a line, as shown below):

import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")

placeHolder = []

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    #print repr(tdTags_string)
    placeHolder.append(tdTags_string.rstrip('\n'))

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

2 Answers

For the line-skipping problem, I found the answer. Instead of

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

use this:

with open("SomeSite.csv", "ab") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Source: https://docs.python.org/3/library/functions.html#open. "a" is append mode, while "ab" appends and opens the file in binary mode, which solved the skipped-line problem.
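(Note that the "ab" fix applies to Python 2, which this code uses. Under Python 3, csv.writer expects a text-mode file, and the usual fix for the blank lines is to pass newline="" to open. A sketch of the Python 3 equivalent, reusing the SomeSite.csv filename and a sample row from the question:)

```python
import csv

# Python 3 equivalent of the "ab" fix: keep text mode, but pass
# newline="" so csv.writer's own "\r\n" row terminator is not
# translated again by the file object (the cause of the blank lines).
with open("SomeSite.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["record1", "31 Mar 1901", "California"])
```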

with open("some.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow([tdTags_string]) # put in a list

writeFile.writerow iterates over whatever you pass it, so the string "foo" becomes three separate values: f, o, o. Wrapping it in a list prevents that, because the writer then iterates over the list instead of over the string.
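(A minimal sketch of that difference, using an in-memory buffer in place of the file:)

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

writer.writerow("Cat")    # bare string: iterated character by character
writer.writerow(["Cat"])  # wrapped in a list: written as one cell

print(buf.getvalue().splitlines())  # -> ['C,a,t', 'Cat']
```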

You should open the file once, rather than reopening it on every pass through the loop:

with open("some.csv", "a") as f:
    writeFile = csv.writer(f)
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.get_text(strip=True)
        writeFile.writerow([tdTags_string])
