Problem cleaning a CSV table when web scraping

Posted 2024-09-29 17:21:10


I'm trying to scrape some data from a table. I get the results I expect, but I can't find a way to save them as a clean CSV table. Here is the code, followed by the output I get and the output I want. Any suggestions?

from bs4 import BeautifulSoup
import urllib.request  # web access
import urllib.error
import csv
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
try:
    page = urllib.request.urlopen(url)  # connect to the website
except urllib.error.URLError:
    print("Ups!")
soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})

for li in content_lis:
    con = li.get_text("#",strip=True).split("\n")[0]
    print(con)

This gives me nicely formatted output:

Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil
Senoculus barroanus#Chickering, 1941#|#| Panama
Senoculus bucolicus#Chickering, 1941#|#| Panama

But I need something like this (a CSV separated by semicolons or tabs):

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama

How do I remove the "|" characters and the extra spaces? Any suggestions?

Regards
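For reference, the requested cleanup fits in one small helper: split on the "#" separator, drop the "|" markers, strip whitespace, and rejoin. A minimal sketch using one of the sample lines above (`clean_row` is a hypothetical name, not from the original code):

```python
def clean_row(raw, sep=";"):
    """Split on '#', remove '|' markers and stray whitespace, rejoin with sep."""
    parts = [p.strip() for p in raw.replace("|", "").split("#")]
    return sep.join(p for p in parts if p)  # skip fields left empty

row = "Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil"
print(clean_row(row))  # Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
```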


3 Answers

This code works on the sample data set:

lst=[
'Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil',
'Senoculus barroanus#Chickering, 1941#|#| Panama',
'Senoculus bucolicus#Chickering, 1941#|#| Panama'
]

lst2 = [s.replace('|',"").split('#') for s in lst]

lst3=[]

for s in lst2:
    lst3.append(';'.join(sx.strip() for sx in s).replace(';;', ';'))

for s in lst3:
    print(s)

Output:

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil 
Senoculus barroanus;Chickering, 1941;Panama 
Senoculus bucolicus;Chickering, 1941;Panama

Update, based on the asker's comment:

Add one line to the final loop:

for li in content_lis:
    con = li.get_text("#",strip=True).split("\n")[0]
    con = ';'.join(sx.strip() for sx in con.replace('|',"").split('#')).replace(';;',';') # add this line
    print(con)
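To go from printing to an actual file, the cleaned fields can also be handed to the `csv` module (already imported in the question), which takes care of the delimiter and any quoting. A sketch using the sample lines from the question; the filename `species.csv` is an assumption:

```python
import csv

# sample lines in the scraped format (taken from the question's output)
raw_lines = [
    "Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil",
    "Senoculus barroanus#Chickering, 1941#|#| Panama",
]

with open("species.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    for raw in raw_lines:
        # split on '#', drop the '|' markers and empty fields, strip spaces
        fields = [p.strip() for p in raw.replace("|", "").split("#") if p.strip()]
        writer.writerow(fields)
```

Passing `newline=""` to `open()` is the documented way to let `csv.writer` control line endings itself.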

Hi, I took a look. In my opinion it would be better to find the path to each piece of information you want, since the current approach picks up other things you may not want. I edited it to separate with commas and remove the bars, but some small issues remain:

from bs4 import BeautifulSoup
import urllib.request  # web access
import urllib.error
import csv
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
try:
    page = urllib.request.urlopen(url)  # connect to the website
except urllib.error.URLError:
    print("Ups!")
soup = BeautifulSoup(page, 'html.parser')

#regex = re.compile('^speciesTitle')

for div in soup.find_all('div', attrs={'class': "speciesTitle"}):
    con = div.get_text(',',strip=True).split("\n")[0].replace('|,|','')
    print(con)
    

Try this:

from bs4 import BeautifulSoup
import urllib.request  # web access
import urllib.error
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
try:
    page = urllib.request.urlopen(url)  # connect to the website
except urllib.error.URLError:
    print("Ups!")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})
file = ''
for cl in content_lis:
    a = cl.select_one('div a strong i')
    b = cl.find(text=True, recursive=False)
    c = cl.select_one('span')
    cc = re.findall(r"\w+", c.text)[0]
    file += f'{a.get_text(strip=True)};{b.strip()};{cc}\n'
with open('file.csv', 'w', encoding='utf-8') as f:
    f.write(file)

This saves a file with the following contents:

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
Senoculus cambridgei;Mello-Leitão, 1927;Brazil
Senoculus canaliculatus;F. O. Pickard-Cambridge, 1902;Mexico
Senoculus carminatus;Mello-Leitão, 1927;Brazil
Senoculus darwini;(Holmberg, 1883);Argentina
Senoculus fimbriatus;Mello-Leitão, 1927;Brazil
Senoculus gracilis;(Keyserling, 1879);Guyana
Senoculus guianensis;Caporiacco, 1947;j
Senoculus iricolor;(Simon, 1880);Brazil
Senoculus maronicus;Taczanowski, 1872;French

and so on.
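The truncated country names near the end of that listing ("French" instead of "French Guiana", for instance) come from `re.findall(r"\w+", c.text)[0]`, which keeps only the first word of the span. A sketch of an alternative, assuming the span text carries a leading bar such as `"| French Guiana"` (the exact markup has not been verified here):

```python
import re

span_text = "| French Guiana"  # assumed span contents, including the leading bar

# first-word-only extraction, as in the answer above
print(re.findall(r"\w+", span_text)[0])  # French

# stripping the bar and surrounding spaces keeps the full name
print(span_text.strip("| "))  # French Guiana
```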
