Scraping a table from a wiki. Python with bs4

from bs4 import BeautifulSoup
import requests

URL_TO = 'https://en.wikipedia.org/wiki/Rammstein_discography'
response = requests.get(URL_TO)
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the discography table by its CSS classes
table = soup.find("table", {"class": "wikitable plainrowheaders"})
for row in table.find_all("tr"):
    cells = row.find_all("td")  # data cells of the row
    bells = row.find_all("th")  # header cells of the row (e.g. the album title)
    print(cells, bells)

[<td> <ul><li>Released: 17 May 2019</li> <li>Label: Universal</li> <li>Format: CD, LP, DL</li></ul> </td>, <td>1</td>, <td>5</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>2</td>, <td>1</td>, <td>3</td>, <td>9 </td>, <td> <ul><li>FRA: 50,000 <a href="#cite_note-chartsinfrance-45">[45]</a></li> <li>GER: 260,000<a href="#cite_note-chartsinfrance-45">[45]</a></li> <li>US: 25,000<a href="#cite_note-46">[46]</a></li> <li>WW: 900,000<a href="#cite_note-47">[47]</a></li></ul> </td>, <td> <ul><li>BVMI: 5× Gold<a href="#cite_note-musikindustrie-23">[23]</a></li> <li>BEL: Gold<a href="#cite_note-48">[48]</a></li> <li>SNEP: Gold<a href="#cite_note-snep-44">[44]</a></li> <li>IFPI AUT: 2× Platinum<a href="#cite_note-IFPIAUT-30">[30]</a></li></ul> </td>] [<th scope="row"><a href="/wiki/Untitled_Rammstein_album" title="Untitled Rammstein album">Untitled</a> </th>]

1 Answer

User

#1 · Posted 2024-10-03 06:28:43

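You can use pandas to scrape the table. Note that pd.read_html returns a list of DataFrames, one per table found on the page, which is why the result is indexed with df[1] below.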

import pandas as pd

URL_TO = 'https://en.wikipedia.org/wiki/Rammstein_discography'
df = pd.read_html(URL_TO)
df[1].loc[0, ['Title', 'Album details']].iloc[1]

The 0 above selects the first record, Herzeleid.

Out[26]: 'Released: 24 September 1995 Label: Motor, Slash Format: CD, CS, LP, DL'
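
For comparison, roughly the same plain text can be obtained with bs4 alone by calling get_text() on each cell instead of printing the raw tags. The following is only a minimal sketch: SAMPLE_HTML is a small made-up fragment standing in for the Wikipedia page, and table_rows is a hypothetical helper name, not something from the question.

```python
from bs4 import BeautifulSoup

# Assumption: a tiny hand-written fragment that mimics the structure of the
# Wikipedia table, so the example runs without a network request.
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
  <tr><th scope="col">Title</th><th scope="col">Album details</th></tr>
  <tr>
    <th scope="row"><a href="/wiki/Herzeleid">Herzeleid</a></th>
    <td><ul><li>Released: 24 September 1995</li><li>Label: Motor, Slash</li></ul></td>
  </tr>
</table>
"""

def table_rows(html):
    """Return each table row as a list of plain-text cell values."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any element whose class list contains "wikitable"
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        # Collect th and td together so they stay in document order
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows

rows = table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing a list of tag names to find_all keeps the title and its details in one row, which avoids the separate cells and bells lists from the question.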

You can save the two columns to a CSV file with:

df[1].loc[:, ['Title', 'Album details']].to_csv('text_file.csv', index=False)
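
To see what that call writes without fetching the page, here is a small sketch. The albums DataFrame is a hand-built stand-in for df[1] (an assumption; the real table has many more columns and rows), and the "Other" column is a placeholder added only to show that it is left out.

```python
import pandas as pd

# Assumption: a one-row stand-in for df[1], using the record shown above.
albums = pd.DataFrame({
    "Title": ["Herzeleid"],
    "Album details": [
        "Released: 24 September 1995 Label: Motor, Slash Format: CD, CS, LP, DL"
    ],
    "Other": ["placeholder"],  # hypothetical extra column, not exported
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = albums.loc[:, ["Title", "Album details"]].to_csv(index=False)
print(csv_text)
```

index=False leaves out the row index column, and because the details text contains commas, pandas wraps that field in double quotes so it still reads back as a single column.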

