I'm trying to extract an HTML table, but it raises an error and I don't know why.
I'd really appreciate some help.
Code:
from bs4 import BeautifulSoup
from io import BytesIO
import requests
import datetime
import re
import rows

# date = datetime.datetime.strptime("2013-1-25", '%Y-%m-%d').strftime('%m/%d/%y')
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

# Take the first table on the page and drop any tables nested inside it.
tabela = soup.find("table")
for tag in tabela.find_all('table'):
    _ = tag.replace_with('')

# Collect the rows and overwrite the first row with a copy of the second.
soup_tr = tabela.find_all("tr")
lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]

# Rebuild a single <table> string and strip HTML comments from it.
s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
s = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)

# Hand the cleaned HTML to rows as an in-memory file.
table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))
The error output is:
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\megasena.py", line 6, in <module>
    import rows
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\__init__.py", line 22, in <module>
    import rows.plugins as plugins
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module>
    from . import plugin_html as html
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module>
    unescape = HTMLParser().unescape
AttributeError: 'HTMLParser' object has no attribute 'unescape'
This doesn't actually fix your error, but there are simpler ways to parse a table from a website than the approach you've started with.
Here is one of them:
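A minimal sketch of one such simpler route, assuming pandas (plus lxml, or bs4 with html5lib) is installed. The choice of pandas.read_html is an assumption here and may not match the snippet originally posted. SAMPLE_HTML and first_table are hypothetical names, and the sample rows are made-up placeholder data, not real lottery results.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for the downloaded page (made-up placeholder data).
SAMPLE_HTML = """
<table>
  <tr><th>Concurso</th><th>Data</th><th>Dezena1</th></tr>
  <tr><td>1</td><td>01/01/2000</td><td>41</td></tr>
  <tr><td>2</td><td>08/01/2000</td><td>9</td></tr>
</table>
"""


def first_table(html_text):
    """Return the first <table> found in html_text as a DataFrame."""
    # read_html returns a list with one DataFrame per table it finds.
    tables = pd.read_html(StringIO(html_text))
    return tables[0]


df = first_table(SAMPLE_HTML)

# Serialize to CSV text; pass a file path to to_csv to write a file instead.
csv_text = df.to_csv(index=False)
```

For the real page, something like first_table(requests.get(url).text) should work along the same lines, though that page may need extra handling for nested tables or its encoding.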
Output:
Or, if you prefer, here is a .csv file (actually just part of it):
By the way, parsing HTML with regular expressions is frowned upon and generally considered a poor choice. Here's more on the topic.
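As an illustration of that point, the comment-stripping step that the question does with re.sub can be left to the parser instead. This is a rough sketch assuming bs4 is installed; strip_comments is a hypothetical helper name, and it uses the built-in html.parser backend rather than lxml so that no html/body wrapper is added to the fragment.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html_text):
    """Return html_text with every HTML comment removed via the parser."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Comment is a NavigableString subclass, so a predicate on string= finds them.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()  # detach the comment node from the tree
    return str(soup)


cleaned = strip_comments("<table><!-- hidden --><tr><td>1</td></tr></table>")
```

Unlike a regular expression, this relies on the parser's own notion of what a comment is, so markup that merely looks like a comment inside text or attributes is not cut by accident.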