AttributeError: 'HTMLParser' object has no attribute 'unescape'

Posted 2024-05-19 17:04:14


I'm trying to extract some HTML tables, but the script fails with an error and I don't know why.

I could really use some help.

Code:

from bs4 import BeautifulSoup
from io import BytesIO
import requests
import datetime
import re
import rows


# date = datetime.datetime.strptime("2013-1-25", '%Y-%m-%d').strftime('%m/%d/%y')
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM'

response = requests.get(url)
html = response.content


soup = BeautifulSoup(html, 'lxml')
tabela = soup.find("table")

for tag in tabela.find_all('table'):
    _ = tag.replaceWith('')


soup_tr = tabela.findAll("tr")
lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]


s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
s = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)


table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))

The error output is as follows:

  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\megasena.py", line 6, in <module>
    import rows
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\__init__.py", line 22, in <module>
    import rows.plugins as plugins
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module>
    from . import plugin_html as html
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module>
    unescape = HTMLParser().unescape
AttributeError: 'HTMLParser' object has no attribute 'unescape'
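
A note on the traceback: the failing line calls `HTMLParser().unescape`, a method that was deprecated in Python 3.4 and removed in Python 3.9, so the import of `rows` fails before any of the script's own code runs. The documented replacement is `html.unescape`. The sketch below only illustrates that replacement; the sample string is made up, and whether a newer `rows` release is available for a given environment is an assumption to check rather than something the traceback confirms.

```python
import html
from html.parser import HTMLParser

# On Python 3.9+ the old method no longer exists, which is exactly
# what the AttributeError in the traceback reports.
has_old_method = hasattr(HTMLParser, "unescape")
print("HTMLParser.unescape available:", has_old_method)

# html.unescape is the standard-library replacement (Python 3.4+).
# The input string here is a made-up example.
text = html.unescape("Mega-Sena &amp; Quina &lt;resultado&gt;")
print(text)
```

Since the broken call lives inside the installed package, the usual ways out are a version of `rows` that uses `html.unescape`, or an interpreter older than 3.9.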

1 answer
User
#1 · Posted 2024-05-19 17:04:14

This doesn't really solve your error, but there are simpler ways to parse a table from a website than the approach you've started with.

Here is one of them:

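from io import StringIO

import pandas as pd
import requests

page = requests.get("http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM")
# read_html returns a list of DataFrames, one per <table> found in the page.
# Wrapping the text in StringIO avoids the deprecation warning in recent pandas.
tables = pd.read_html(StringIO(page.text), flavor="bs4")
print(tables)
# to_csv returns None when given a path, so its result is not assigned.
pd.concat(tables).to_csv("your_magnificent_table.csv", index=False)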

Output:

[     Concurso Data Sorteio  1ª Dezena  ...  Rateio_Quadra  Acumulado  Valor_Acumulado
0           1   11/03/1996          4  ...          33021        SIM      1.714.65023
1           2   18/03/1996          9  ...          20891        NÃO        750.04891
2           3   25/03/1996         10  ...          15301        NÃO              000
3           4   01/04/1996          1  ...          18048        SIM        717.08075
4           5   08/04/1996          1  ...           9653        SIM      1.342.48885
..        ...          ...        ...  ...            ...        ...              ...
397       398   21/09/2002         28  ...          14129        NÃO              000
398       399   25/09/2002         59  ...          22501        SIM      5.676.17141
399       400   28/09/2002         29  ...          20314        SIM      6.869.04791
400       401   02/10/2002         50  ...          28818        SIM      7.859.38989
401       402   05/10/2002         27  ...          14808        SIM      9.248.37354

[402 rows x 16 columns]]

Or, if you prefer, here is the resulting .csv file (part of it, actually):

[Image: screenshot of part of the .csv file]


By the way, parsing HTML with regular expressions is frowned upon and widely considered a poor choice. Here's more on the topic.
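
As one way to avoid the regular expression used in the question, HTML comments can be removed with BeautifulSoup itself, which the question already imports. This is a minimal sketch, assuming a small made-up sample string instead of the real page and the built-in `html.parser` backend instead of `lxml`:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup standing in for the downloaded page.
sample = "<table><!-- header --><tr><td>1</td></tr><!-- footer\nnote --></table>"
soup = BeautifulSoup(sample, "html.parser")

# Comments are parsed into Comment nodes, so they can be found by type
# and detached from the tree, including ones that span several lines.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

cleaned = str(soup)
print(cleaned)
```

Because the parser has already identified what is and is not a comment, this avoids the edge cases a pattern like `<!--.*?-->` can trip over.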
