AttributeError: 'HTMLParser' object has no attribute 'unescape'

from bs4 import BeautifulSoup
from io import BytesIO
import requests
import datetime
import re
import rows

# date = datetime.datetime.strptime("2013-1-25", '%Y-%m-%d').strftime('%m/%d/%y')
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
tabela = soup.find("table")
for tag in tabela.find_all('table'):
    _ = tag.replaceWith('')
soup_tr = tabela.findAll("tr")
lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]
s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
s = re.sub("()", "", s, flags=re.DOTALL)
table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))

  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\megasena.py", line 6, in <module>
    import rows
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\__init__.py", line 22, in <module>
    import rows.plugins as plugins
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module>
    from . import plugin_html as html
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module>
    unescape = HTMLParser().unescape
AttributeError: 'HTMLParser' object has no attribute 'unescape'
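For context on the traceback: the error is raised while `rows` is being imported, not by the script's own logic. Line 43 of `plugin_html.py` looks up `HTMLParser().unescape`, a method that was deprecated and then removed from the standard library in Python 3.9; the supported replacement is the module-level `html.unescape`. The sketch below illustrates the difference and a version-tolerant fallback. The helper name `get_unescape` and the sample strings are made up for illustration and are not part of `rows`.

```python
import html
from html.parser import HTMLParser

# HTMLParser.unescape() no longer exists on Python 3.9+, so this attribute
# lookup is what fails inside rows/plugins/plugin_html.py on newer interpreters.
has_old_method = hasattr(HTMLParser(), "unescape")

# html.unescape() has been available since Python 3.4 and does the same job.
decoded = html.unescape("1&ordf; Dezena &amp; Rateio")


def get_unescape():
    """Hypothetical helper: return an unescape function on old and new Pythons."""
    try:
        return HTMLParser().unescape  # only present before Python 3.9
    except AttributeError:
        return html.unescape          # Python 3.9 and later


unescape = get_unescape()
print(has_old_method, decoded, unescape("&lt;table&gt;"))
```

In practice the usual ways out are to install a `rows` release that no longer relies on the removed method, or to run the script on a Python version older than 3.9; which `rows` releases are affected is not stated in this thread.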

1 Answer

User

#1 · Posted on 2024-05-19 17:04:14

This doesn't actually fix your error, but there are simpler ways to parse a table from a website than the approach you've started with.

Here is one of them:

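from io import StringIO

import pandas as pd
import requests

page = requests.get("http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM")
# read_html returns a list of DataFrames, one per <table> it finds;
# wrapping the text in StringIO avoids the literal-HTML deprecation in recent pandas
tables = pd.read_html(StringIO(page.text), flavor="bs4")
print(tables)
# to_csv returns None when writing to a path, so its result is not assigned
pd.concat(tables).to_csv("your_magnificent_table.csv", index=False)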

Output:

[     Concurso Data Sorteio  1ª Dezena  ...  Rateio_Quadra  Acumulado  Valor_Acumulado
0           1   11/03/1996          4  ...          33021        SIM      1.714.65023
1           2   18/03/1996          9  ...          20891        NÃO        750.04891
2           3   25/03/1996         10  ...          15301        NÃO              000
3           4   01/04/1996          1  ...          18048        SIM        717.08075
4           5   08/04/1996          1  ...           9653        SIM      1.342.48885
..        ...          ...        ...  ...            ...        ...              ...
397       398   21/09/2002         28  ...          14129        NÃO              000
398       399   25/09/2002         59  ...          22501        SIM      5.676.17141
399       400   28/09/2002         29  ...          20314        SIM      6.869.04791
400       401   02/10/2002         50  ...          28818        SIM      7.859.38989
401       402   05/10/2002         27  ...          14808        SIM      9.248.37354

[402 rows x 16 columns]]

Or, if you prefer, here is a .csv file (actually, just part of it):

By the way, parsing HTML with regular expressions is frowned upon and generally considered a poor choice. Here's more on the topic.
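To make the point about regular expressions concrete, here is a minimal sketch of pulling table rows out with a real parser instead, using only the standard library's `html.parser`. The class name `TableRows` and the hand-written `sample` snippet are assumptions made for illustration (the sample values mirror the first two rows of the output above); it does not handle nested tables, which the real page contains, so treat it as a starting point rather than a drop-in solution.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &amp;
        self.rows = []
        self._row = None    # cells of the <tr> currently open, if any
        self._cell = None   # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample input, not the real page.
sample = """
<table>
  <tr><th>Concurso</th><th>Data Sorteio</th><th>Acumulado</th></tr>
  <!-- comments are skipped by the parser, no regex needed -->
  <tr><td>1</td><td>11/03/1996</td><td>SIM</td></tr>
  <tr><td>2</td><td>18/03/1996</td><td>N&Atilde;O</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
parser.close()
print(parser.rows)
```

The parser treats comments and character entities as part of the grammar, so there is nothing to strip out by hand before reading the cells.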
