HTML parsing with BeautifulSoup produces a different structure from the website

Published 2024-10-04 05:25:48


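When I view this link, https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm, the text is displayed in a clean layout. However, when I try to parse the page with Beautiful Soup, the output looks different: the layout is completely scrambled. Here is the code: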

import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)

The desired output looks like this:

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015                   
-----------------------------------------------------------------------------------------------------------------------------------------------------------
              Dealer            :           Asset Manager/       :            Leveraged           :              Other             :     Nonreportable    :
           Intermediary         :           Institutional        :              Funds             :           Reportables          :       Positions      :
    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short   :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE   ($100 X INDEX)                               
CFTC Code #221602                                                    Open Interest is    19,721
Positions
        97      2,934          0      8,941      1,574        973      6,490     11,975      1,694      1,372        539          0        154         32

Changes from:       June 9, 2015                                     Total Change is:     3,505
        48          0          0      2,013      1,141         70        447      1,369        923        -64          0          0         68          2

Percent of Open Interest Represented by Each Category of Trader
       0.5       14.9        0.0       45.3        8.0        4.9       32.9       60.7        8.6        7.0        2.7        0.0        0.8        0.2

Number of Traders in Each Category                                    Total Traders:        31 
         .          .          0          5          .          .          6          9          .          5          .          0
-----------------------------------------------------------------------------------------------------------------------------------------------------------

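After looking at the page source, it is not clear to me how new lines are distinguished in the markup, and I think that is the root of the problem.

Is there some kind of structure I need to specify in the BeautifulSoup call? I'm lost here, and any help would be much appreciated.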

FWIW, I tried to install the html2text module in Anaconda with "!conda config --append channels conda-forge" followed by "!conda install html2text", but the installation did not succeed.

Cheers

Edit: I figured it out. I'm a clever one.

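import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
# Decode the raw bytes as windows-1252 before doing any string work
htm = htm.decode('windows-1252')
# Drop the line breaks in the raw source, then split where one <pre> block
# ends and the next one begins
htm = htm.replace('\n', '').replace('\r', '')
htm = htm.split('</pre><pre>')

# Strip the remaining tags from each chunk
cleaned = []
for chunk in htm:
    cleaned.append(BeautifulSoup(chunk, 'html.parser').get_text())

# Write one chunk per output line
with open('trouble.txt', 'w') as f:
    for line in cleaned:
        f.write('%s\n' % line)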
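
The same idea can be sketched without string splitting, using only the standard library's html.parser instead of BeautifulSoup. This is a rough illustration, not the method from the edit above. It assumes that each displayed row sits in its own <pre> element, which is what the split on '</pre><pre>' suggests. The names PreTextExtractor and pre_blocks_to_text, and the sample string, are made up for this example. Unlike the code above, it drops any text outside <pre> elements and does not handle nested <pre> tags.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text of each <pre> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []    # text of every finished <pre> element
        self._parts = None  # pieces of the <pre> being read, or None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._parts is not None:
            self.blocks.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Text outside <pre> elements is ignored (an assumption of this sketch)
        if self._parts is not None:
            self._parts.append(data)


def pre_blocks_to_text(html):
    """Return one output line per <pre> element, with raw line breaks removed."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    lines = [b.replace("\r", "").replace("\n", "") for b in parser.blocks]
    return "\n".join(lines)


# Made-up sample input, not the real CFTC page
sample = "<html><body><pre>Long  :   Short</pre><pre>  97      2,934</pre></body></html>"
print(pre_blocks_to_text(sample))
```

To use it on the real page, the decoded string from htm.decode('windows-1252') could be passed to pre_blocks_to_text in place of the sample. Whether the result matches the desired layout depends on the page actually following the one-row-per-<pre> assumption.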
