
Posted 2024-10-04 05:25:48


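When I view this link, https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm, in a browser, the text is displayed cleanly. However, when I parse the page with Beautiful Soup, the output looks different: the layout is completely messed up. Here is the code: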

import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()


Traders in Financial Futures - Futures Only Positions as of June 16, 2015                   
              Dealer            :           Asset Manager/       :            Leveraged           :              Other             :     Nonreportable    :
           Intermediary         :           Institutional        :              Funds             :           Reportables          :       Positions      :
    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short   :
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE   ($100 X INDEX)                               
CFTC Code #221602                                                    Open Interest is    19,721
        97      2,934          0      8,941      1,574        973      6,490     11,975      1,694      1,372        539          0        154         32

Changes from:       June 9, 2015                                     Total Change is:     3,505
        48          0          0      2,013      1,141         70        447      1,369        923        -64          0          0         68          2

Percent of Open Interest Represented by Each Category of Trader
       0.5       14.9        0.0       45.3        8.0        4.9       32.9       60.7        8.6        7.0        2.7        0.0        0.8        0.2

Number of Traders in Each Category                                    Total Traders:        31 
         .          .          0          5          .          .          6          9          .          5          .          0



FWIW, I also tried to install the html2text module, but installing it on Anaconda with !conda config --append channels conda-forge followed by !conda install html2text did not succeed.



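import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
# Decode the raw bytes before doing string operations (assumes Windows-1252).
htm = htm.decode('windows-1252')
# Drop line breaks, then split at the boundary between adjacent <pre> blocks.
htm = htm.replace('\n','').replace('\r','')
htm = htm.split('</pre><pre>')

cleaned = []
for i in htm:
    # Strip the remaining tags and keep the text of each piece.
    i = BeautifulSoup(i,'html.parser' ).get_text()
    cleaned.append(i)

# Write one cleaned piece per line.
with open('trouble.txt','w') as f:
    for line in cleaned:
        f.write('%s\n' % line)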
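
The strip-and-split steps above can be tried offline, without downloading the page. The following is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the real page (assumed here to contain adjacent <pre> blocks with Windows-style line endings), and pre_blocks_to_lines is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two adjacent <pre> blocks
# with Windows-style line endings inside the markup.
SAMPLE_HTML = (
    "<html><body><pre>Open Interest is    19,721\r\n</pre>"
    "<pre>Total Traders:        31\r\n</pre></body></html>"
)


def pre_blocks_to_lines(html):
    """Return one text string per <pre> block (hypothetical helper)."""
    # Drop line breaks so that adjacent blocks meet as '</pre><pre>'.
    flat = html.replace("\n", "").replace("\r", "")
    # Split at the boundary between blocks.
    pieces = flat.split("</pre><pre>")
    # Let Beautiful Soup strip whatever tags remain in each piece.
    return [BeautifulSoup(piece, "html.parser").get_text() for piece in pieces]


lines = pre_blocks_to_lines(SAMPLE_HTML)
for line in lines:
    print(line)
```

Because the split happens on the literal string '</pre><pre>', this only works when nothing (not even whitespace) separates the blocks once line breaks are removed; if the real page differs, the boundary string would need adjusting.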
