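When I view this link, https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm, the text is displayed in a clean layout. However, when I try to parse the page with Beautiful Soup, the output looks different and everything is jumbled. Here is the code: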
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)
The desired output looks like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Dealer : Asset Manager/ : Leveraged : Other : Nonreportable :
Intermediary : Institutional : Funds : Reportables : Positions :
Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE ($100 X INDEX)
CFTC Code #221602 Open Interest is 19,721
Positions
97 2,934 0 8,941 1,574 973 6,490 11,975 1,694 1,372 539 0 154 32
Changes from: June 9, 2015 Total Change is: 3,505
48 0 0 2,013 1,141 70 447 1,369 923 -64 0 0 68 2
Percent of Open Interest Represented by Each Category of Trader
0.5 14.9 0.0 45.3 8.0 4.9 32.9 60.7 8.6 7.0 2.7 0.0 0.8 0.2
Number of Traders in Each Category Total Traders: 31
. . 0 5 . . 6 9 . 5 . 0
-----------------------------------------------------------------------------------------------------------------------------------------------------------
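After looking at the page source, it isn't clear to me how new lines are distinguished in the markup, and I think that is the root of the problem.
Do I need to specify some kind of structure in the BeautifulSoup function? I'm lost here, so any help is much appreciated.
FWIW, I tried to install the html2text module on Anaconda using !conda config --append channels conda-forge and !conda install html2text, but the installation did not succeed.
Cheers
Edit: I figured it out. I'm a smart guy.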
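import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
# Decode the raw bytes as windows-1252 before doing any string work.
htm = htm.decode('windows-1252')
# Drop the raw line breaks, then split so each <pre> block becomes one piece.
htm = htm.replace('\n', '').replace('\r', '')
htm = htm.split('</pre><pre>')
cleaned = []
for i in htm:
    # Strip the remaining tags from each piece.
    i = BeautifulSoup(i, 'html.parser').get_text()
    cleaned.append(i)
with open('trouble.txt', 'w') as f:
    for line in cleaned:
        f.write('%s\n' % line)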
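Since html2text would not install for me, here is a rough stdlib-only sketch of the same idea: treat each <pre> element as one output line and throw away the raw line breaks inside it. It assumes the page really does put one report line in each <pre> block, which is what the split on '</pre><pre>' relies on. PreTextCollector, pre_lines, and the sample string are made-up names and data for illustration, not the real CFTC markup.

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Collect the text of each <pre> element as one output line."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []      # one entry per completed <pre> block
        self._parts = None   # text pieces of the open <pre>, or None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._parts is not None:
            # Remove raw line breaks inside the block, as the code above does.
            text = "".join(self._parts).replace("\r", "").replace("\n", "")
            self.lines.append(text)
            self._parts = None

    def handle_data(self, data):
        # Only keep text that sits inside a <pre> element.
        if self._parts is not None:
            self._parts.append(data)


def pre_lines(html_text):
    """Return one string per <pre> element found in html_text."""
    parser = PreTextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.lines


# Hypothetical sample input, not the actual page source.
sample = "<html><body><pre>Dealer :\r\n Asset Manager/ :</pre><pre>97 2,934 0</pre></body></html>"
for line in pre_lines(sample):
    print(line)
```

To try it on the real page, pass the decoded htm string (before the replace and split steps) to pre_lines and write the returned list to the file. I have not checked it against every report on the site.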