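Extracting data from an HTML page (Python)

2 answers

User

Answer 1 · edited 2024-05-17 03:45:51

You can strip the HTML tags with this

Find:

Replace with nothing: ""

Then run this on the resulting string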

1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS

What you want is capture group 1.
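The two steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the tag-stripping pattern `<[^>]*>` is a stand-in (the answer's own "Find" pattern is not shown), the `re.DOTALL` flag is assumed so that `.` can cross line breaks, and `extract_risk_factors` and `sample` are hypothetical names used for illustration.

```python
import re

# Assumed tag-stripping pattern; the answer's own "Find" pattern is not shown.
TAG_RE = re.compile(r"<[^>]*>")

# The section pattern quoted in the answer. re.DOTALL is an assumption so that
# "." also matches newlines inside the section.
SECTION_RE = re.compile(
    r"1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS",
    re.DOTALL,
)


def extract_risk_factors(html_text):
    """Return capture group 1 (the text between the headings), or None."""
    text = TAG_RE.sub("", html_text)   # step 1: replace tags with nothing
    match = SECTION_RE.search(text)    # step 2: run the section regex
    return match.group(1).strip() if match else None


# Hypothetical miniature input, not taken from the real filing.
sample = (
    "<p><b>ITEM 1A. RISK FACTORS</b></p>"
    "<p>The risks described below could matter.</p>"
    "<p><b>ITEM 1B. UNRESOLVED STAFF COMMENTS</b></p>"
)
print(extract_risk_factors(sample))
```

Because the pattern starts at `1A` and stops at `1B`, anything sitting directly before `1B` (such as the word ITEM) remains at the end of group 1 and may need trimming.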

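You can word-wrap it in your own application, or

paste the group 1 string into a document in the app from http://www.regexformat.com,
then right-click for the context menu -> Other Utilities -> Word Wrap.
Enter a value of about 60 for "Max line length".

It outputs about 5k of wrapped text, like this (truncated):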

The risks described below could materially and adversely 
affect our business, results of operations, financial 
condition and liquidity.  Our business operations could also
be affected by additional factors that apply to all 
companies operating in the U.S. and globally.Strategic 
RisksGeneral or macro-economic factors, both domestically 
and internationally, may materially adversely affect our 
financial performance.General economic conditions, globally 
or in one or more of the markets we serve, may adversely 
affect our financial performance.  Higher interest rates, 
lower or higher prices of petroleum products, including 
crude oil, natural gas, gasoline, and diesel fuel, higher 
costs for electricity and other energy, weakness in the 
housing market, inflation, deflation, increased costs of 
essential services, such as medical care and utilities, 
higher levels of unemployment, decreases in consumer 
disposable income, unavailability of consumer credit, higher
consumer debt levels, changes in consumer spending and 
shopping patterns, fluctuations in currency exchange rates, 
higher tax rates, imposition of new taxes and surcharges, 
other changes in tax laws, other regulatory changes, overall
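If you would rather wrap the text in your own application, Python's standard `textwrap` module is one option. A minimal sketch, assuming a maximum line length of 60 as suggested above; `wrap_text` and `sample` are hypothetical names, and the exact line breaks may differ slightly from the tool's output.

```python
import textwrap


def wrap_text(text, max_line_length=60):
    """Break text into lines of at most max_line_length characters."""
    return textwrap.fill(text, width=max_line_length)


# Hypothetical stand-in for the group 1 string.
sample = (
    "The risks described below could materially and adversely affect "
    "our business, results of operations, financial condition and liquidity."
)
wrapped = wrap_text(sample)
print(wrapped)
```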

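User

Answer 2 · edited 2024-05-17 03:45:51

If you want to use a regex, you can use the code below, which runs on Python 3.5.2. Try printing your "text" to see how Item 1A actually appears in the raw HTML: it is ITEM&#160;1A, which differs from what you see on the rendered web page. Hope this helps.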

import urllib.request
from urllib.error import URLError, HTTPError
import re
import contextlib

mainpage = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"

try:
    # Download the filing and decode it to a string
    with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
        htmltext = url.read().decode('utf-8')
        #print(htmltext)
except HTTPError as e:
    # The server answered with an error status
    print("HTTPError", e.code)
except URLError as e:
    # The server could not be reached
    print("URLError", e.reason)
else:
    # Match against the raw HTML, where the space in "ITEM 1A" is the entity &#160;
    results = re.findall(r'(?=ITEM\&\#160\;1A\.(.*)(RISK FACTORS))(.*)(?=ITEM\&\#160\;1B\.(.*)(UNRESOLVED))',htmltext)
    print(results)
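To see why the heading has to be matched as `ITEM&#160;1A`, here is a small offline sketch. It is not the code from the answer: it assumes you decode the entity first with `html.unescape` and then let `\s` match the resulting non-breaking space, and the `raw` string is a made-up fragment rather than the real filing.

```python
import html
import re

# Made-up fragment imitating how the headings look in the raw HTML.
raw = (
    "<b>ITEM&#160;1A. RISK FACTORS</b>Some risk text."
    "<b>ITEM&#160;1B. UNRESOLVED STAFF COMMENTS</b>"
)

# A plain "ITEM 1A" is not present in the raw HTML.
print("ITEM 1A" in raw)

# Decoding turns &#160; into the non-breaking space character U+00A0.
decoded = html.unescape(raw)

# In Python 3 str patterns, \s also matches U+00A0.
pattern = re.compile(
    r"ITEM\s+1A\.\s*RISK FACTORS(.*?)ITEM\s+1B\.\s*UNRESOLVED",
    re.DOTALL,
)
match = pattern.search(decoded)
section = match.group(1) if match else None
print(section)
```

The captured section still contains HTML tags here, so you would strip those separately if you only want the text.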
