Extracting data from an HTML page (Python)

Published 2024-05-17 06:21:50


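I am trying to extract some data from this page. I want to extract all of the text between two strings (Item 1A Risk Factors and Item 1B Unresolved Staff Comments), but I am having trouble coming up with the right regular expression.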

import re
import urllib
import html2text

url = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"
html = urllib.urlopen(url).read()

text = html2text.html2text(html)

regex= '(?<=Item 1A Risk Factors)(.*)(?=Item 1B Unresolved)'

match = re.search(regex, text, flags=re.IGNORECASE)

print match

The code above prints "None". Any suggestions?


2 Answers

You can strip the HTML tags with this regex.

Find:

<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?: [\S\s]*? )|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

Replace with nothing: ""

Then run this on the resulting string:

1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS

What you want is capture group 1.
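As a rough sketch of these two steps in Python: the long tag-stripping pattern above uses an atomic group (`(?>...)`), which Python's `re` module only accepts from version 3.11 onward, so this example swaps in a much simpler tag pattern as a stand-in. The `sample_html` string and the `extract_risk_factors` helper are made up for illustration, and the `re.DOTALL` and `re.IGNORECASE` flags are assumptions, since the answer does not say which flags it used.

```python
import re

# Hypothetical miniature stand-in for the filing's HTML (not the real page).
sample_html = (
    "<html><body>"
    "<p><b>ITEM 1A. RISK FACTORS</b></p>"
    "<p>The risks described below could affect our business.</p>"
    "<p>Our operations could also be affected.</p>"
    "<p><b>ITEM 1B. UNRESOLVED STAFF COMMENTS</b></p>"
    "</body></html>"
)

# Simplified tag stripper (an assumption for brevity; it is far less
# thorough than the long pattern in the answer).
TAG_RE = re.compile(r"<[^>]+>")

# The answer's second pattern; DOTALL lets (.*?) span line breaks.
SECTION_RE = re.compile(
    r"1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS",
    re.DOTALL | re.IGNORECASE,
)


def extract_risk_factors(html_source):
    """Return the text between the Item 1A and Item 1B headings, or None."""
    text = TAG_RE.sub("", html_source)
    match = SECTION_RE.search(text)
    return match.group(1) if match else None


section = extract_risk_factors(sample_html)
print(section)
```

Because the pattern starts matching at `1B`, group 1 ends with the word `ITEM` that precedes it; trim that off if it matters.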

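You can either word-wrap it in your own application, or paste the group 1 string into a document in the http://www.regexformat.com app, then right-click for the context menu -> Other Utilities -> Word Wrap, and enter a value of about 60 for the maximum line length.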

It produces about 5k of wrapped text that looks like this (truncated).

The risks described below could materially and adversely 
affect our business, results of operations, financial 
condition and liquidity.  Our business operations could also
be affected by additional factors that apply to all 
companies operating in the U.S. and globally.Strategic 
RisksGeneral or macro-economic factors, both domestically 
and internationally, may materially adversely affect our 
financial performance.General economic conditions, globally 
or in one or more of the markets we serve, may adversely 
affect our financial performance.  Higher interest rates, 
lower or higher prices of petroleum products, including 
crude oil, natural gas, gasoline, and diesel fuel, higher 
costs for electricity and other energy, weakness in the 
housing market, inflation, deflation, increased costs of 
essential services, such as medical care and utilities, 
higher levels of unemployment, decreases in consumer 
disposable income, unavailability of consumer credit, higher
consumer debt levels, changes in consumer spending and 
shopping patterns, fluctuations in currency exchange rates, 
higher tax rates, imposition of new taxes and surcharges, 
other changes in tax laws, other regulatory changes, overall
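
For the option of wrapping the text in your own application, a minimal sketch using the standard-library `textwrap` module could look like this. The `group1` variable is a hypothetical stand-in holding only the first sentence of the captured text, and the width of 60 is the value suggested above.

```python
import textwrap

# Hypothetical stand-in for the captured group 1 string.
group1 = (
    "The risks described below could materially and adversely affect "
    "our business, results of operations, financial condition and liquidity."
)

# Wrap at a maximum line length of 60 characters.
wrapped = textwrap.fill(group1, width=60)
print(wrapped)
```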

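If you want to use a regex, you can use the code below, which runs on Python 3.5.2. Try printing your "text" to see what Item 1A actually looks like; it differs from what you see on the web page (in the source it is ITEM&#160;1A). Hope this helps.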

import urllib.request
from urllib.error import URLError, HTTPError
import re
import contextlib

mainpage = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"

try:
    # Fetch the page and decode the raw bytes to text.
    with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
        htmltext = url.read().decode('utf-8')
        #print(htmltext)
except HTTPError:
    print("HTTPError")
except URLError:
    print("URLError")
else:
    # In the raw HTML the headings read "ITEM&#160;1A." and "ITEM&#160;1B.".
    results = re.findall(r'(?=ITEM\&\#160\;1A\.(.*)(RISK FACTORS))(.*)(?=ITEM\&\#160\;1B\.(.*)(UNRESOLVED))', htmltext)
    print(results)
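
To illustrate the point about `&#160;`: once decoded, that entity becomes a non-breaking space (`\xa0`), which a literal space in a pattern does not match. The sketch below shows this on a made-up `fragment` string (not taken from the filing), using `html.unescape` and `\s`, which matches `\xa0` for `str` patterns in Python 3. Whether html2text leaves the text in exactly this form is an assumption, so treat this as one possible explanation for the `None` result rather than a confirmed diagnosis.

```python
import html
import re

# Made-up fragment imitating the markup the answer describes.
fragment = (
    "ITEM&#160;1A. RISK FACTORS\n"
    "Some risk text.\n"
    "ITEM&#160;1B. UNRESOLVED STAFF COMMENTS"
)

decoded = html.unescape(fragment)  # "&#160;" becomes "\xa0"

# A literal space in the pattern does not match the non-breaking space.
literal = re.search(r"Item 1A", decoded, flags=re.IGNORECASE)

# \s matches "\xa0", and DOTALL lets "." cross the newlines.
flexible = re.search(
    r"ITEM\s+1A\.\s*RISK\s+FACTORS(.*?)ITEM\s+1B\.",
    decoded,
    flags=re.IGNORECASE | re.DOTALL,
)

print(literal)
print(flexible.group(1).strip() if flexible else None)
```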
