<p>First problem: the data is actually in an iframe inside a frame; you need to look at <a href="https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=APC" rel="nofollow">https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=APC</a> (where you substitute the appropriate symbol at the end of the url).</p>
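That per-symbol URL can be built by simple string substitution; a minimal sketch (the helper name <code>summary_url</code> is my own, not part of the answer's code):

```python
BASE_URL = ('https://www.schwab.wallst.com/public/research/stocks/'
            'summary.asp?user_id=schwabpublic&symbol=')

def summary_url(symbol):
    # Append the ticker symbol to the end of the iframe URL
    return BASE_URL + symbol
```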
<p>Second problem: extracting the data from the page. I personally like lxml and xpath, but there are many packages that will do the job. I would expect some code like</p>
<pre><code>import urllib2
import lxml.html
import re

re_dollars = r'\$?\s*(\d+\.\d{2})'

def urlExtractData(url, defs):
    """
    Get html from url, parse according to defs, return as dictionary
    defs is a list of tuples ("name", "xpath", "regex", fn )
    name becomes the key in the returned dictionary
    xpath is used to extract a string from the page
    regex further processes the string (skipped if None)
    fn casts the string to the desired type (skipped if None)
    """
    page = urllib2.urlopen(url)  # can modify this to include your cookies
    tree = lxml.html.parse(page)
    res = {}
    for name, path, reg, fn in defs:
        txt = tree.xpath(path)[0]
        if reg is not None:
            match = re.search(reg, txt)
            txt = match.group(1)
        if fn is not None:
            txt = fn(txt)
        res[name] = txt
    return res

def getStockData(code):
    url = 'https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=' + code
    defs = [
        ("stock_name",   '//span[@class="header1"]/text()', None, str),
        ("stock_symbol", '//span[@class="header2"]/text()', None, str),
        ("last_price",   '//span[@class="neu"]/text()', re_dollars, float)
        # etc
    ]
    return urlExtractData(url, defs)
</code></pre>
<p>When called as</p>
<pre><code>print getStockData('MSFT')
</code></pre>
<p>it returns</p>
<pre><code>{'stock_name': 'Microsoft Corp', 'last_price': 25.690000000000001, 'stock_symbol': 'MSFT:NASDAQ'}
</code></pre>
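As an aside, the <code>re_dollars</code> pattern can be exercised on its own; a small sketch (<code>parse_price</code> is my name for the regex-plus-cast step the helper performs inline):

```python
import re

# Same pattern as re_dollars above: optional "$", optional whitespace,
# then a number with two decimal places in a capture group
re_dollars = r'\$?\s*(\d+\.\d{2})'

def parse_price(txt):
    # Pull the numeric part out of a price string and cast it to float
    match = re.search(re_dollars, txt)
    if match is None:
        raise ValueError('no price found in %r' % txt)
    return float(match.group(1))

print(parse_price('$ 25.69'))  # 25.69
```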
<p>Third problem: the markup on that page is presentational, not structural, which means that code based on it is likely to be fragile; any change to the page structure (or variation between pages) will require the xpaths to be rewritten.</p>
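One way to make that fragility show up early is to fail loudly when an xpath stops matching, instead of getting a bare <code>IndexError</code> from <code>tree.xpath(path)[0]</code>; a minimal sketch (<code>extract_first</code> is my own wrapper, not part of the answer's code):

```python
import lxml.html

def extract_first(tree, path):
    # Return the first xpath hit, or raise an error naming the xpath
    # so a page-layout change produces a clear message
    hits = tree.xpath(path)
    if not hits:
        raise ValueError('xpath matched nothing: %s' % path)
    return hits[0]

# A stand-in snippet mimicking the class names used above
tree = lxml.html.fromstring('<span class="header1">Microsoft Corp</span>')
print(extract_first(tree, '//span[@class="header1"]/text()'))  # Microsoft Corp
```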
<p>Hope that helps!</p>