How to parse script tags using Python and BeautifulSoup

Posted 2024-07-08 08:17:43


I am trying to extract the attributes of the frame tags inside the document.write calls on a page, as shown below:

<script language="javascript">
.
.
.
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>');
 if (anchor != "") {
  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>');
 } else {
  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>');
 }
 document.write('</frameset>');


// end hiding -->
</script>

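The findAll('frame') method did not help. Is there any way to read the contents of the frame tags?

I am using Python 2.5 and BeautifulSoup 3.0.8.

I am also open to using Python 3.1 with BeautifulSoup 3.1, as long as I can get the results.

Thanks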


Tags: no, name, src, html, script, document, frame, write
2 Answers

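Pyparsing can help you bridge the mix of JS and HTML. The parser below looks for document.write statements whose argument is either a single quoted string or a string expression built from several quoted strings and identifiers. It quasi-evaluates the string expression, parses the result for an embedded <frame> tag, and returns the frame attributes as a pyparsing ParseResults object, which lets you access the named attributes as object attributes or as dict keys (your preference).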

jssrc = """
<script language="javascript">
.
.
.
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>'); 
if (anchor != "") 
{  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); } 
else 
{  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); } 
document.write('</frameset>');
    // end hiding  >
    </script>"""

from pyparsing import *

# define some basic punctuation, and quoted string
LPAR,RPAR,PLUS = map(Suppress,"()+")
qs = QuotedString("'")

# use pyparsing helper to define an expression for opening <frame> 
# tags, which includes support for attributes also
frameTag = makeHTMLTags("frame")[0]

# some of our document.write statements contain not a string literal,
# but an expression of strings and vars added together; define
# an identifier expression, and add a parse action that converts
# a var name to a likely value
ident = Word(alphas).setParseAction(lambda toks: evalvars[toks[0]])
evalvars = { 'cusip' : "CUSIP", 'anchor' : "ANCHOR" }

# now define the string expression itself, as a quoted string,
# optionally followed by identifiers and quoted strings added
# together; identifiers will get translated to their defined values
# as they are parsed; the first parse action on stringExpr concatenates
# all the tokens; then the second parse action actually parses the
# body of the string as a <frame> tag and returns the results of parsing
# the tag and its attributes; if the parse fails (that is, if the
# string contains something that is not a <frame> tag), the second
# parse action will throw an exception, which will cause the stringExpr
# expression to fail
stringExpr = qs + ZeroOrMore( PLUS + (ident | qs))
stringExpr.setParseAction(lambda toks : ''.join(toks))
stringExpr.addParseAction(lambda toks: 
    frameTag.parseString(toks[0],parseAll=True))

# finally, define the overall document.write(...) expression
docWrite = "document.write" + LPAR + stringExpr + RPAR

# scan through the source looking for document.write commands containing
# <frame> tags using scanString; print the original source fragment, 
# then access some of the attributes extracted from the <frame> tag
# in the quoted string, using either object-attribute notation or 
# dict index notation
for dw,locstart,locend in docWrite.scanString(jssrc):
    # print() calls are used so the loop runs on Python 3
    print(jssrc[locstart:locend])
    print(dw.name)
    print(dw["src"])
    print()

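For each document.write statement that contains a <frame> tag, this prints the matched source fragment followed by the frame's name and src attributes, with cusip and anchor replaced by the placeholder values CUSIP and ANCHOR.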

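You cannot do this with BeautifulSoup alone. BeautifulSoup parses the HTML as it arrives at the browser (before any rewriting or DOM manipulation); it does not parse (let alone execute) JavaScript.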

In this particular case, you may want to use a simple regular expression.
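
A minimal sketch of the regular-expression approach, using only the standard library, might look like the following. The sample input is a shortened copy of the script from the question; the names extract_frames, substitute_vars and the placeholder value "CUSIP" are assumptions made for illustration, not part of the original answer. The patterns assume single-quoted document.write arguments and double-quoted attribute values, so valueless attributes such as noresize are skipped.

```python
import re

# Shortened sample of the script from the question (illustrative input only).
js_source = """
document.write('<frame name="nav" src="/nav/index_nav.html" scrolling="no" border = "no" noresize>');
document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" scrolling="auto" noresize>');
document.write('</frameset>');
"""

# document.write('...') calls whose argument begins with an opening <frame ...> tag.
DOC_WRITE_FRAME = re.compile(r"document\.write\(\s*'(<frame\s.*?>)'\s*\)", re.DOTALL)

# name="value" pairs, allowing spaces around the equals sign.
ATTRIBUTE = re.compile(r'([A-Za-z][\w-]*)\s*=\s*"([^"]*)"')

# A JavaScript concatenation such as:  ' + cusip + '
JS_CONCAT = re.compile(r"'\s*\+\s*(\w+)\s*\+\s*'")


def substitute_vars(fragment, values):
    """Replace ' + name + ' with values[name]; unknown names are left as-is."""
    return JS_CONCAT.sub(lambda m: values.get(m.group(1), m.group(0)), fragment)


def extract_frames(source, values=None):
    """Return one dict of attributes per <frame> tag written by document.write."""
    values = values or {}
    frames = []
    for match in DOC_WRITE_FRAME.finditer(source):
        tag = substitute_vars(match.group(1), values)
        frames.append(dict(ATTRIBUTE.findall(tag)))
    return frames


if __name__ == "__main__":
    # "CUSIP" is a stand-in value; the real one is only known at run time in the browser.
    for frame in extract_frames(js_source, {"cusip": "CUSIP"}):
        print(frame.get("name"), frame.get("src"))
```

This does not evaluate the JavaScript, so a branch such as the if/else in the question yields one entry per document.write call rather than only the one a browser would execute.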
