<p>I don't think you are doing anything wrong.</p>
<p>It is the second script tag that confuses BeautifulSoup. The tag looks like this:</p>
<pre><code><script type='text/javascript'>
<! // ><![CDATA[//><!
var arVersion = navigator.appVersion.split("MSIE")
var version = parseFloat(arVersion[1])
function fixPNG(myImage)
{
if ((version >= 5.5) && (version < 7) && (document.body.filters))
{
var imgID = (myImage.id) ? "id='" + myImage.id + "' " : ""
var imgClass = (myImage.className) ? "class='" + myImage.className + "' " : ""
var imgTitle = (myImage.title) ?
"title='" + myImage.title + "' " : "title='" + myImage.alt + "' "
var imgStyle = "display:inline-block;" + myImage.style.cssText
var strNewHTML = "<span " + imgID + imgClass + imgTitle
+ " style=\"" + "width:" + myImage.width
+ "px; height:" + myImage.height
+ "px;" + imgStyle + ";"
+ "filter:progid:DXImageTransform.Microsoft.AlphaImageLoader"
+ "(src=\'" + myImage.src + "\', sizingMethod='scale');\"></span>"
myImage.outerHTML = strNewHTML
}
}
// ><!]]>
</script>
</code></pre>
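<p>But BeautifulSoup seems to think it is still inside a comment (or something similar) and treats the rest of the file as the content of the script tag.</p>
<p>Try:</p>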
^{pr2}$
<p>and you'll see what I mean.</p>
<p>If you remove the CDATA markers, you should find that the page parses correctly:</p>
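<pre><code>import urllib2
from BeautifulSoup import BeautifulSoup  # assumes BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

# Strip the CDATA markers before parsing so the script tag is closed properly.
url = 'http://www.velocidadcuchara.com/2011/08/helado-platano-light.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html.replace('<![CDATA[', '').replace('<!]]>', ''))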
</code></pre>
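<p>If you need this clean-up in more than one place, it could be factored into a small helper. The following is only a minimal sketch: <code>strip_cdata_markers</code> is a hypothetical name (not part of BeautifulSoup), and <code>sample</code> is a made-up stand-in for the page so the example runs without network access. It removes exactly the two markers used in the snippet above and nothing else.</p>

```python
def strip_cdata_markers(html):
    """Return html with the two CDATA markers removed.

    Hypothetical helper for illustration. Only the literal markers
    '<![CDATA[' and '<!]]>' are removed; the text between them is kept.
    """
    for marker in ("<![CDATA[", "<!]]>"):
        html = html.replace(marker, "")
    return html


# Made-up stand-in for the downloaded page (assumption, not the real HTML).
sample = (
    "<script type='text/javascript'>\n"
    "<! // ><![CDATA[//><!\n"
    "var version = 1\n"
    "// ><!]]>\n"
    "</script>\n"
    "<p>rest of the page</p>\n"
)

cleaned = strip_cdata_markers(sample)
print(cleaned)
```

<p>The cleaned string can then be passed to <code>BeautifulSoup(...)</code> in place of the raw download. A bare <code>]]&gt;</code> is deliberately left alone here, since that sequence can also occur in ordinary JavaScript.</p>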