BeautifulSoup does not return all the data

Posted 2024-09-19 23:34:15


I am trying to parse some moon-phase data with the Python library BeautifulSoup.

from bs4 import BeautifulSoup
import urllib2

moon_url = "http://www.moongiant.com/phase/today/"


try:
    rqest =  urllib2.urlopen(moon_url)
    moon_Soup = BeautifulSoup(rqest, 'lxml')
    moon_angle = 0
    moon_illumination = 0
    main_data = moon_Soup.find('div', {'id' : 'moonDetails'})
    print main_data

except urllib2.URLError:
    print "Error"

But instead of this output:

^{pr2}$

I only get this:

<div id="moonDetails">
</div>

Any ideas?


3 answers

Another approach, which I borrowed from root's answer to "access Chrome DOM".

The idea is that you can use selenium together with lxml to access the DOM of a page after JavaScript has loaded and processed it.

>>> moon_url = "http://www.moongiant.com/phase/today/"
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> 
>>> browser = webdriver.Chrome()
>>> browser.get(moon_url)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> type(content)
<class 'str'>
>>> open('c:/scratch/content.htm','w').write(content)
27070

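Once that is done, as the last few statements above show, you have access to the DOM either as HTML or as a tree suitable for processing with lxml. In your case you would probably prefer to make soup from the HTML, which means applying BeautifulSoup to content.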
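A minimal sketch of that last step, assuming content holds the rendered page source. The sample markup below is a hypothetical stand-in, not the site's real HTML, and find_moon_details is a helper name made up for illustration. It uses 'html.parser' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for browser.page_source after JavaScript has run;
# the real markup on the site may differ.
content = """
<html><body>
  <div id="moonDetails">
    <span>Phase: Waxing Gibbous</span>
  </div>
</body></html>
"""


def find_moon_details(page_html):
    """Return the <div id="moonDetails"> element, or None if it is absent."""
    soup = BeautifulSoup(page_html, "html.parser")
    return soup.find("div", {"id": "moonDetails"})


details = find_moon_details(content)
# Guard against a missing div before asking for its text.
details_text = details.get_text(strip=True) if details is not None else ""
print(details_text)
```

Checking for None first matters here because find() returns None when the div is missing, and calling get_text() on None would raise an AttributeError.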

Incidentally, when I saved content I did find the following structure in the HTML, as one would expect.

^{pr2}$

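As RaminNietzsche said in the comments, you should extract the text of the script inside that particular script tag. For example, you can use a regex or built-in string methods such as split() and strip().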
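For the built-in-methods route, a rough sketch could look like the following. The script text is a hypothetical sample, extract_jarray is a made-up helper name, and the sketch assumes the value assigned to jArray is valid JSON.

```python
import json

# Hypothetical sample of the script text; the real page's script may be
# laid out differently.
script_text = '''
var mArray=[["x"]];
var jArray={"4": ["<b>April 5</b>", "69%", "Waxing Gibbous"]};
'''


def extract_jarray(text, marker="var jArray="):
    """Return the object assigned to jArray, or None if the marker is missing."""
    if marker not in text:
        return None
    # Keep everything after the marker and drop surrounding whitespace.
    tail = text.split(marker, 1)[1].strip()
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so the trailing ';' and any later statements do no harm.
    value, _end = json.JSONDecoder().raw_decode(tail)
    return value


moon_data = extract_jarray(script_text)
print(moon_data["4"][2])
```

Letting the JSON decoder find the end of the object avoids guessing where the closing brace is, which is the fragile part of a split-only approach.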

Code:

from bs4 import BeautifulSoup
import requests
import re
import json

moon_url = "http://www.moongiant.com/phase/today/"
html_source =  requests.get(moon_url).text

moon_soup = BeautifulSoup(html_source, 'html.parser')

data = moon_soup.find_all('script', {'type' : 'text/javascript'})

for d in data:
    d = d.text
    if 'var jArray=' in d:
        jArray = re.search(r'\{(.*?)\}', d).group()
        moon_data = json.loads(jArray)
        print(moon_data)

        #if you want mArray data too, you just have to:
        # 1. add `'var mArray=' in d` in the if clause, and
        # 2. uncomment the following lines
        #mArray = re.search(r'\[+(.*?)\];', d).group()
        #print(mArray)

Output:

^{pr2}$

Because it is loaded as JSON, you can navigate it like this:

Example code:

print(moon_data['4'])
print('-' * 5)
print(moon_data['4'][2])

Output:

['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153', 'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795', 'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703', 'Waxing Gibbous', 'April 5']
-----
Sun Angle: 0.53276322269153
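Since most entries are "Label: number" strings, one possible follow-up is to collect them into a dict of floats. This is only a sketch: labelled_values is a hypothetical helper, and the entry is copied from the output shown above.

```python
# Sample entry copied from the output shown above.
entry = ['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153',
         'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795',
         'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703',
         'Waxing Gibbous', 'April 5']


def labelled_values(items):
    """Collect 'Label: number' strings into a dict of floats."""
    result = {}
    for item in items:
        label, sep, value = item.partition(": ")
        if not sep:
            continue  # not in "Label: value" form, e.g. 'Waxing Gibbous'
        try:
            result[label] = float(value)
        except ValueError:
            continue  # the value is not numeric
    return result


values = labelled_values(entry)
print(values["Moon Angle"])
```

Entries without a label, such as the date and the phase name, are skipped and can still be read by index.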

In the end, following RaminNietzsche's comment, I used the dryscrape library.

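from bs4 import BeautifulSoup
import urllib2
import dryscrape

moon_url = "http://www.moongiant.com/phase/today/"

try:
    # Raises URLError early if the page cannot be reached.
    rqest =  urllib2.urlopen(moon_url)
    # dryscrape loads the page in a WebKit session, so the body it returns
    # includes content filled in by JavaScript.
    session = dryscrape.Session()
    session.visit(moon_url)
    response = session.body()
    soup = BeautifulSoup(response, 'lxml')

    moon_data = soup.findAll('div', {'id':'moonDetails'})
    print moon_data

except urllib2.URLError:
    print "Error"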

So the output is now:

^{pr2}$

Thanks, everyone, for the answers!
