Getting a value from script content using Python

Published 2024-09-27 00:19:03


I am writing a news spider and want to use Python to get the pubtime value from a script on the page. So far I can get the content of the script, which looks like this:

{
        site:'sports',
        site_cname:'体育',
        site_url:'',
        title:'球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 ',
        id:'20170802002470',
        pubtime:'2017-08-02 06:22',
        type:'2',
        article_url:'',     
        sosokeys:{key1:'NBA',key2:'湖人',key3:'球爹',key4:'詹姆斯'},
        tags:['NBA','湖人','球爹','詹姆斯'],
        catalog:'basket',
        catalog_full:'sports-basket-nba',       
        sub_nav:'nba',      
        topic:{name:'',cname:'',ztcatalog:''},
        subName:{name:'basket',url:'', cname:'篮球'},
        isShowLastAD:'',
        tpl:{dev:'nba',ver:'1.0.0.0',time:'20150512',type:'1',stype:''}
}

I tried json.loads() to turn the string into a JSON object, but it fails and raises an error.

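Before hitting this error I had already replaced every ' with ". I understand the likely cause is that every key must be wrapped in double quotes, but there are too many keys here, and wrapping each one by hand does not seem like the best option. Right now I do not know how to get at the value of pubtime. Any suggestions are welcome. Thanks in advance.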


2 Answers

Here is one way to do it with js2xml.

First, grab the JavaScript code you are interested in:

$ scrapy shell http://sports.qq.com/a/20170802/002470.htm
2017-08-04 18:41:23 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(...)
2017-08-04 18:41:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sports.qq.com/a/20170802/002470.htm> (referer: None)

>>> js = response.xpath('//script/text()').get()
>>> print(js)

        ARTICLE_INFO = window.ARTICLE_INFO || {
            site:'sports',
            site_cname:'体育',
            site_url:'http://sports.qq.com',
            title:'球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 ',
            id:'20170802002470',
            pubtime:'2017-08-02 06:22',
            type:'2',
            article_url:'http://sports.qq.com/a/20170802/002470.htm',       
            sosokeys:{key1:'NBA',key2:'湖人',key3:'球爹',key4:'詹姆斯'},
            tags:['NBA','湖人','球爹','詹姆斯'],
            catalog:'basket',
            catalog_full:'sports-basket-nba',       
            sub_nav:'nba',      
            topic:{name:'',cname:'',ztcatalog:''},
            subName:{name:'basket',url:'http://sports.qq.com/nba/', cname:'篮球'},
            isShowLastAD:'',
            tpl:{dev:'nba',ver:'1.0.0.0',time:'20150512',type:'1',stype:''}
            }

Then pass it to js2xml.parse() to get a parse tree:

>>> import js2xml
>>> tree = js2xml.parse(js)

You can inspect what js2xml parsed with js2xml.pretty_print():

>>> print(js2xml.pretty_print(tree))
<program>
  <assign operator="=">
    <left>
      <identifier name="ARTICLE_INFO"/>
    </left>
    <right>
      <binaryoperation operation="||">
        <left>
          <dotaccessor>
            <object>
              <identifier name="window"/>
            </object>
            <property>
              <identifier name="ARTICLE_INFO"/>
            </property>
          </dotaccessor>
        </left>
        <right>
          <object>
            <property name="site">
              <string>sports</string>
            </property>
            <property name="site_cname">
              <string>体育</string>
            </property>
            <property name="site_url">
              <string>http://sports.qq.com</string>
            </property>
            <property name="title">
              <string>球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 </string>
            </property>
            <property name="id">
              <string>20170802002470</string>
            </property>
            <property name="pubtime">
              <string>2017-08-02 06:22</string>
            </property>
            <property name="type">
              <string>2</string>
            </property>
            <property name="article_url">
              <string>http://sports.qq.com/a/20170802/002470.htm</string>
            </property>
            <property name="sosokeys">
              <object>
                <property name="key1">
                  <string>NBA</string>
                </property>
                <property name="key2">
                  <string>湖人</string>
                </property>
                <property name="key3">
                  <string>球爹</string>
                </property>
                <property name="key4">
                  <string>詹姆斯</string>
                </property>
              </object>
            </property>
            <property name="tags">
              <array>
                <string>NBA</string>
                <string>湖人</string>
                <string>球爹</string>
                <string>詹姆斯</string>
              </array>
            </property>
            <property name="catalog">
              <string>basket</string>
            </property>
            <property name="catalog_full">
              <string>sports-basket-nba</string>
            </property>
            <property name="sub_nav">
              <string>nba</string>
            </property>
            <property name="topic">
              <object>
                <property name="name">
                  <string></string>
                </property>
                <property name="cname">
                  <string></string>
                </property>
                <property name="ztcatalog">
                  <string></string>
                </property>
              </object>
            </property>
            <property name="subName">
              <object>
                <property name="name">
                  <string>basket</string>
                </property>
                <property name="url">
                  <string>http://sports.qq.com/nba/</string>
                </property>
                <property name="cname">
                  <string>篮球</string>
                </property>
              </object>
            </property>
            <property name="isShowLastAD">
              <string></string>
            </property>
            <property name="tpl">
              <object>
                <property name="dev">
                  <string>nba</string>
                </property>
                <property name="ver">
                  <string>1.0.0.0</string>
                </property>
                <property name="time">
                  <string>20150512</string>
                </property>
                <property name="type">
                  <string>1</string>
                </property>
                <property name="stype">
                  <string></string>
                </property>
              </object>
            </property>
          </object>
        </right>
      </binaryoperation>
    </right>
  </assign>
</program>

The data you want is the right operand of the || binary operation. You can use XPath on the parse tree to select it:

>>> o = tree.xpath('//binaryoperation/right/object')[0]
>>> o
<Element object at 0x7f6c8c7967e8>
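If only pubtime is needed, the tree can also be queried for that one property. The sketch below is an illustration rather than part of the session above: it uses the standard library's xml.etree.ElementTree on a trimmed, hand-copied piece of the XML printed earlier, so it runs without js2xml. The names XML, root, node and pubtime are made up for the example. With a real js2xml tree (an lxml element), an equivalent expression passed to tree.xpath() should work in the same way.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the js2xml output shown above (only two properties kept).
XML = """
<program>
  <assign operator="=">
    <left><identifier name="ARTICLE_INFO"/></left>
    <right>
      <binaryoperation operation="||">
        <left><identifier name="window"/></left>
        <right>
          <object>
            <property name="id"><string>20170802002470</string></property>
            <property name="pubtime"><string>2017-08-02 06:22</string></property>
          </object>
        </right>
      </binaryoperation>
    </right>
  </assign>
</program>
""".strip()

root = ET.fromstring(XML)

# ElementTree supports a subset of XPath, including [@attr='value'] predicates.
node = root.find(
    ".//binaryoperation/right/object/property[@name='pubtime']/string"
)
pubtime = node.text if node is not None else None
print(pubtime)
```

Selecting by property name avoids building the whole dictionary when a single field is all the spider needs.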

Use js2xml.utils.objects.make to build a Python object from it:

>>> from pprint import pprint
>>> import js2xml.utils.objects
>>> data = js2xml.utils.objects.make(o)
>>> pprint(data)
{'article_url': 'http://sports.qq.com/a/20170802/002470.htm',
 'catalog': 'basket',
 'catalog_full': 'sports-basket-nba',
 'id': '20170802002470',
 'isShowLastAD': '',
 'pubtime': '2017-08-02 06:22',
 'site': 'sports',
 'site_cname': '体育',
 'site_url': 'http://sports.qq.com',
 'sosokeys': {'key1': 'NBA', 'key2': '湖人', 'key3': '球爹', 'key4': '詹姆斯'},
 'subName': {'cname': '篮球',
             'name': 'basket',
             'url': 'http://sports.qq.com/nba/'},
 'sub_nav': 'nba',
 'tags': ['NBA', '湖人', '球爹', '詹姆斯'],
 'title': '球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 ',
 'topic': {'cname': '', 'name': '', 'ztcatalog': ''},
 'tpl': {'dev': 'nba',
         'stype': '',
         'time': '20150512',
         'type': '1',
         'ver': '1.0.0.0'},
 'type': '2'}
>>> 

As @Granitosaurus mentioned, this may look like overkill for such a task, but it comes in handy when the data is not 100% valid JSON (for example, when it uses single quotes).
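To illustrate why json.loads() rejects this kind of text even after the quotes are swapped, here is a small standard-library sketch. The raw string is a shortened, made-up version of the object in the question, and js_literal_to_json is a hypothetical helper, not something from js2xml or the json module. It is deliberately naive: it assumes values never contain quotes or anything that looks like `, key:`.

```python
import json
import re

# Shortened, hypothetical version of the object literal from the question.
raw = (
    "{site:'sports', pubtime:'2017-08-02 06:22', "
    "tags:['NBA','湖人'], tpl:{dev:'nba',type:'1'}}"
)

# Swapping quotes alone is not enough: the keys are still bare identifiers.
try:
    json.loads(raw.replace("'", '"'))
    quotes_only_ok = True
except json.JSONDecodeError:
    quotes_only_ok = False


def js_literal_to_json(text):
    """Naive conversion: swap quotes, then quote bare keys after '{' or ','."""
    text = text.replace("'", '"')
    return re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)


data = json.loads(js_literal_to_json(raw))
print(quotes_only_ok, data["pubtime"])
```

A regex rewrite like this breaks as soon as a value contains an apostrophe or a comma followed by a word and a colon, which is the kind of case where a real JavaScript parser such as js2xml is the safer choice.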

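There are tools that can parse JavaScript variables and the like, most notably js2xml, which is developed by the people behind Scrapy.
Often, though, a simple regex is enough: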

>>> text = "pubtime:'2017-08-02 06:22',"
>>> import re
>>> re.findall("pubtime:'(.+?)'", text)
['2017-08-02 06:22']

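In your case, of course, you would search the whole HTML body with response.body_as_unicode() (response.text in newer Scrapy versions) instead of the predefined text variable.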
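A slightly more defensive version of the same idea might look like the sketch below. The html string is a made-up stand-in for the page body, extract_pubtime is a hypothetical helper name, and the date format is an assumption based on the single sample value in the question.

```python
import re
from datetime import datetime

# Hypothetical stand-in for the page body fetched by the spider.
html = """
<html><head><script>
ARTICLE_INFO = window.ARTICLE_INFO || {
    id:'20170802002470',
    pubtime:'2017-08-02 06:22',
    type:'2'
}
</script></head><body></body></html>
"""


def extract_pubtime(page_text):
    """Return the pubtime string, or None when the pattern is absent."""
    # Tolerate optional whitespace around the colon.
    match = re.search(r"pubtime\s*:\s*'([^']*)'", page_text)
    return match.group(1) if match else None


pubtime = extract_pubtime(html)
# The format string is an assumption based on the sample value.
published = datetime.strptime(pubtime, "%Y-%m-%d %H:%M") if pubtime else None
print(pubtime, published)
```

Using re.search with a None check, instead of indexing the result of re.findall, keeps the spider from raising when a page has no pubtime field.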
