Intercepting Ajax responses with Python + Selenium + PhantomJS

Posted 2024-09-30 22:21:24


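I need to extract an email address from a web page. The page contains a link to the email address. When I click the link, it sends an XHR request. The Ajax response is caught by a JS script, which parses the response and opens a mail client.

Because the Ajax response does not change the HTML in any way, I cannot extract the email address by watching the HTML.

I need to capture the Ajax response myself so that I can parse it and save it to a database.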

from selenium import webdriver

#
# Initialize browser etc.
#
driver = webdriver.PhantomJS()
emailLink = driver.find_element_by_class_name('email_add')
emailLink.click()

# There is no change in the HTML, so I can't find the email address.

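I confirmed that the code works by using the Firefox webdriver instead of PhantomJS. Firefox opens a mail client in response to the Ajax reply.

I tried issuing the request using requests and urllib2, but somehow the web server identifies these manually generated requests and redirects them to the home page.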


2 Answers

I tried issuing the request using requests and urllib2, but somehow the webserver identifies these manually generated requests and redirects to the home page.

If that is the problem, make the server think the request comes from a browser by changing the User-Agent header:

Changing user agent on urllib2.urlopen
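
As a minimal sketch of that suggestion, the request below carries a browser-style User-Agent header. It uses urllib.request, the Python 3 successor to urllib2. The URL and the User-Agent string are hypothetical placeholders, not values from the question, and a server may also check cookies or other headers, so this alone may not be enough.

```python
import urllib.request

# Hypothetical values for illustration; neither appears in the original post.
PAGE_URL = "http://example.com/contact"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def build_browser_like_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries a browser-style User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_browser_like_request(PAGE_URL)
# To actually send it: urllib.request.urlopen(request).read()
```

With the requests library the equivalent is to pass the same dictionary as the headers argument of requests.get.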

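I took the interception code from here and wrapped it in a PhantomJS script that injects it into the page I am scraping. Note that the page must be loaded before the XHR interception is injected. You also have to tell PhantomJS to capture and print whatever the page writes with console.log.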
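
The same idea can probably be driven from the Python side of the question. The sketch below assumes a Selenium WebDriver object, whose execute_script method runs JavaScript in the loaded page. The helper names and the window.__xhrLog array are hypothetical names made up for this sketch, and it assumes the responses are plain text.

```python
# JavaScript that wraps XMLHttpRequest.prototype.send and stores each finished
# response body in window.__xhrLog (a hypothetical name chosen for this sketch).
XHR_HOOK_JS = """
(function (XHR) {
    if (window.__xhrLog) { return; }  // hook already installed
    window.__xhrLog = [];
    var send = XHR.prototype.send;
    XHR.prototype.send = function (data) {
        var self = this;
        this.addEventListener("readystatechange", function () {
            if (self.readyState === 4) {
                window.__xhrLog.push(self.responseText);
            }
        }, false);
        return send.apply(this, arguments);
    };
})(XMLHttpRequest);
"""


def install_xhr_hook(driver):
    """Inject the hook; call this after the page has finished loading."""
    driver.execute_script(XHR_HOOK_JS)


def read_xhr_log(driver):
    """Return the response bodies captured so far as a list of strings."""
    return driver.execute_script("return window.__xhrLog || [];")
```

A possible usage is to call install_xhr_hook(driver) once the page is loaded, click the link, wait briefly for the request to complete, and then call read_xhr_log(driver).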

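I used the [functions] technique from Vijay's accepted answer here.

For a more interesting live data source, try http://flightaware.com/live/ instead of maps.google.com in the script below. Be patient: it may take one to five minutes before an update arrives.

Here is the partial PhantomJS script (without the parsing step, sorry, and untested):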

  var page = require('webpage').create(), testindex = 0, loadInProgress = false;

  page.onLoadStarted = function() {
    loadInProgress = true;
    console.log("load started");
  };

  page.onLoadFinished = function() {
    loadInProgress = false;
    console.log("load finished");
  };

  page.onConsoleMessage = function(msg) {
    console.log(msg);
  };

  var steps = [
  function() {
    // Load the page to scrape
    page.open("http://maps.google.com");
  },    
  function() {

    page.render('check.png');  // see what's happened.
    page.evaluate(
     function( x) {
    //inject following code from https://gist.github.com/suprememoocow/2823600
    // I've added console.log() calls along with onConsoleMessage above to see XHR responses.
    (function(XHR) {
        "use strict";

        var stats = [];
        var timeoutId = null;

        var open = XHR.prototype.open;
        var send = XHR.prototype.send;

        XHR.prototype.open = function(method, url, async, user, pass) {
            this._url = url;
            // Forward the original arguments: passing an undefined `async` explicitly can make the request synchronous.
            return open.apply(this, arguments);
        };

        XHR.prototype.send = function(data) {
            var self = this;
            var start;
            var oldOnReadyStateChange;
            var url = this._url;

            function onReadyStateChange() {
                if(self.readyState == 4 /* complete */) {
                    var time = new Date() - start;                
                    stats.push({
                        url: url,
                        duration: time                    
                    });

                    console.log("Request:" + data);
                    console.log("Response:" + self.responseText);

                    if(!timeoutId) {
                        timeoutId = window.setTimeout(function() {
                            var xhr = new XHR();
                            xhr.noIntercept = true;
                            xhr.open("POST", "/clientAjaxStats", true);
                            xhr.setRequestHeader("Content-type","application/json");
                            xhr.send(JSON.stringify({ stats: stats } ));                        

                            timeoutId = null;
                            stats = []; 
                        }, 2000);
                    }                
                }

                if(oldOnReadyStateChange) {
                    oldOnReadyStateChange.apply(self, arguments);
                }
            }

            if(!this.noIntercept) {
                start = new Date();

                if(this.addEventListener) {
                    this.addEventListener("readystatechange", onReadyStateChange, false);
                } else {
                    oldOnReadyStateChange = this.onreadystatechange; 
                    this.onreadystatechange = onReadyStateChange;
                }
            }

            send.call(this, data);
        };
    })(XMLHttpRequest);


     },""
    );
    }, 
    function() {
        // try something else here.  Add more steps as necessary
    }
];

var interval = setInterval(function() {
  if (!loadInProgress && typeof steps[testindex] == "function") {
    console.log("step " + (testindex + 1));
    steps[testindex]();
    testindex++;
  }
  if (typeof steps[testindex] != "function") {
     // commented out to run until ctrl-c
    //console.log("test complete!");
    //phantom.exit();
  }
}, 500);
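
The script above leaves out the parsing step. As a rough sketch of what it might look like on the Python side, the function below scans the console output of the PhantomJS script for lines starting with the "Response:" prefix and pulls out anything that looks like an email address. The response format is an assumption (the question does not show it), the regular expression is deliberately simple, and only the first line of a multi-line response would carry the prefix.

```python
import re

# Prefix written by the console.log call in the PhantomJS script above.
RESPONSE_PREFIX = "Response:"
# Deliberately simple pattern; an assumption, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(console_output):
    """Collect unique email addresses from the 'Response:' lines, in order."""
    found = []
    for line in console_output.splitlines():
        if not line.startswith(RESPONSE_PREFIX):
            continue
        for address in EMAIL_RE.findall(line[len(RESPONSE_PREFIX):]):
            if address not in found:
                found.append(address)
    return found


# Made-up sample output, for illustration only.
sample = "load finished\nRequest:null\nResponse:mailto:jane@example.com\n"
print(extract_emails(sample))
```

The returned list could then be written to a database, which is the last step the question asks for.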
