Page returned by a Python POST differs from the page returned by the browser

Posted 2024-09-30 08:25:16


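I am trying to programmatically submit a gene list to the well-known website DAVID (http://david.abcc.ncifcrf.gov/summary.jsp) for functional annotation. There are two other routes, the API service (http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html) and the web service (http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html), but the former has stricter query limits and the latter does not accept my ID type (http://david.abcc.ncifcrf.gov/forum/viewtopic.php?f=14&t=885). So the only option seems to be a program that posts the form, parses the result page, and extracts the download links. After monitoring the traffic with the Firefox add-on "httpFox", I tried the following script: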

import time

import requests as rq

_n = 1
url0 = 'http://david.abcc.ncifcrf.gov'
url = 'http://david.abcc.ncifcrf.gov/summary.jsp'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:30.0) Gecko/20100101 Firefox/30.0'

def get_cookie(session_id): # prepare 'Cookie' in the headers for the post
    domain_hash = '260267544' # according to what's been sent by firefox 
    random_uid = '1113731634' # according to what's been sent by firefox
    global _t0
    init_time = _t0
    global _t 
    prev_time = _t
    _t = int(time.time())
    curr_time = _t
    global _n
    _n += 1
    session_count = _n
    campaign_count = 1
    utma = '.'.join(str(x) for x in (domain_hash, random_uid, init_time, prev_time, curr_time, session_count))
    utmz = '.'.join(str(x) for x in (domain_hash, init_time, session_count, campaign_count, 'utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'))
    cookie = '; '.join(str(x) for x in ('__utma=' + utma, '__utmz=' + utmz, 'JSESSIONID=' + session_id)) 
    return cookie

# first get the session ID
_t = int(time.time())
_t0 = _t
headers = {'User-Agent' : user_agent}
r = rq.get(url, headers = headers) 
session_id = r.cookies['JSESSIONID']
cookie = get_cookie(session_id)

# get the gene list
gene = []
with open('list.txt', 'r') as fh:
    for line in fh:
        gene.append(line.rstrip('\n'))

# then post the form
headers = {  # all below is according to what's been sent by firefox
           'Host' : 'david.abcc.ncifcrf.gov',
           'User-Agent' : user_agent, 
           'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
           'Accept-Language' : 'en-US,en;q=0.5', 
           'Accept-Encoding' : 'gzip, deflate',
           'Referer' : url,
           'Cookie': cookie, 
           'Connection' : 'keep-alive', 
#           'Content-Type' : 'multipart/form-data; boundary=---------------------------17914945481928137296675300642',
#           'Content-Length' : '3581'
           }

data = {  # all below is according to what's been sent by firefox
        'idType' : 'OFFICIAL_GENE_SYMBOL',
        'uploadType' : 'list', 
        'multiList' : 'false', 
        'Mode' : 'paste', 
        'useIndex' : 'null',
        'usePopIndex' : 'null', 
        'demoIndex' : 'null', 
        'ids' : '\n'.join(gene), 
        'removeIndex' : 'null', 
        'renameIndex' : 'null', 
        'renamePopIndex' : 'null', 
        'newName' : 'null', 
        'combineIndex' : 'null', 
        'selectedSpecies' : 'null', 
        'SESSIONID' : session_id[-12:], # firefox sends the last 12 characters of 'JSESSIONID' in this field
        'uploadHTML' : 'null', 
        'managerHTML' : 'null', 
        'sublist' : '',
        'rowids' : '',
        'convertedListName' : 'null', 
        'convertedPopName' : 'null', 
        'pasteBox' : '\n'.join(gene), 
        'fileBrowser' : '', 
        'Identifier' : 'OFFICIAL_GENE_SYMBOL', 
        'rbUploadType' : 'list'}

r = rq.post(url, data=data, headers=headers)
if r.status_code == 200:
    # save the raw response bytes so that no text re-encoding is involved
    with open("python.html", 'wb') as fh:
        fh.write(r.content)

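However, the page my code gets is 272KB and is completely different from the 428KB page that httpFox shows the browser receiving. I compared the headers and form data sent by my script with those sent by Firefox, and the only differences seem to be: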

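  1. the cookie fields __utma and __utmz, but they are related to Google Analytics, so they do not sound important, and
  2. the 'Content-Type' and 'Content-Length' fields, which I commented out in the second headers dictionary. Following the advice in "Is Python requests doing something wrong here, or is my POST request lacking something?", it seems unnecessary to specify them manually. However, even after I commented them out, it still does not work.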
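One thing that may be worth checking here: the commented-out 'Content-Type' line in the script records that Firefox sent the form as multipart/form-data, whereas requests encodes a dictionary passed through data= as application/x-www-form-urlencoded. Whether DAVID handles the two encodings differently is an assumption on my part, not something established above. The sketch below only prepares two requests (nothing is sent over the network) to show the difference; the two-field form is a shortened stand-in for the full one.

```python
import requests

URL = "http://david.abcc.ncifcrf.gov/summary.jsp"
# shortened stand-in for the full form dictionary in the script
fields = {"idType": "OFFICIAL_GENE_SYMBOL", "ids": "Apba3\nDexi"}

# data=dict is encoded as application/x-www-form-urlencoded
urlencoded_req = requests.Request("POST", URL, data=fields).prepare()

# files={name: (None, value)} is encoded as multipart/form-data,
# with each field sent as a part that has no filename
multipart_fields = {name: (None, value) for name, value in fields.items()}
multipart_req = requests.Request("POST", URL, files=multipart_fields).prepare()

print(urlencoded_req.headers["Content-Type"])
print(multipart_req.headers["Content-Type"])
```

In both cases requests fills in 'Content-Type' and 'Content-Length' itself, which is consistent with the advice not to set them by hand.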

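That is the basic situation; I would appreciate any help pinpointing the problem. I have also seen other suggestions, such as trying the browser emulator "mechanize". But I am more curious about the cause: is something wrong with my program (and if so, how do I fix it), or are these modules simply not up to the task? Thanks.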
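The header and form comparison described above can be made systematic with a small helper. This is only a sketch: diff_fields is a hypothetical name, and the two dictionaries hold made-up values standing in for what httpFox and the script actually report.

```python
def diff_fields(browser, script):
    """Return the fields that differ between two header or form dictionaries."""
    only_browser = {k: browser[k] for k in browser.keys() - script.keys()}
    only_script = {k: script[k] for k in script.keys() - browser.keys()}
    changed = {
        k: (browser[k], script[k])
        for k in browser.keys() & script.keys()
        if browser[k] != script[k]
    }
    return only_browser, only_script, changed


# hypothetical values, standing in for what httpFox and the script report
browser_headers = {
    "Host": "david.abcc.ncifcrf.gov",
    "Content-Type": "multipart/form-data",
    "Connection": "keep-alive",
}
script_headers = {
    "Host": "david.abcc.ncifcrf.gov",
    "Content-Type": "application/x-www-form-urlencoded",
}

only_browser, only_script, changed = diff_fields(browser_headers, script_headers)
print("only in browser:", only_browser)
print("only in script:", only_script)
print("different values:", changed)
```

The comparison is case-sensitive, so header names should be written the same way on both sides before diffing.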

The list I want to post is:

Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras

My procedure for posting from the browser is:

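  1. Open http://david.abcc.ncifcrf.gov/summary.jsp in Firefox.
  2. In the left panel, under "Step 1: Enter Gene List", paste the gene list above into the "A: Paste a list" box, which is shown by default.
  3. Under "Step 2: Select Identifier", pick "OFFICIAL_GENE_SYMBOL" from the drop-down.
  4. Under "Step 3: List Type", check the radio button "Gene List".
  5. Under "Step 4: Submit List", click "Submit List".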

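The browser then returns a new page with a pop-up window prompting the user to choose the species and background. That page is what httpFox traced for this post, and it is what I am trying to capture with my script.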
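To see where the page saved by the script departs from the page the browser received, the two can be diffed line by line. A minimal sketch with the standard difflib module follows; first_differences is a hypothetical helper, and the two tiny pages are made up so that the example runs on its own. In practice the inputs would be the contents of python.html and of a copy of the browser's page saved to disk.

```python
import difflib


def first_differences(browser_html, script_html, limit=20):
    """Return up to `limit` added/removed lines between two pages."""
    diff = difflib.unified_diff(
        browser_html.splitlines(),
        script_html.splitlines(),
        fromfile="browser.html",
        tofile="python.html",
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    )
    lines = list(diff)[2:]  # drop the "---" / "+++" file header
    changed = [line for line in lines if not line.startswith("@@")]
    return changed[:limit]


# tiny made-up pages; real use would read the two saved files instead
browser_page = "<html>\n<title>Summary</title>\n<div id='popup'>Select species</div>\n</html>"
script_page = "<html>\n<title>Summary</title>\n</html>"
for line in first_differences(browser_page, script_page):
    print(line)
```

Lines prefixed with "-" appear only in the browser's page and lines prefixed with "+" only in the script's page.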


1 answer
User
#1 · Posted 2024-09-30 08:25:16

Use Selenium:

from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://david.abcc.ncifcrf.gov/summary.jsp')
sleep(0.1)
query = """Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras"""
# paste the gene list into the list box
listBox = driver.find_element(By.ID, "LISTBox")
listBox.send_keys(query)

# typing "O" in the identifier drop-down jumps to the entry starting with "O"
IDT = driver.find_element(By.ID, "IDT")
IDT.send_keys("O")

# click the first radio button named "rbUploadType" (the list type)
radioCheck = driver.find_element(By.NAME, "rbUploadType")
radioCheck.click()

# submit the form, then accept the alert that follows
submitButton = driver.find_element(By.NAME, "B52")
submitButton.click()
sleep(0.1)
alert = driver.switch_to.alert
alert.accept()
sleep(0.1)
html = driver.page_source

The variable "html" contains the page source.
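Since the goal in the question is to parse the result page and extract the download links, here is a rough sketch of that last step using only the standard library. LinkCollector is a hypothetical helper, and the sample snippet and its link path are made up; in practice the string fed to the parser would be the page source held in "html".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))


# made-up snippet; in practice feed the real page source instead
sample_html = '<p><a href="data/download/chart.txt">Download File</a> <a name="top">top</a></p>'
collector = LinkCollector("http://david.abcc.ncifcrf.gov/summary.jsp")
collector.feed(sample_html)
print(collector.links)
```

Which of the collected links are the actual download links depends on the real page, so some filtering on the link text or path would still be needed.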
