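I am trying to send a gene list programmatically to the well-known DAVID website (http://david.abcc.ncifcrf.gov/summary.jsp) for functional annotation. There are two other routes, the API service (http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html) and the web service (http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html), but the former has stricter query limits and the latter does not accept my ID type (http://david.abcc.ncifcrf.gov/forum/viewtopic.php?f=14&t=885). The only remaining option seems to be a program that posts the form, parses the result page, and extracts the download link. Using the Firefox add-on "httpFox" to monitor the traffic, I tried the following script: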
import urllib
import urllib2
import requests as rq
import time
_n = 1
url0 = 'http://david.abcc.ncifcrf.gov'
url = 'http://david.abcc.ncifcrf.gov/summary.jsp'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:30.0) Gecko/20100101 Firefox/30.0'
def get_cookie(session_id): # prepare 'Cookie' in the headers for the post
    domain_hash = '260267544' # according to what's been sent by firefox
    random_uid = '1113731634' # according to what's been sent by firefox
    global _t0
    init_time = _t0
    global _t
    prev_time = _t
    _t = int(time.time())
    curr_time = _t
    global _n
    _n += 1
    session_count = _n
    campaign_count = 1
    utma = '.'.join(str(x) for x in (domain_hash, random_uid, init_time, prev_time, curr_time, session_count))
    utmz = '.'.join(str(x) for x in (domain_hash, init_time, session_count, campaign_count, 'utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'))
    cookie = '; '.join(str(x) for x in ('__utma=' + utma, '__utmz=' + utmz, 'JSESSIONID=' + session_id))
    return cookie
# first get the session ID
_t = int(time.time())
_t0 = _t
headers = {'User-Agent' : user_agent}
r = rq.get(url, headers = headers)
session_id = r.cookies['JSESSIONID']
cookie = get_cookie(session_id)
# get the gene list
gene = []
with open('list.txt', 'r') as fh:
    for line in fh:
        gene.append(line.rstrip('\n'))
# then post the form
headers = { # all below is according to what's been sent by firefox
    'Host' : 'david.abcc.ncifcrf.gov',
    'User-Agent' : user_agent,
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-US,en;q=0.5',
    'Accept-Encoding' : 'gzip, deflate',
    'Referer' : url,
    'Cookie': cookie,
    'Connection' : 'keep-alive',
    # 'Content-Type' : 'multipart/form-data; boundary=---------------------------17914945481928137296675300642',
    # 'Content-Length' : '3581'
}
data = { # all below is according to what's been sent by firefox
    'idType' : 'OFFICIAL_GENE_SYMBOL',
    'uploadType' : 'list',
    'multiList' : 'false',
    'Mode' : 'paste',
    'useIndex' : 'null',
    'usePopIndex' : 'null',
    'demoIndex' : 'null',
    'ids' : '\n'.join(gene),
    'removeIndex' : 'null',
    'renameIndex' : 'null',
    'renamePopIndex' : 'null',
    'newName' : 'null',
    'combineIndex' : 'null',
    'selectedSpecies' : 'null',
    'SESSIONID' : session_id[-12:], # according to the pattern that the last 12 characters of 'JSESSIONID' is sent by firefox
    'uploadHTML' : 'null',
    'managerHTML' : 'null',
    'sublist' : '',
    'rowids' : '',
    'convertedListName' : 'null',
    'convertedPopName' : 'null',
    'pasteBox' : '\n'.join(gene),
    'fileBrowser' : '',
    'Identifier' : 'OFFICIAL_GENE_SYMBOL',
    'rbUploadType' : 'list',
}
r = rq.post(url = url, data = data, headers = headers)
if r.status_code == 200:
    with open("python.html", 'w') as fh:
        fh.write(r.text)
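However, the page my code gets is 272KB, quite different from the 428KB of content that httpFox shows being returned to the browser. I compared the headers and form sent by my script with those sent by Firefox, and the only difference seems to be in …
That is the basic situation. I would be grateful if someone could help me pin down exactly what is going wrong. I have also seen other suggestions, such as trying the browser emulator "mechanize". But I am more curious about the cause: is something wrong with my program, and if so, how do I correct it, or are these modules simply not up to the task? Thanks.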
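The header comparison described above can be made mechanical. The following is only a minimal sketch: the function name `diff_headers` is hypothetical, and the sample values are copied from the script in the question, where `Content-Type` and `Content-Length` appear commented out as "sent by firefox". It assumes the headers captured from httpFox have been typed into a dict by hand.

```python
def diff_headers(script_headers, browser_headers):
    """Compare two header dicts, ignoring the case of header names.

    Returns (names only in script, names only in browser,
    names present in both but with different values).
    """
    script = {k.lower(): v for k, v in script_headers.items()}
    browser = {k.lower(): v for k, v in browser_headers.items()}
    only_script = sorted(set(script) - set(browser))
    only_browser = sorted(set(browser) - set(script))
    changed = sorted(k for k in set(script) & set(browser)
                     if script[k] != browser[k])
    return only_script, only_browser, changed


# Headers the script sets explicitly (a subset, taken from the question).
# Note: requests adds its own Content-Type and Content-Length at send time,
# so this only compares what the script itself specifies.
script_headers = {
    "Host": "david.abcc.ncifcrf.gov",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}

# What Firefox sent, according to the commented-out lines in the script.
browser_headers = dict(script_headers)
browser_headers["Content-Type"] = (
    "multipart/form-data; "
    "boundary=---------------------------17914945481928137296675300642"
)
browser_headers["Content-Length"] = "3581"

only_script, only_browser, changed = diff_headers(script_headers, browser_headers)
print("only in script: ", only_script)
print("only in browser:", only_browser)
print("different value:", changed)
```

The same function works for the form fields, since those are also plain name/value dicts.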
The list I want to post is:
Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras
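The steps I follow to post from the browser are:
The browser then returns a new page with a pop-up window asking the user to choose the species and background. That is what httpFox traced in this post, and it is what I am trying to capture with my script.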
Using Selenium:
The variable "html" contains the page source.
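Apart from driving a real browser, one detail in the question's script may be worth checking. The `Content-Type` that Firefox sent (commented out in the script) is `multipart/form-data`, whereas a dict passed to `requests` through `data=` is sent as `application/x-www-form-urlencoded`. Whether this is what makes the returned page differ is not confirmed here. The sketch below only illustrates how the two encodings differ on the wire, using the standard library alone (Python 3); the helper names `encode_urlencoded` and `encode_multipart` are hypothetical, and only three of the form fields are shown.

```python
import uuid
import urllib.parse


def encode_urlencoded(fields):
    """Encode fields as application/x-www-form-urlencoded."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return "application/x-www-form-urlencoded", body


def encode_multipart(fields, boundary=None):
    """Encode plain text fields as multipart/form-data (no file parts)."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")          # blank line separates part headers from value
        lines.append(str(value))
    lines.append("--" + boundary + "--")  # closing boundary
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return "multipart/form-data; boundary=" + boundary, body


# A few fields taken from the form in the question.
fields = {
    "idType": "OFFICIAL_GENE_SYMBOL",
    "uploadType": "list",
    "pasteBox": "\n".join(["Apba3", "Dexi", "Dhps"]),
}

url_type, url_body = encode_urlencoded(fields)
multi_type, multi_body = encode_multipart(fields, boundary="testboundary")

print(url_type, len(url_body))
print(multi_type, len(multi_body))
```

With `requests` itself, the usual way to get a multipart body is the `files=` argument, for example `files={'idType': (None, 'OFFICIAL_GENE_SYMBOL')}` for a plain field. In that case the library generates the boundary and the `Content-Type` and `Content-Length` headers, so they should not be set by hand.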