Getting the next td element when web scraping in Python

Posted 2024-04-18 22:23:50

I am trying to scrape an element from a web page. The relevant markup looks like this:

 <td align="right" valign="top" class="tabletext2" nowrap="nowrap"> <strong>Program Element Code(s):</strong></td>

The website is http://www.nsf.gov/awardsearch/showAward?AWD_ID=1227110&HistoricalAwards=false

The Python script is as follows:

import requests
from bs4 import BeautifulSoup

i = str(1300138)
url = "http://www.nsf.gov/awardsearch/showAward?AWD_ID=" + i + "&HistoricalAwards=false"
r = requests.get(url)
sp = BeautifulSoup(r.content, "html.parser")
# Select the label cells: <td class="tabletext2" nowrap="nowrap">
gd = sp.find_all("td", {"class": "tabletext2"}, nowrap="nowrap")
for item in gd:
    print(item.text)
    if item.text == "Program Element Code(s):":
        print(item.contents)

But I can't get it to work. I need to grab the IDs that appear next to "Program Reference Code(s)". Any help is appreciated. Thanks.


1 Answer
Answer 1 · posted 2024-04-18 22:23:50

One approach is to find the td with "class":"tabletext2" whose text is the label you want, then take the next td after it:

import requests
from bs4 import BeautifulSoup

url = "http://www.nsf.gov/awardsearch/showAward?AWD_ID=1227110&HistoricalAwards=false"
r = requests.get(url)

# Collect every td with class="tabletext2"
tds = BeautifulSoup(r.content, "html.parser").find_all("td", {"class": "tabletext2"})

# For the matching label cell, the value is in the next td in document order
print([td.find_next("td").text.strip() for td in tds if td.text.strip().startswith("Program Reference Code(s)")])

[u'131E, 113E, 8048, 7433']
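To try the same idea without hitting the network, here is a minimal sketch that runs on a hand-written HTML fragment. The fragment, the placeholder value "0000", and the helper name value_after_label are assumptions for illustration; the real page layout may differ. It also shows why an exact == comparison on the cell text can miss: the label cell in the question's markup has a leading space before the strong tag, so the sketch strips whitespace before matching.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the td shown in the question; not the real page.
HTML = """
<table>
  <tr>
    <td align="right" valign="top" class="tabletext2" nowrap="nowrap"> <strong>Program Reference Code(s):</strong></td>
    <td align="left" valign="top">131E, 113E, 8048, 7433</td>
  </tr>
  <tr>
    <td align="right" valign="top" class="tabletext2" nowrap="nowrap"> <strong>Program Element Code(s):</strong></td>
    <td align="left" valign="top">0000</td>
  </tr>
</table>
"""


def value_after_label(html, label):
    """Return the text of the td that follows the label cell, or None if no label matches."""
    soup = BeautifulSoup(html, "html.parser")
    for td in soup.find_all("td", {"class": "tabletext2"}):
        # strip=True drops the leading whitespace that breaks an exact == comparison
        if td.get_text(strip=True).startswith(label):
            nxt = td.find_next("td")  # next td in document order
            return nxt.get_text(strip=True) if nxt else None
    return None


print(value_after_label(HTML, "Program Reference Code(s)"))
```

Returning None when no label matches lets the caller distinguish a missing field from an empty one, which can help when looping over many award IDs.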
