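I am trying to parse a series of web pages and grab only the 3 paragraphs that follow the header on each page. They all have the same format (I think). I am using urllib2 and Beautiful Soup, but I am not sure how to jump to the header and then grab the next few <p> tags after it. I know the first split("<h1>") is not correct, but it is the only decent attempt I have made so far. Here is my code: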
from __future__ import print_function  # so print() prints a blank line on Python 2

from bs4 import BeautifulSoup
import urllib2
from HTMLParser import HTMLParser

BANNED = ["/events/new"]

def main():
    soup = BeautifulSoup(urllib2.urlopen('http://b-line.binghamton.edu').read(), "html.parser")
    for link in soup.find_all('a'):
        link = link.get('href')
        # Only follow event links that are not on the banned list
        if link is not None and link not in BANNED and "/events/" in link:
            print()
            print(link)
            eventPage = "http://b-line.binghamton.edu" + link
            bLineSubPage = urllib2.urlopen(eventPage)
            bLineSubPageStr = bLineSubPage.read()
            headAccum = 0
            # Attempt: split the raw HTML on <h1>, then on <p> (this is the part I know is wrong)
            for data in bLineSubPageStr.split("<h1>"):
                if headAccum < 1:
                    accum = 0
                    for subData in data.split("<p>"):
                        if accum < 5:
                            try:
                                print(BeautifulSoup(subData, "html.parser").get_text())
                            except Exception as e:
                                print(e)
                            accum += 1
                    print()
                headAccum += 1
            bLineSubPage.close()
            print()

main()
Is this what you are looking for?
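A minimal sketch of one way to do it, not necessarily the only one: let Beautiful Soup find the first <h1> and then collect the <p> tags that come after it, instead of splitting the raw HTML string. The function name `paragraphs_after_heading` and the `sample` markup are made up for illustration, and the sketch assumes each event page marks its title with a single <h1>, which the question does not confirm.

```python
from bs4 import BeautifulSoup

def paragraphs_after_heading(html, count=3):
    """Return the text of the first `count` <p> tags that follow the first <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    if heading is None:
        # No header on this page, so there is nothing to anchor on
        return []
    # find_all_next walks the document in order starting after the heading,
    # so it also finds <p> tags nested in other elements, not only siblings.
    following = heading.find_all_next("p", limit=count)
    return [p.get_text(strip=True) for p in following]

# Hypothetical markup standing in for one event page
sample = """
<html><body>
<p>Navigation text before the heading</p>
<h1>Event title</h1>
<div><p>First paragraph</p></div>
<p>Second paragraph</p>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</body></html>
"""

print(paragraphs_after_heading(sample))
```

In the loop from the question, you would pass bLineSubPageStr to this function in place of the two nested split() loops. If the paragraphs you want are always direct siblings of the <h1>, heading.find_next_siblings("p", limit=count) is a stricter alternative.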