Finding the <p> tags that follow an <h1> tag with a Python HTML parser

Posted 2024-04-27 05:30:12


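I'm trying to parse a series of web pages and grab only the 3 paragraphs that come right after each page's heading. They all share the same format (I think). I'm using urllib2 and Beautiful Soup, but I'm not sure how to jump to the heading and then grab the next few <p> tags. I know the first split("<h1>") isn't correct, but it's the only decent attempt I've managed so far. Here is my code: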

from __future__ import print_function  # so print() emits a blank line on Python 2 instead of "()"
from bs4 import BeautifulSoup
import urllib2
from HTMLParser import HTMLParser

BANNED = ["/events/new"]

def main():

    soup = BeautifulSoup(urllib2.urlopen('http://b-line.binghamton.edu').read())

    for link in soup.find_all('a'):
        link = link.get('href')
        if link is not None and link not in BANNED and "/events/" in link:
            print()
            print(link)          
            eventPage = "http://b-line.binghamton.edu" + link
            bLineSubPage = urllib2.urlopen(eventPage)   
            bLineSubPageStr = bLineSubPage.read()
            headAccum = 0  
            for data in bLineSubPageStr.split("<h1>"):
                if(headAccum < 1):
                    accum = 0 
                    for subData in data.split("<p>"):
                        if(accum < 5):
                            try:
                                print(BeautifulSoup(subData).get_text())
                            except Exception as e:
                                print(e) 
                            accum+=1
                    print()
                headAccum += 1           
            bLineSubPage.close()         
            print()

main()


1 Answer

Answer #1 · Posted 2024-04-27 05:30:12

>>> import bs4, urllib2
>>> page_txt = urllib2.urlopen("http://b-line.binghamton.edu/events/9305").read()
>>> soup = bs4.BeautifulSoup(page_txt.split("<h1>", 1)[-1])  # parse only what follows the first <h1>
>>> print soup.find_all("p")[:3]

Is this what you're looking for?
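
If splitting the raw string feels fragile (for example when the heading is written as <h1 class="...">), the same idea can be sketched with a small parser that starts collecting <p> text only after it has seen an <h1>. This is a minimal sketch rather than a definitive solution. It assumes Python 3, where the HTMLParser module from the question is named html.parser, and the names ParagraphsAfterH1, paragraphs_after_h1, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the Python 2 HTMLParser module


class ParagraphsAfterH1(HTMLParser):
    """Collect the text of the first `limit` <p> elements after the first <h1>."""

    def __init__(self, limit=3):
        super().__init__()
        self.limit = limit
        self.seen_h1 = False   # becomes True once an <h1> start tag is seen
        self.in_p = False      # True while inside a <p> that is being collected
        self.paragraphs = []   # collected paragraph texts

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.seen_h1 = True
        elif tag == "p":
            # Collect only after the heading, and only while fewer than `limit` are stored.
            self.in_p = self.seen_h1 and len(self.paragraphs) < self.limit
            if self.in_p:
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def paragraphs_after_h1(html, limit=3):
    """Return the stripped text of up to `limit` paragraphs following the first <h1>."""
    parser = ParagraphsAfterH1(limit)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.paragraphs]


# Hypothetical markup standing in for one of the event pages.
sample = """
<html><body>
<p>Navigation blurb</p>
<h1>Event title</h1>
<p>First <b>paragraph</b>.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
<p>Fourth paragraph.</p>
</body></html>
"""

print(paragraphs_after_h1(sample))
```

With Beautiful Soup already in use, something like soup.find("h1").find_all_next("p", limit=3) should express the same thing more compactly, provided the page actually contains an <h1>.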
