Python code works only for the title tag, not for tab

Posted 2024-09-30 16:34:30

With the regular expression <title>(.+?)</title> the code works, but when I change the tag to <table>(.+?)</table> it prints "[]" (empty square brackets) as the output. My code is:

import urllib
import re

urls = ["http://physics.iitd.ac.in/content/list-faculty-members", "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=ME","http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=CE"]
i = 0
regex = '<table>(.+?)</table>'
pattern = re.compile(regex)

while i< len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    tables  = re.findall(pattern,htmltext)

    print tables
    i+=1

1 answer
User
#1 · Posted 2024-09-30 16:34:30

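Use BeautifulSoup instead of a regular expression. An HTML parser finds <table> elements even when the opening tag carries attributes or the table spans several lines: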

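# Python 3; needs the third-party beautifulsoup4 package (pip install beautifulsoup4).
# On Python 2, urllib.urlopen takes the place of urllib.request.urlopen.
from urllib.request import urlopen

from bs4 import BeautifulSoup

urls = ["http://physics.iitd.ac.in/content/list-faculty-members",
        "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=ME",
        "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=CE"]

for url in urls:
    # Download the raw page
    htmltext = urlopen(url).read()
    # Parse it with the HTML parser bundled with Python
    soup = BeautifulSoup(htmltext, "html.parser")
    # Collect every <table> element, whatever its attributes or line breaks
    tables = soup.find_all("table")

    print(tables)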
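
As a rough illustration of why the pattern from the question can return an empty list, here is a small sketch that needs no network access. The markup in SAMPLE_HTML is made up for the example (the real pages were not inspected), so treat it as an assumption about what such pages typically look like: an opening <table> tag with attributes and content spread over several lines.

```python
import re

# Hypothetical sample markup, invented for this example.
SAMPLE_HTML = """<html>
<head><title>Faculty list</title></head>
<body>
<table class="views-table" border="0">
  <tr><td>Name</td><td>Room</td></tr>
</table>
</body>
</html>"""

# The title is a bare tag on a single line, so this pattern matches.
titles = re.findall(r'<title>(.+?)</title>', SAMPLE_HTML)

# This one fails for two reasons: the opening tag has attributes,
# and '.' does not match newline characters by default.
strict = re.findall(r'<table>(.+?)</table>', SAMPLE_HTML)

# Allow attributes in the opening tag and let '.' cross line breaks.
relaxed = re.findall(r'<table[^>]*>(.*?)</table>', SAMPLE_HTML, re.DOTALL)

print(titles)        # ['Faculty list']
print(strict)        # []
print(len(relaxed))  # 1
```

Even the relaxed pattern stops at the first </table> it meets, so it cuts nested tables short. That is one reason a parser such as BeautifulSoup is usually the more robust choice.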
