Python if statement based on the content of an HTML title tag

import urllib2

opp1 = 1
oppn = 2

for opp in range(opp1, oppn + 1):
    oppurl = (something.com)
    response = urllib2.urlopen(oppurl)
    html = response.read()
    # syntax error on the next line #
    if Title == 'Record doesn't exist':
        continue
    else:
        oppfilename = 'work/opptest' + str(opp) + '.htm'
        oppfile = open(oppfilename, 'w')
        opp.write(opphtml)
        print 'Wrote ', oppfile
        votefile.close()

2 Answers

User

Answer 1 · edited 2024-05-06 16:55:41

Try Beautiful Soup. It is a very easy-to-use library for parsing HTML documents and fragments.

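import urllib2
from BeautifulSoup import BeautifulSoup

# opp1 and oppn are defined as in the question
for opp in range(opp1, oppn + 1):
    oppurl = 'http://www.myhomepage.com'  # placeholder URL
    response = urllib2.urlopen(oppurl)
    html = response.read()

    soup = BeautifulSoup(html)

    # Compare the tag's text (.string); the Tag object itself never equals a str
    if soup.head.title.string == "Record doesn't exist":
        continue
    else:
        oppfilename = 'work/opptest' + str(opp) + '.htm'
        oppfile = open(oppfilename, 'w')
        oppfile.write(html)  # was opp.write(opphtml): opp is an int, opphtml is undefined
        print 'Wrote ', oppfilename
        oppfile.close()  # was votefile.close(): votefile is never defined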

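Edit

If Beautiful Soup isn't an option, I would personally resort to regular expressions. However, I refuse to admit that in public, because I won't let people know I would stoop to the easy solution. Let's see what's in that "batteries included" bag.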

One standard-library module (its name is missing from this copy) looks promising; let's see if we can bend it to our will.

[code block missing from this copy]
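As a stand-in for this step, here is a minimal sketch that assumes the module meant is the standard library's event-driven HTML parser (`HTMLParser` in Python 2, `html.parser` in Python 3, which this sketch uses). That choice is an assumption, and `TitleExtractor` and `extract_title` are hypothetical names, not the answerer's code.

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is named HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self._in_title = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> matches as well.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


html = "<html><head><title>Record doesn't exist</title></head><body></body></html>"
print(extract_title(html))
```

The class needs three callbacks and some state just to pull out one string, which is the kind of verbosity the next remark pokes fun at.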

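That was painful. Almost as verbose as Java. (Just kidding.)

What else is there? There's xml.dom.minidom, a "lightweight DOM implementation". I like the sound of "lightweight"; it means we can do it in a single line of code, right?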

import xml.dom.minidom
html = '<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>'

title = ''.join(node.data for node in xml.dom.minidom.parseString(html).getElementsByTagName("title")[0].childNodes if node.nodeType == node.TEXT_NODE)

>>> print title
Test

And we have a one-liner!
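One caveat with this one-liner: `xml.dom.minidom` is an XML parser, so it only accepts well-formed markup, and real-world HTML often is not. Below is a small guarded sketch, written for Python 3. `title_via_minidom` is a hypothetical helper name, and returning `None` on failure is an assumption about what the caller wants.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def title_via_minidom(html):
    """Return the <title> text, or None if the markup is not well-formed XML."""
    try:
        doc = xml.dom.minidom.parseString(html)
    except ExpatError:
        return None
    titles = doc.getElementsByTagName('title')
    if not titles:
        return None
    return ''.join(node.data for node in titles[0].childNodes
                   if node.nodeType == node.TEXT_NODE)


good = '<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>'
bad = '<html><head><title>Test</title></head><body><br></body></html>'  # unclosed <br>

print(title_via_minidom(good))
print(title_via_minidom(bad))
```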

So I hear that these regular expression things are pretty efficient at extracting text from HTML. I think you should use those.

User

Answer 2 · edited 2024-05-06 16:55:41

You can use a regular expression to get the contents of the title tag:

import re

m = re.search('<title>(.*?)</title>', html)
if m:
    title = m.group(1)
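To tie this back to the question, here is a sketch of the regex feeding the `if` test. The syntax error in the question comes from the apostrophe in `doesn't` ending the single-quoted string early; wrapping the string in double quotes avoids it. `get_title` and `should_skip` are hypothetical names, and the `re.IGNORECASE | re.DOTALL` flags and the `[^>]*` allowance for attributes are additions that go beyond the pattern shown in this answer.

```python
import re

# Allow attributes on the tag, any letter case, and titles that span lines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def get_title(html):
    """Return the stripped <title> text, or None when no title tag is found."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None


def should_skip(html):
    # Double quotes let the apostrophe in "doesn't" sit inside the string.
    return get_title(html) == "Record doesn't exist"


missing = "<html><head><title>Record doesn't exist</title></head></html>"
found = "<html><head><TITLE>\n  Opportunity 1\n</TITLE></head></html>"

print(should_skip(missing))
print(should_skip(found))
```

In the question's loop, `if should_skip(html): continue` would then take the place of the line that failed to parse.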

