Generating URLs in Python?

import urllib2

def get_page(page):
    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote]  # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'
print_all_links(get_page(reddit_url))

2 Answers

User

Answer 1 · edited 2024-09-28 22:26:18

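Rawing is right, but when I am faced with an XY problem, I prefer to give the best way to accomplish X rather than a way to fix Y. You should use an HTML parser such as BeautifulSoup to parse the web page: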

from bs4 import BeautifulSoup
import urllib2

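def print_all_links(page):
    html = urllib2.urlopen(page).read()
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, 'html.parser')
    # Match <a> tags whose class attribute is "title may-blank" and that carry an href.
    for a in soup.find_all('a', class_='title may-blank', href=True):
        print(a['href'])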

If you are really allergic to HTML parsers, at least use a regex (even though you should stick with HTML parsing):

import urllib2
import re

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'<a class="title may-blank " href="(.*?)"', html):
        print(href)
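If installing a third-party parser is not an option, the same parser-based idea can be sketched with the standard library alone. The following is a minimal sketch, not the approach used above: it assumes Python 3 (`html.parser` instead of the Python 2 modules in this thread), and the names `TitleLinkParser` and `extract_links` are hypothetical helpers introduced only for illustration.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry both target classes."""

    # Classes taken from the markup discussed in this thread.
    WANTED = {'title', 'may-blank'}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        # Attribute values can be None; split() also absorbs stray whitespace.
        classes = set((attr.get('class') or '').split())
        if self.WANTED <= classes and attr.get('href'):
            self.links.append(attr['href'])


def extract_links(html):
    """Return the href of every matching link in an HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links
```

Because the classes are compared as a set, this does not depend on attribute order, on extra classes, or on a trailing space inside the `class` attribute, which is what the regex version is sensitive to.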

User

Answer 2 · edited 2024-09-28 22:26:18

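That's because

start_quote = page.find('"', start_link + 4)

doesn't do what you think it does. start_link is set to the index of 'title may-blank'. So if you call page.find starting at start_link + 4, you actually begin searching at 'e may-blank', and the first quote found is the one that closes the class attribute. If you change

start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

it will work.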
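The `+ 1` offset only works if the marker is immediately followed by the quote that closes the `class` attribute. If the markup has a trailing space inside the attribute (as the regex in the other answer suggests), the search would land on that closing quote itself. Below is a minimal sketch that avoids hard-coding the offset by first locating the closing quote of `class` and then the quotes around the next attribute value. It assumes `href` is the attribute that directly follows `class`; `SAMPLE` is made-up markup and `all_links` is a hypothetical helper that returns the links instead of printing them.

```python
# Hypothetical sample markup; the real page is much larger.
SAMPLE = (
    '<a class="title may-blank " href="http://example.com/a">First</a>'
    '<a class="title may-blank " href="http://example.com/b">Second</a>'
)
MARKER = 'title may-blank'


def get_next_target(page):
    """Return (url, end_index) for the next marked link, or (None, 0)."""
    start_link = page.find(MARKER)
    if start_link == -1:  # no more marked links
        return None, 0
    # First quote at or after the end of the marker closes the class attribute.
    close_class = page.find('"', start_link + len(MARKER))
    # Assumption: the next quoted value belongs to href.
    start_quote = page.find('"', close_class + 1)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def all_links(page):
    """Collect every marked link by repeatedly slicing past the last match."""
    links = []
    while True:
        url, end = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[end:]
    return links


print(all_links(SAMPLE))
```

Checking `find` for `-1` also gives the loop a clean exit condition once no marker is left, rather than relying on whatever slice comes back when the search fails.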
