Scraping multiple pages with BeautifulSoup 3

Posted 2024-06-30 12:58:41

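I want to scrape several pages by following specific links. For example, I want to be able to choose which link to follow and how many iterations to run. The link scraped from the initial input must either be appended to the user input or replace it. I have: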

import urllib
from BeautifulSoup import BeautifulSoup

#url = raw_input('Enter - ')
url = 'http://www.columbia.edu/kermit/k95.html'
itr = raw_input('Enter iteration: ')
i = int(itr)

n = raw_input('Enter Number: ')
n = int(n)

# The page is fetched and parsed only once, before the loop
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')

print 'Link:' , url
while i > 0:
    i = i - 1
    if i == 0:
        break
    for tag in tags:
        me = tag.get('href', None)
        # Just to make sure the link/content match
        #print tag.contents[0]
        link = tags[(n - 1)]
        #print link
    links = link.get('href', None)
    print 'Link:', links

Enter - http://www.columbia.edu/~fdc/
Enter count: 4
Enter Position: 9
Link: http://www.columbia.edu/~fdc/
Link: http://www.columbia.edu/kermit/k95.html
Link: http://www.columbia.edu/kermit/k95.html (Should be k95faq.html)
Link: http://www.columbia.edu/kermit/k95.html (Should be ckfaq.html)

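I get the number of iterations and the specific link I want, but I need the first url (the user input) to be replaced, on each iteration, by the link stored in the variable "links".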

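For example, the user enters a url such as http://www.columbia.edu/~fdc/ and follows the 9th link on the page, 4 times. The first iteration returns http://www.columbia.edu/kermit/k95.html as "links". I want the second iteration to give me the 9th link on the page that "links" points to, which should be k95faq.html.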
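
One way to get the behaviour described above is to fetch and parse the page inside the loop and to overwrite the url with the chosen link at the end of each pass. The following is only a minimal sketch under some assumptions: it is written for Python 3 and uses the standard library's html.parser in place of BeautifulSoup 3 (so it is a swapped-in parser, not the code from the question), and the names LinkCollector, fetch, follow_links and fetch_page are hypothetical. Relative hrefs are resolved against the current page with urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order.

    Anchors without an href are stored as None so that positions are
    counted over all <a> tags, as soup('a') does in the question.
    """

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def follow_links(start_url, count, position, fetch_page=fetch):
    """Return start_url followed by the URL reached on each later pass.

    count is the total number of URLs wanted (including start_url) and
    position is the 1-based index of the link to follow on every page.
    """
    visited = [start_url]
    url = start_url
    for _ in range(count - 1):
        parser = LinkCollector()
        parser.feed(fetch_page(url))      # re-fetch and re-parse the current page
        if position > len(parser.hrefs):
            break                         # the page has fewer links than requested
        href = parser.hrefs[position - 1]
        if href is None:
            break                         # the chosen <a> tag has no href
        url = urljoin(url, href)          # replace url with the chosen link
        visited.append(url)
    return visited
```

Calling follow_links('http://www.columbia.edu/~fdc/', 4, 9) and printing each item with a 'Link:' prefix would mirror the session shown in the question. In the original Python 2 code the same idea amounts to moving the urllib.urlopen, BeautifulSoup and soup('a') calls inside the while loop and assigning url = links after each pass.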

