Scrapy or BeautifulSoup for scraping links and text from different websites

Published 2024-09-29 17:16:58


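I am trying to scrape links from a URL that the user enters, but it only works for one URL (http://www.businessinsider.com). How can it be adapted to scrape from any URL that is entered? I am using BeautifulSoup, but would Scrapy be better suited for this?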

def WebScrape():  
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")

    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title=soup.findAll('a',{'class':'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href']+","+eachtitle.string)
    else:
        print('Please enter another Website...')

2 Answers

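You can write a more general scraper that searches all tags and collects every link inside them. Once you have the full list of links, you can use a regular expression (or something similar) to pick out the links that match the structure you want.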

import re

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.businessinsider.com')

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.content, 'html.parser')

# find every tag that carries a non-empty href attribute
# (searching the whole document once avoids collecting duplicates from nested tags)
links = [tag['href'] for tag in soup.find_all(href=True) if tag['href']]

# example: keep only careerbuilder links
# (a list comprehension is used because map() and filter() are lazy in Python 3)
careerbuilder_links = [x for x in links if re.search(r'w{3}\.careerbuilder\.com', x)]
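
The filtering step can also be kept separate from the download, which makes it easier to reuse for whatever URL is entered. The following is only a minimal sketch: `filter_links` and its parameters are hypothetical names, not part of the answer above. It assumes you already have a list of raw href strings, resolves relative links against the page URL with `urllib.parse.urljoin`, drops duplicates, and keeps the links that match a regular expression.

```python
import re
from urllib.parse import urljoin


def filter_links(hrefs, base_url, pattern):
    """Return absolute, de-duplicated links whose URL matches `pattern`.

    Hypothetical helper: `hrefs` is any iterable of raw href strings,
    for example the `links` list collected with BeautifulSoup.
    """
    regex = re.compile(pattern)
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # turns "/jobs" into a full URL
        if absolute not in seen and regex.search(absolute):
            seen.add(absolute)
            result.append(absolute)
    return result


# made-up sample data, only to show the call
sample = ['/jobs', 'http://www.careerbuilder.com/a', '/jobs', 'http://example.com/']
matches = filter_links(sample, 'http://www.careerbuilder.com', r'careerbuilder\.com')
print(matches)
```

Because the pattern and base URL are arguments, the same function should work for any site rather than only the one hard-coded in the request.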

Code:

import urllib.request

import bs4


def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")

    # collect (href, text) pairs for every <a class="title"> tag
    title_tags = soup.find_all('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]

    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)

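
One likely reason the code only works for a single site is the `{'class': 'title'}` filter, since other pages will not necessarily use that class on their links. The sketch below illustrates collecting every anchor that has an `href`, whatever its class. To keep it dependency-free it uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup (with BeautifulSoup, `soup.find_all('a', href=True)` would play the same role). `LinkCollector`, `extract_links`, and the sample HTML are hypothetical names and data, not taken from the answers.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # only keep text that appears inside an open <a href=...>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# made-up HTML, only to show the call
sample_html = (
    '<p><a class="title" href="/a">First</a> '
    '<a href="http://example.com/b">Second <b>link</b></a>'
    '<a>no href</a></p>'
)
for href, text in extract_links(sample_html):
    print(href, text)
```

The HTML passed to `extract_links` could come from `urllib.request.urlopen(url).read().decode()` for any URL the user enters; whether BeautifulSoup or Scrapy is the better fit beyond that depends on how much crawling is needed.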
