Extracting links from an HTML page

Published 2024-10-03 19:29:51


I'm trying to get all the Netflix movie/show links from http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html, along with their country names. E.g. from the page source I want http://www.netflix.com/WiMovie/80048948, USA, etc. I did the following, but it returns all the links rather than just the Netflix ones. I'm a bit new to regex. What should I do?

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html')
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    ##reqlink = re.search('netflix',link.get('href'))
    ##if reqlink:
    print link.get('href')

for link in soup.findAll('img'):
    if link.get('alt') == 'UK' or link.get('alt') == 'USA':
        print link.get('alt')  

If I uncomment the lines above, I get the following error:

TypeError: expected string or buffer

What should I do?

from BeautifulSoup import BeautifulSoup
import urllib2
import re
import requests

url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url, stream=True)
count = 1
title=[]
country=[]
for line in r.iter_lines():
    if count == 746:
        urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html')
        soup = BeautifulSoup(line)
        for link in soup.findAll('a', href=re.compile('netflix')):
            title.append(link.get('href'))

        for link in soup.findAll('img'):
            print link.get('alt')
            country.append(link.get('alt'))

    count = count + 1

print len(title), len(country)  

The previous error has been handled. The only remaining problem is movies that appear under multiple countries. How do I get those grouped together?
E.g. for 10.0 Earthquake, link=http://www.netflix.com/WiMovie/80049286, country=UK, USA


3 Answers

Your code can be simplified to a couple of selects:

import requests
from bs4 import BeautifulSoup

url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")  # explicit parser avoids bs4's guessed-parser warning

for a in soup.select("a[href*=netflix]"):
    print(a["href"])

For the imgs:

co = {"UK", "USA"}
for img in soup.select("img[alt]"):
    if img["alt"] in co:
        print(img)

I think you can iterate over the listing rows more easily and use a generator to assemble the data structure you need (ignore the minor differences in the code; I'm using Python 3):

from bs4 import BeautifulSoup
import requests

url = 'http://netflixukvsusa.netflixable.com/2016/07/' \
      'complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')  # explicit parser avoids bs4's guessed-parser warning
rows = soup.select('span[class="listings"] tr')


def get_movie_info(rows):
    netflix_url_prefix = 'http://www.netflix.com/'
    for row in rows:
        link = row.find('a',
                        href=lambda href: href and netflix_url_prefix in href)
        if link is not None:
            link = link['href']
        countries = [img['alt'] for img in row('img', class_='flag')]
        yield link, countries


print('\n'.join(map(str, get_movie_info(rows))))
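For the 10.0 Earthquake example from the question, the yielded tuple should come out looking like:

('http://www.netflix.com/WiMovie/80049286', ['UK', 'USA'])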

Edit: or if it's a dict rather than a list you're after:

def get_movie_info(rows):
    output = {}
    netflix_url_prefix = 'http://www.netflix.com/'
    for row in rows:
        name = None  # avoid a NameError on rows without a netflix link
        link = row.find('a',
                        href=lambda href: href and netflix_url_prefix in href)
        if link is not None:
            name = link.text
            link = link['href']
        countries = [img['alt'] for img in row('img', class_='flag')]
        output[name or 'some_default'] = {'link': link, 'countries': countries}
    return output


print('\n'.join(map(str, get_movie_info(rows).items())))
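In the dict form the same example entry would appear as (assuming the link text carries the movie title, which is not shown in the question):

'10.0 Earthquake': {'link': 'http://www.netflix.com/WiMovie/80049286', 'countries': ['UK', 'USA']}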

As for the first question: it fails on links that have no href value, so you're getting None rather than a string.

The following works:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/'
                            '07/complete-alphabetical-list-k-sat-jul-9.html')
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    link_href = link.get('href')
    if link_href:  
        reqlink = re.search('netflix',link_href)       
        if reqlink:
            print link_href       

for link in soup.findAll('img'):
    if link.get('alt') == 'UK' or link.get('alt') == 'USA':
        print link.get('alt')  

As for the second question, I'd recommend building a dictionary mapping each movie to the list of countries it appears in; that makes it much easier to format into the string you want.
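A minimal sketch of that suggestion, reusing the listings selector from the answer above (the printed output format is an assumption, not part of the original answers):

from bs4 import BeautifulSoup
import requests

url = 'http://netflixukvsusa.netflixable.com/2016/07/' \
      'complete-alphabetical-list-k-sat-jul-9.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

movie_countries = {}  # netflix link -> list of country names
for row in soup.select('span[class="listings"] tr'):
    link = row.find('a', href=lambda h: h and 'netflix' in h)
    if link is None:
        continue  # skip rows without a netflix link
    # every flag img in the same row belongs to this movie
    countries = [img['alt'] for img in row.find_all('img', alt=True)]
    movie_countries[link['href']] = countries

for link, countries in movie_countries.items():
    print(link + ' ' + ','.join(countries))

For the example from the question, that should print the link and its countries on one line, e.g. http://www.netflix.com/WiMovie/80049286 UK,USA.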
