Python: parsing HTML

item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]

2 answers

User

Answer 1 · edited 2024-06-14 11:20:41

# Python 2 (urllib2 does not exist in Python 3)
from bs4 import BeautifulSoup
import urllib2

# Download the listing page and parse it
f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)


In [25]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.text
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
    ...:     
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...

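Or, to get exactly the output you asked for, strip the "Quality:" label from the quality text so that only "720p" remains.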
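To match the requested output, the "Quality:" label has to be removed from the second field. Below is a minimal sketch in Python 3, assuming the quality text always has the form shown in the printed output above. `strip_label` and `make_item` are hypothetical helper names, not part of the original answer.

```python
def strip_label(text, label="Quality:"):
    """Return text without a leading label, e.g. 'Quality: 720p' -> '720p'."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return text.strip()


def make_item(name, quality_text, link):
    # Build one [name, quality, link] row in the shape the question asks for.
    return [name.strip(), strip_label(quality_text), link]


item_1 = make_item(
    "James Bond: Casino Royale (2006)",
    "Quality: 720p",
    "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent",
)
print(item_1)
```

In the loop from this answer, the same effect would come from passing `x.parent.text` through `strip_label` before building the list.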

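User

Answer 2 · edited 2024-06-14 11:20:41

As you requested, I have pasted a simple example of a parser. As you can see, it uses lxml. With lxml there are two ways to work with the DOM tree: XPath and CSS selectors. I prefer XPath.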

# Python 2 (urllib.urlopen and the print statement do not exist in Python 3)
import lxml.html
import urllib

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    category = None  # set once the first category anchor is seen
    tr = []
    for el in main_div.getchildren():
        # A child containing an <a> whose name includes 'tn' starts a new category
        names = el.xpath("descendant::a[contains(@name,'tn')]/text()")
        if names:
            category = names[0]
            main[category] = ''
            tr = []
        else:
            # Otherwise collect the rows whose serialized markup contains &#8212
            for element in el.getchildren():
                if '&#8212' in lxml.html.tostring(element):
                    tr.append(element)
                    print category, tr

parse()

LXML official site
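The grouping logic in the parser above (a category anchor opens a group, and later rows containing an em dash are collected into it) can be tried without network access. The following is a sketch in Python 3, not the answerer's code: it swaps lxml for the standard library's `xml.etree.ElementTree`, and the sample markup and the name `group_rows` are invented for illustration, since the real page layout is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for the real page.
SAMPLE = """
<div id="line">
  <div><a name="tn1">Football</a></div>
  <table>
    <tr><td>Team A \u2014 Team B</td></tr>
    <tr><td>header row</td></tr>
  </table>
  <div><a name="tn2">Hockey</a></div>
  <table>
    <tr><td>Team C \u2014 Team D</td></tr>
  </table>
</div>
"""


def group_rows(markup):
    """Map each category name to the text of its rows that contain an em dash."""
    root = ET.fromstring(markup.strip())
    groups = {}
    current = None
    for child in root:
        # A child holding an <a> whose name contains 'tn' opens a new category.
        anchor = next(
            (a for a in child.iter("a") if "tn" in a.get("name", "")), None
        )
        if anchor is not None:
            current = (anchor.text or "").strip()
            groups[current] = []
        elif current is not None:
            # Otherwise keep only the rows whose text contains an em dash.
            for row in child:
                text = "".join(row.itertext()).strip()
                if "\u2014" in text:
                    groups[current].append(text)
    return groups


result = group_rows(SAMPLE)
print(result)
```

Unlike the original, this version stores the collected rows in the dictionary and skips rows that appear before any category, so nothing depends on an undefined `category`.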
