Parsing poorly structured HTML?

Published 2024-09-23 06:25:54


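What I want to do is write a Python function that takes the year, month, and day of the podcast I want to download, parses the HTML, and returns the links to that day's podcast. For example: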

>>> get_download_links(year, month, day)
['https://www.tytnetwork.com/?tytpm=44279&type=audio', # Hr 1 (audio)
 'https://www.tytnetwork.com/?tytpm=44277&type=audio'] # Hr 2 (audio)

The page I am trying to parse is http://www.tytnetwork.com/annual-archives/2014-main-show-archives/
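
The year looks like the only part of that address that changes, and the calendar labels each month with a header such as "June 2014". A minimal sketch of turning numeric (year, month) arguments into those two strings, assuming the URL pattern holds for years other than 2014 (only 2014 is confirmed here); `archive_url` and `month_header` are hypothetical helper names:

```python
import calendar

# Assumption: other years follow the same pattern as the 2014 archive page.
ARCHIVE_URL = "http://www.tytnetwork.com/annual-archives/{year}-main-show-archives/"


def archive_url(year):
    """Build the archive page address for a given year."""
    return ARCHIVE_URL.format(year=year)


def month_header(year, month):
    """Return the calendar header text for a numeric month, e.g. 'June 2014'.

    calendar.month_name follows the current locale; the default C locale
    gives English month names, which is what the page uses.
    """
    return "%s %d" % (calendar.month_name[month], year)


print(archive_url(2014))
print(month_header(2014, 6))
```

The header string can then be used to locate the right month block before looking for the day.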

Here is a sample covering the first week of a month (including the weekday labels):

<tr>
           <th class="tytca-mosname" colspan="5">
            <h3>
             June 2014
            </h3>
           </th>
          </tr>
          <tr>
           <th class="tytca-dayname">
            <h3>
             Mon
            </h3>
           </th>
           <th class="tytca-dayname">
            <h3>
             Tue
            </h3>
           </th>
           <th class="tytca-dayname">
            <h3>
             Wed
            </h3>
           </th>
           <th class="tytca-dayname">
            <h3>
             Thu
            </h3>
           </th>
           <th class="tytca-dayname">
            <h3>
             Fri
            </h3>
           </th>
          </tr>
          <tr>
           <td class="tytca-td">
            <div class="tytca-daynum">
             2
            </div>
            <p>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=audio" title="Click to download audio file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42851&amp;type=audio" title="Click to download audio file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=video" title="Click to download video file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42851&amp;type=video" title="Click to download video file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-1/" title="Click to watch the video">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-2/" title="Click to watch the video">
              Hr 2
             </a>
            </p>
           </td>
           <td class="tytca-td">
            <div class="tytca-daynum">
             3
            </div>
            <p>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43325&amp;type=audio" title="Click to download audio file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43324&amp;type=audio" title="Click to download audio file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43325&amp;type=video" title="Click to download video file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43324&amp;type=video" title="Click to download video file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-1/" title="Click to watch the video">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-2/" title="Click to watch the video">
              Hr 2
             </a>
            </p>
           </td>
           <td class="tytca-td">
            <div class="tytca-daynum">
             4
            </div>
            <p>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43635&amp;type=audio" title="Click to download audio file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43633&amp;type=audio" title="Click to download audio file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43635&amp;type=video" title="Click to download video file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43633&amp;type=video" title="Click to download video file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-1/" title="Click to watch the video">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-2/" title="Click to watch the video">
              Hr 2
             </a>
            </p>
           </td>
           <td class="tytca-td">
            <div class="tytca-daynum">
             5
            </div>
            <p>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=audio" title="Click to download audio file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44044&amp;type=audio" title="Click to download audio file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=video" title="Click to download video file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44044&amp;type=video" title="Click to download video file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-1/" title="Click to watch the video">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-2/" title="Click to watch the video">
              Hr 2
             </a>
            </p>
           </td>
           <td class="tytca-td">
            <div class="tytca-daynum">
             6
            </div>
            <p>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=audio" title="Click to download audio file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=audio" title="Click to download audio file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=video" title="Click to download video file">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=video" title="Click to download video file">
              Hr 2
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-1/" title="Click to watch the video">
              Hr 1
             </a>
             <br/>
             <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-2/" title="Click to watch the video">
              Hr 2
             </a>
            </p>
           </td>
          </tr>

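I tried using Beautiful Soup, but the page is so poorly structured that I cannot see a way to get what I want.

At this point, I am handing this over to the Python experts here for help.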


1 answer
User
Answer 1 · posted 2024-09-23 06:25:54
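import re

import bs4
import requests

url = "http://www.tytnetwork.com/annual-archives/{year}-main-show-archives/"


def getPodCasts(m, d, y):
    """Return the audio download links for month name m, day d, year y."""
    my_url = url.format(year=y)
    print(my_url)
    page = requests.get(my_url, headers={"User-agent": "Mozilla/5.0"}).content
    soup = bs4.BeautifulSoup(page, "html.parser")
    # The month header text (e.g. "June 2014") sits in <h3> inside <th> inside <tr>.
    header = soup.find(string=re.compile(r"^\s*%s\s+%s" % (m, y)))
    assert header is not None, "Error Month %s %s Not Found" % (m, y)
    # Walk the rows after the header and stop when the next month begins.
    for row in header.find_parent("tr").find_next_siblings("tr"):
        if row.find("th", class_="tytca-mosname"):
            break
        for cell in row.find_all("td"):
            daynum = cell.find("div", class_="tytca-daynum")
            # Strip a leading zero so both "2" and "02" match day 2.
            if daynum is not None and daynum.get_text(strip=True).lstrip("0") == str(d):
                # href values come back unescaped ("&", not "&amp;").
                return [a["href"] for a in cell.find_all("a", class_="tytca-audio")]
    raise ValueError("Error Date %s/%s/%s Not Found" % (m, d, y))


print(getPodCasts("June", 12, 2014))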
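
The sample markup in the question is regular enough to match on class names rather than on text position: each day is a `td` holding a `div` with class `tytca-daynum` and a set of `a` links with class `tytca-audio`. Below is a minimal offline sketch of that idea using only the standard library's `html.parser`, run against a trimmed copy of two cells from the sample. `AudioLinkParser` and `audio_links_for_day` are hypothetical names, and the sketch ignores the month header, so a full version would first narrow the input to the rows of the right month.

```python
from html.parser import HTMLParser

# Trimmed copy of the June 5 and June 6 cells from the question's sample.
SAMPLE = """
<tr>
 <td class="tytca-td">
  <div class="tytca-daynum"> 5 </div>
  <p>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=audio">Hr 1</a>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=video">Hr 1</a>
  </p>
 </td>
 <td class="tytca-td">
  <div class="tytca-daynum"> 6 </div>
  <p>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=audio">Hr 1</a>
   <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=audio">Hr 2</a>
   <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=video">Hr 1</a>
  </p>
 </td>
</tr>
"""


class AudioLinkParser(HTMLParser):
    """Collect tytca-audio hrefs, grouped by the day number of their cell."""

    def __init__(self):
        super().__init__()
        self.links = {}          # day number -> list of audio URLs
        self.in_daynum = False   # True while inside <div class="tytca-daynum">
        self.current_day = None  # day number of the cell being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "td":
            self.current_day = None  # a new cell starts; forget the old day
        elif tag == "div" and "tytca-daynum" in classes:
            self.in_daynum = True
        elif tag == "a" and "tytca-audio" in classes:
            href = attrs.get("href")
            if href and self.current_day is not None:
                # HTMLParser has already unescaped &amp; to & in attribute values.
                self.links.setdefault(self.current_day, []).append(href)

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_daynum = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_daynum and text.isdigit():
            self.current_day = int(text)


def audio_links_for_day(html, day):
    """Return the audio links found in the cell whose day number equals day."""
    parser = AudioLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links.get(day, [])


print(audio_links_for_day(SAMPLE, 6))
```

Matching on the `tytca-audio` class keeps the video and watch links out of the result, and reading the `href` attribute through a parser avoids the `&amp;` escapes that a regular expression over the raw markup would pick up.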
