I need a regex in Python that matches the href attribute of mp3 file URLs

User

#1 · edited 2024-10-01 13:36:09

I always recommend using an HTML parser to extract data from HTML files rather than regular expressions:

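import lxml.html

# htmlcode holds the HTML document as a string
tree = lxml.html.fromstring(htmlcode)
for link in tree.findall(".//a"):
    url = link.get("href")
    # skip <a> tags without an href (get() returns None for them)
    if url and url.endswith(".mp3"):
        print(url)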

User

#2 · edited 2024-10-01 13:36:09

First of all, yes, you should use an HTML parser. Here is some example code using the HTMLParser class that ships with Python:

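from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class ImgSrcHTMLParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.srcs = []  # src values collected so far

  def handle_starttag(self, tag, attrs):
    # attrs is a list of (name, value) pairs
    if tag == 'img':
      self.srcs.append(dict(attrs).get('src'))

parser = ImgSrcHTMLParser()
parser.feed(html)  # html holds the document as a string
for src in parser.srcs:
  print(src)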

This collects the src of every img tag. If what you want is the href of 'a' tags that end in .mp3, it should be easy to adapt to your purposes.
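One way that adaptation might look is sketched below. The class name Mp3LinkParser and the sample markup are made up for illustration, and the case-insensitive suffix check is an assumption rather than something the answer specifies.

```python
from html.parser import HTMLParser


class Mp3LinkParser(HTMLParser):
    """Collect href values of <a> tags whose URL ends in .mp3 (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchors are of interest; other tags such as <img> are ignored.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # href is None when the anchor has no href attribute.
        # Assumption: treat ".MP3" the same as ".mp3".
        if href and href.lower().endswith(".mp3"):
            self.links.append(href)


# Made-up sample input, not taken from the question.
sample_html = """
<a href="http://baz/quux.mp3">song</a>
<a href="http://foo/mp3icon.gif">icon</a>
<a name="no-href">anchor</a>
<img src="http://foo/cover.mp3">
"""

parser = Mp3LinkParser()
parser.feed(sample_html)
print(parser.links)  # ['http://baz/quux.mp3']
```

Only the first anchor is kept: the second does not end in .mp3, the third has no href, and the img tag is not an anchor.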

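Assuming you really do want to use a regex, yours has a few problems. You do not delimit the URL, and you use a dot inside the URL. The worst side effect is that a non-mp3 URL followed by an mp3 URL will be treated as one long URL, for example "http://foo/bar.gif snakehead http://baz/quux.mp3". You probably want something that excludes whitespace and characters that are not allowed in URLs. Also, you forgot to escape the dot in ".mp3", so "http://foo/mp3icon.gif" will match as "http://foo/mp3".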
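The pitfalls described above can be reproduced in a few lines. The asker's actual regex is not shown in this thread, so the "loose" pattern below is a hypothetical stand-in with the same two flaws, and the "strict" pattern is only one possible tightening, not a general-purpose HTML matcher.

```python
import re

# Hypothetical pattern with both flaws: an unbounded ".*" and an unescaped dot.
loose = re.compile(r'http://.*.mp3')

# Flaw 1: two separate URLs are swallowed as one long match.
merged = loose.findall('http://foo/bar.gif snakehead http://baz/quux.mp3')
print(merged)  # ['http://foo/bar.gif snakehead http://baz/quux.mp3']

# Flaw 2: the unescaped dot matches "/", so a .gif URL yields a bogus match.
bogus = loose.findall('http://foo/mp3icon.gif')
print(bogus)  # ['http://foo/mp3']

# Tighter sketch (an assumption, not the asker's pattern): anchor on the href
# attribute, forbid quotes, whitespace and angle brackets inside the URL,
# and escape the dot in ".mp3".
strict = re.compile(r'''href=["']([^"'\s<>]+\.mp3)["']''')

# Made-up sample markup for illustration.
sample = (
    '<a href="http://foo/bar.gif">x</a> '
    '<a href="http://baz/quux.mp3">y</a> '
    '<a href="http://foo/mp3icon.gif">z</a>'
)
mp3_urls = strict.findall(sample)
print(mp3_urls)  # ['http://baz/quux.mp3']
```

Even the stricter pattern misses unquoted attribute values and commented-out markup, which is why the answers here still favor a parser.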

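User

#3 · edited 2024-10-01 13:36:09

As the other answers have pointed out, using regular expressions to parse HTML is a bad, bad idea.

With that in mind, I'll add code for my favorite parser, BeautifulSoup: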

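from bs4 import BeautifulSoup  # BeautifulSoup 3: from BeautifulSoup import BeautifulSoup

# htmlcode holds the HTML document as a string
soup = BeautifulSoup(htmlcode, 'html.parser')
links = soup.find_all('a', href=True)  # only <a> tags that have an href
mp3s = [l for l in links if l['href'].endswith('.mp3')]
for song in mp3s:
    print(song['href'])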
