Parsing HTML in Enlive like BeautifulSoup

Published 2024-10-01 19:21:17


I'm trying to use Enlive to get links from HTML in Clojure. Can I get a list of all the links on a page? Can I iterate over them?

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
With BeautifulSoup, I can get every link like this:

links = soup('a')

How can I do this in Clojure with Enlive?


2 Answers

First you need to ingest some HTML using Enlive's html-resource function. We'll grab news.google.com:

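(require '[net.cgrand.enlive-html :as html])
;; Fetch a URL and parse it into Enlive's node representation.
(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))
(def goog-news (fetch-url "https://news.google.com"))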

To get all the <a> tags, use the select function with a simple selector (the second argument):

(html/select goog-news [:a])

This evaluates to a sequence of maps, one per <a> tag. Here is an example <a> tag map from today's news:

{:tag :a,
 :attrs {:class "nuEeue hzdq5d ME7ew",
         :target "_blank",
         :href "https://www.vanityfair.com/hollywood/2018/01/first-black-panther-reviews",
         :jsname "NV4Anc"},
 :content ("The First Black Panther Reviews Are Here—and They're Ecstatic")}

To get the inner text of each <a>, you can map Enlive's text function over the results, e.g. (map html/text *1). To get each href, you can (map (comp :href :attrs) *1).
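The two mapping steps above can be tried offline with something like the sketch below. It assumes Enlive is on the classpath and parses a small fragment, borrowed from the question's sample HTML, with html-snippet instead of fetching a live page. The names story-nodes, story-links, link-texts, and link-hrefs are made up for illustration.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Parse an HTML fragment from a string (no network access needed).
(def story-nodes
  (html/html-snippet
    (str "<p class=\"story\">"
         "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>"
         "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>"
         "</p>")))

;; Select every <a> node, just like (html/select goog-news [:a]).
(def story-links (html/select story-nodes [:a]))

;; Inner text of each link.
(defn link-texts [links]
  (map html/text links))

;; href attribute of each link.
(defn link-hrefs [links]
  (map (comp :href :attrs) links))

;; Iterate over the links and print "text -> href" for each one.
(doseq [[text href] (map vector (link-texts story-links) (link-hrefs story-links))]
  (println text "->" href))
```

Using named vars instead of *1 keeps the example usable outside a REPL session, since *1 only refers to the last value evaluated at the REPL.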

It's quite simple:

(require '[net.cgrand.enlive-html :as enlive])

(let [data (enlive/html-resource (java.net.URL. "https://www.stackoverflow.com"))
      all-refs (enlive/select data [:a])]
  (first all-refs))

;;=> {:tag :a, :attrs {:href "https://stackoverflow.com", :class "-logo js-gps-track", :data-gps-track "top_nav.click({is_current:true, location:1, destination:8})"}, :content ("\n                   " {:tag :span, :attrs {:class "-img"}, :content ("Stack Overflow")} "\n                ")}

The all-refs collection will contain all the links from the page in Enlive's representation.

For example, you can map over all-refs to collect the href of every link.
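As a rough sketch of collecting the hrefs, the snippet below works on a hand-written stand-in for all-refs, so it runs in plain Clojure without Enlive or network access. The sample-refs data and the hrefs helper are hypothetical and are not taken from the original answer. Because Enlive nodes are ordinary maps, the same function should work on the result of enlive/select.

```clojure
;; Hand-written nodes in the same shape Enlive produces: :tag, :attrs, :content.
(def sample-refs
  [{:tag :a :attrs {:href "https://example.com/a"} :content ["A"]}
   {:tag :a :attrs {:href "https://example.com/b"} :content ["B"]}
   {:tag :a :attrs {:name "anchor-only"} :content ["no href"]}])

;; Collect the :href of every node, skipping nodes that have none.
(defn hrefs [nodes]
  (keep #(get-in % [:attrs :href]) nodes))

;; Iterate over the collected links.
(doseq [href (hrefs sample-refs)]
  (println href))
```

keep is used here rather than map so that <a> tags without an href, such as named anchors, do not show up as nil in the result.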
