Parsing HTML in Enlive like BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
#
#
# The Dormouse's story
#
#
#
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link2">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
#
#
# ...
#
# </body>
# </html>

2 Answers

User

Answer 1 · edited 2024-10-01 19:21:17

First, ingest some HTML with Enlive's html-resource function. Here we fetch news.google.com:

(require '[net.cgrand.enlive-html :as html])

(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))
(def goog-news (fetch-url "https://news.google.com"))
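The BeautifulSoup snippet in the question parses an in-memory string (html_doc) rather than a URL. As a rough counterpart, html-resource can also read from a java.io.Reader, so a string can be wrapped in a StringReader. This is only a sketch: it assumes Enlive is on the classpath, and html-doc is a shortened, made-up stand-in for the question's sample document.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical sample markup standing in for the question's html_doc.
(def html-doc
  (str "<html><head><title>The Dormouse's story</title></head>"
       "<body><p>Once upon a time there were three little sisters.</p></body></html>"))

;; Parse the string by handing html-resource a Reader instead of a URL.
(def nodes
  (html/html-resource (java.io.StringReader. html-doc)))

;; Select the <title> element and read its text content.
(def title
  (html/text (first (html/select nodes [:title]))))
```

Parsing from a string keeps the example independent of the network, which also makes it easier to try selectors at the REPL.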

To get all the <a> tags, use the select function with a simple selector (the second argument):

(html/select goog-news [:a])

This evaluates to a sequence of maps, one per <a> tag. Here is an example <a> tag map from today's news:

{:tag :a,
 :attrs {:class "nuEeue hzdq5d ME7ew",
         :target "_blank",
         :href "https://www.vanityfair.com/hollywood/2018/01/first-black-panther-reviews",
         :jsname "NV4Anc"},
 :content ("The First Black Panther Reviews Are Here—and They're Ecstatic")}

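To get the inner text of each <a>, map Enlive's text function over the results, e.g. (map html/text *1). To get each href, use (map (comp :href :attrs) *1). At the REPL, *1 holds the most recently evaluated result.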
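Outside a REPL there is no *1, so the same extraction can be sketched with named values. The example below assumes Enlive is on the classpath; the fragment is a hypothetical excerpt modeled on the question's sample document, and the names fragment, sisters, and links are illustrative only. It uses html-snippet, which parses HTML given as strings, and a class selector (:a.sister) instead of the bare :a.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical fragment modeled on the question's sample document.
(def fragment
  (html/html-snippet
    "<p class=\"story\">"
    "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>, "
    "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>"
    "</p>"))

;; :a.sister matches <a> elements that carry the class "sister".
(def sisters (html/select fragment [:a.sister]))

;; Pair each link's text with its href, bound to a name instead of *1.
(def links
  (mapv (fn [node]
          {:text (html/text node)
           :href (get-in node [:attrs :href])})
        sisters))
```

Returning a vector of small maps keeps text and href together, which is often handier than two parallel sequences.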

User

Answer 2 · edited 2024-10-01 19:21:17

This is straightforward:

(require '[net.cgrand.enlive-html :as enlive])

(let [data (enlive/html-resource (java.net.URL. "https://www.stackoverflow.com"))
      all-refs (enlive/select data [:a])]
  (first all-refs))

;;=> {:tag :a, :attrs {:href "https://stackoverflow.com", :class "-logo js-gps-track", :data-gps-track "top_nav.click({is_current:true, location:1, destination:8})"}, :content ("\n                   " {:tag :span, :attrs {:class "-img"}, :content ("Stack Overflow")} "\n                ")}

The all-refs collection contains every link on the page, in Enlive's node representation.

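  (map #(-> % :attrs :href) all-refs)

Placed in the body of the let above, this expression would, for example, collect the href values of all the links.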
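Because Enlive nodes are plain Clojure maps, href collection can be tried without fetching a page. The sketch below uses hand-written node maps (hypothetical data in the same shape as the output shown above) and a hypothetical helper named hrefs. It swaps map for keep so that anchors without an href attribute do not leave nil entries in the result.

```clojure
;; Hand-written node maps in the shape Enlive's select returns (hypothetical data).
(def sample-refs
  [{:tag :a, :attrs {:href "https://stackoverflow.com"}, :content ["Stack Overflow"]}
   {:tag :a, :attrs {:name "top"}, :content []}
   {:tag :a, :attrs {:href "/questions"}, :content ["Questions"]}])

;; keep works like map but drops nil results, so anchors without :href are skipped.
(defn hrefs [nodes]
  (keep (comp :href :attrs) nodes))

(def found (hrefs sample-refs))
```

With real data, (hrefs all-refs) could be used in the body of the let in place of the plain map call.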
