Java: extracting images with jsoup when there is no img tag

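I need to extract the images from a div where the src is not listed in an img tag. I also cannot use getElementById(), because the id varies from page to page. Is there a regular expression that can extract images from the document in a case like this? Thanks for your help.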

HTML snippet:

<div
    class="rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center"
    data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
    data-mobile-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
    data-tablet-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
    data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg"
    style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);">
</div>

3 answers

  1. # Answer 1

    Explanation in the comments:

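        import java.util.List;
        import java.util.Map;
        import java.util.stream.Collectors;
        import java.util.stream.Stream;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        // the statements below belong inside a method
        Document doc = Jsoup.parse(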
            "<div class=\"rendition-bg rendition-bg alignment desktop-center-center mobile-center-center \" "
            + "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" "
            + "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>");
    
        // select all elements with "data-src" attribute, but here we use only the first of them
        Map<String, String> dataAttributes = doc.select("[data-src]").first().dataset();
    
        // here we have all data attributes of this element:
        System.out.println(dataAttributes);
    
        // you can access them like this:
        System.out.println(dataAttributes.get("mobile-rendition"));
        System.out.println(dataAttributes.get("tablet-rendition"));
        System.out.println(dataAttributes.get("desktop-rendition"));
    
        // split and create list of urls (contains duplicates)
        List<String> urls = dataAttributes.entrySet().stream().flatMap(e -> Stream.of(e.getValue().split("\\.transform")))
                    .collect(Collectors.toList());
    
        // if you need only unique urls use this one instead (note the escaped dot, split() takes a regex):
        //  Set<String> urls = dataAttributes.entrySet().stream().flatMap(e -> Stream.of(e.getValue().split("\\.transform"))).collect(Collectors.toSet());
        System.out.println(urls);
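
    One detail behind the split calls above: `String.split` interprets its argument as a regular expression, so an unescaped `.` matches any character. The sketch below uses only the JDK; the class name `TransformSplit` and the second sample path are made up for illustration. It contrasts an unescaped pattern with `Pattern.quote`, which treats the separator as literal text.

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    public class TransformSplit {
        // Splits a rendition path on the literal text ".transform".
        public static List<String> splitLiteral(String path) {
            return Arrays.asList(path.split(Pattern.quote(".transform")));
        }

        // Same split, but the dot is left unescaped and therefore matches any character.
        public static List<String> splitUnescaped(String path) {
            return Arrays.asList(path.split(".transform"));
        }

        public static void main(String[] args) {
            String real = "/content/dam/Image.jpg.transform/default-mobile/image.jpg";
            // hypothetical path that contains "xtransform" but no literal ".transform"
            String tricky = "/content/dam/xtransform/image.jpg";

            System.out.println(splitLiteral(real));     // two parts
            System.out.println(splitLiteral(tricky));   // left in one piece
            System.out.println(splitUnescaped(tricky)); // split in two by accident
        }
    }
    ```

    For the paths in the question both variants happen to give the same result; the difference only shows up when some other character precedes "transform".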
    
  2. # Answer 2

    Far from an elegant or simple solution, but here is something that will hopefully get you started:

        // needs org.jsoup.Jsoup, org.jsoup.nodes.Attribute, org.jsoup.nodes.Element,
        // java.util.List, java.util.regex.*, java.util.stream.Stream
        // and a static import of java.util.stream.Collectors.toList
        String snippet =
          "<div class=\"rendition-bg rendition-bg alignment desktop-center-center " +
            "mobile-center-center \" " +
            "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
            "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
            "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
            "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" " +
            "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>";
    
        List<String> imgAttrs =
          Jsoup.parse(snippet)
            .getElementsByTag("div")
            .stream()
            // get lists of attributes
            .map(Element::attributes)
            // flatten all attrs to single list
            .flatMap(attrs -> attrs.asList().stream())
            // keep only attributes whose value mentions a .jpg file
            .filter(attribute -> attribute.getValue() != null && attribute.getValue().contains(".jpg"))
            // map to values
            .map(Attribute::getValue)
            // replace all ".transform" with a whitespace
            .map(attrValue -> attrValue.replace(".transform", " "))
            // get url value of a "background-image"; other values pass through unchanged
            .map(attrValue -> getUrlFromBackgroundImage(attrValue))
            // split attributes by whitespaces
            .flatMap(attrValue -> Stream.of(attrValue.split(" ")))
            .collect(toList());
    
         private static String getUrlFromBackgroundImage(final String backgroundImage) {
            // the optional opening quote sits outside the capturing group so it is not part of the result
            Pattern pattern = Pattern.compile("background-image:\\s?url\\(['\"]?((?:.*?\\.(?:png|jpg|jpeg|gif)\\s?)*)");
            Matcher matcher = pattern.matcher(backgroundImage);
            return matcher.find() ? matcher.group(1) : backgroundImage;
         }
    

    The contents of imgAttrs should be:

    /content/dam/Image.jpg
    /default-mobile/image.jpg
    /content/dam/Image.jpg
    /default-mobile/image.jpg
    /content/dam/Image.jpg
    /default-mobile/image.jpg
    /content/dam/Image.jpg
    /default-desktop/image.jpg
    /content/dam/Image.jpg
    /default-mobile/image.jpg
    

    Not sure whether this is what you need.
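
    If only the URL inside the style attribute is of interest, the regular-expression part can be isolated. The following is a minimal JDK-only sketch, not part of the answer's code: the class name `BackgroundImageUrl` and the method `extract` are hypothetical, and it assumes the style value is already entity-decoded (real quotes instead of `&quot;`), which is how jsoup returns attribute values.

    ```java
    import java.util.Optional;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class BackgroundImageUrl {
        // Matches url(...) after "background-image:", with an optional pair of
        // matching single or double quotes around the path (group 1 = quote, group 2 = path).
        private static final Pattern URL_PATTERN =
            Pattern.compile("background-image\\s*:\\s*url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)");

        // Returns the path inside url(...), or an empty Optional when the style has none.
        public static Optional<String> extract(String style) {
            Matcher m = URL_PATTERN.matcher(style);
            return m.find() ? Optional.of(m.group(2)) : Optional.empty();
        }

        public static void main(String[] args) {
            String style =
                "background-image: url(\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\");";
            System.out.println(extract(style).orElse("no url found"));
        }
    }
    ```

    Unlike the pattern in the answer, this one does not restrict the file extension, so it also accepts paths that end in something other than png, jpg, jpeg or gif.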

  3. # Answer 3

    Looking closely at the div, we can see that it references two images:

    data-src=                  "/content/dam/Image.jpg.transform/default-mobile/image.jpg"
    data-mobile-rendition=     "/content/dam/Image.jpg.transform/default-mobile/image.jpg"
    data-tablet-rendition=     "/content/dam/Image.jpg.transform/default-mobile/image.jpg"
    data-desktop-rendition=    "/content/dam/Image.jpg.transform/default-desktop/image.jpg"
    style=                     "background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);"
    

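    Of the four data-* references, three point to the same image and the fourth points to the desktop image (the style attribute repeats the mobile one). So if we need to extract the URLs of these two images: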

    data-src=                  "/content/dam/Image.jpg.transform/default-mobile/image.jpg"
    data-desktop-rendition=    "/content/dam/Image.jpg.transform/default-desktop/image.jpg"
    

    We can use the following code:

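            // doc is the parsed page, e.g. Jsoup.connect(pageUrl).get() (pageUrl is a placeholder)
            Elements els = doc.select("div.rendition-bg");
            for (Element ele : els) {
                // absUrl() resolves against the document's base URI and returns ""
                // when there is none; use attr() to read the raw relative path instead
                System.out.println(ele.absUrl("data-src"));
                System.out.println(ele.absUrl("data-desktop-rendition"));
            }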
    

    Let me know if I have understood your requirement correctly.
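
    Since the attribute values in the question are site-relative paths, they have to be combined with the page address before they can be downloaded. As a rough illustration of that resolution step, here is a JDK-only sketch; the class name `RelativeUrlResolver` and the `https://example.com` address are placeholders, not taken from the question.

    ```java
    import java.net.URI;

    public class RelativeUrlResolver {
        // Resolves a relative path (such as a data-src value) against the URL of the page it came from.
        public static String resolve(String pageUrl, String relativePath) {
            return URI.create(pageUrl).resolve(relativePath).toString();
        }

        public static void main(String[] args) {
            String pageUrl = "https://example.com/products/page.html"; // placeholder page address
            String dataSrc = "/content/dam/Image.jpg.transform/default-desktop/image.jpg";
            System.out.println(resolve(pageUrl, dataSrc));
        }
    }
    ```

    When the document is parsed with a base URI, jsoup can do this resolution itself, so a helper like this is mainly useful when only the raw HTML string is available.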