Java jsoup: scraping image width and height from an amazon.com link

Below is a sample amazon link whose image width and height I am trying to scrape:

http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg

I am using jsoup, and here is my code:

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Crawler_main {

/**
 * @param args
 */
public static void main(String[] args) {
    // TODO Auto-generated method stub
    String filepath = "C:/imagelinks.txt";
    try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
        String line;
        String width;
        //String height;
        while ((line = br.readLine()) != null) {
           // process the line.
            System.out.println(line);
            Document doc = Jsoup.connect(line).ignoreContentType(true).get();
            //System.out.println(doc.toString());
            Elements jpg = doc.getElementsByTag("img");
            width = jpg.attr("width");
            System.out.println(width);
            //String title = doc.title();
        }
    }
    catch (FileNotFoundException ex){
        System.out.println("File not found");
    }
    catch(IOException ex){
        System.out.println("Unable to read line");
    }
    catch (Exception ex){
        System.out.println("Exception occurred");
    }
}

}

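The HTML is fetched, but when I extract the width attribute it returns an empty value. When I print the fetched HTML, it contains garbage characters (I guess what I am calling garbage characters is the actual image data). For example:

I can't even paste the doc.toString() result into this editor. Help!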


1 answer

  1. # Answer 1

    The problem is that you are fetching a jpg file, not any HTML. The call to ignoreContentType(true) is a clue, because its documentation states:

    Ignore the document's Content-Type when parsing the response. By default this is false, an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.)

    If you want the width of the actual jpg file, this SO answer may be useful:

    BufferedImage bimg = ImageIO.read(new File(filename));
    int width          = bimg.getWidth();
    int height         = bimg.getHeight();
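
Since the links in the question point at remote images rather than local files, the same idea can be applied to a URL. The following is only a minimal sketch, assuming the standard javax.imageio API is acceptable in place of jsoup; the class name ImageSize and the helpers fromStream and fromUrl are hypothetical names chosen for illustration. Note that ImageIO.read returns null when no registered reader can decode the data, so that case is checked explicitly.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import javax.imageio.ImageIO;

// Hypothetical helper class: reads image dimensions without going through an HTML parser.
class ImageSize {

    /** Decodes an image from the stream and returns {width, height}. */
    static int[] fromStream(InputStream in) throws IOException {
        BufferedImage img = ImageIO.read(in);
        if (img == null) {
            // ImageIO.read returns null when no registered reader recognises the data
            throw new IOException("No ImageIO reader could decode the data");
        }
        return new int[] { img.getWidth(), img.getHeight() };
    }

    /** Opens the URL, decodes the image it serves, and returns {width, height}. */
    static int[] fromUrl(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return fromStream(in);
        }
    }

    // Each command-line argument is assumed to be an absolute image URL.
    public static void main(String[] args) throws IOException {
        for (String link : args) {
            int[] size = fromUrl(URI.create(link).toURL());
            System.out.println(link + " -> " + size[0] + " x " + size[1]);
        }
    }
}
```

In the question's loop, each line read from the links file could be passed to fromUrl instead of Jsoup.connect(line). This approach decodes the whole image just to learn its size, which is simple but may be slow for large files.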