How do you programmatically download a web page in Java over HTTP?


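I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how can I handle the various types of compression?

How would I go about doing that using Java?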


6 answers

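# Answer 1

Bill's answer is very good, but you may want to do some things with the request, such as handling compression or setting a user agent. The following code shows how to accept and decode the common compression types for a request.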

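import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail for http(s) URLs
conn.setInstanceFollowRedirects(true); // per-connection setting; the static setFollowRedirects changes the JVM-wide default
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
// reading a response header sends the request, so set every request property before this line
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    // Inflater(true) expects raw deflate data without the zlib header
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}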

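To also set the user agent, add the following line before the request is sent (that is, before the call to getContentEncoding()):

conn.setRequestProperty("User-Agent", "my agent name");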
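
The stream-wrapping step above can be pulled into a small helper so it can be tried without a network connection. This is only a sketch: the class name `ContentDecoder` and the method `wrap` are made up for illustration, it needs Java 9 or later for `readNBytes` and `readAllBytes`, and the zlib-header check is a heuristic added here because some servers send zlib-wrapped data for `deflate` while others send raw deflate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical helper; the names are illustrative, not part of any library.
final class ContentDecoder {

    /** Wraps a raw response stream according to the Content-Encoding value (may be null). */
    static InputStream wrap(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw; // no compression declared
        }
        String enc = contentEncoding.trim();
        if (enc.equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        if (enc.equalsIgnoreCase("deflate")) {
            // Peek at the first two bytes, then push them back so nothing is lost.
            PushbackInputStream in = new PushbackInputStream(raw, 2);
            byte[] head = new byte[2];
            int n = in.readNBytes(head, 0, 2);
            if (n > 0) {
                in.unread(head, 0, n);
            }
            // A zlib header has compression method 8 and a 16-bit value divisible by 31.
            boolean zlib = n == 2
                    && (head[0] & 0x0F) == 8
                    && (((head[0] & 0xFF) << 8) | (head[1] & 0xFF)) % 31 == 0;
            // nowrap = true means "raw deflate, no zlib header".
            return new InflaterInputStream(in, new Inflater(!zlib));
        }
        return raw; // unknown encoding: hand back the stream untouched
    }

    public static void main(String[] args) throws IOException {
        // Simulate a gzip-compressed response body in memory.
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(page);
        }
        try (InputStream in = wrap(new ByteArrayInputStream(buf.toByteArray()), "gzip")) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

With a live connection the call would look like `ContentDecoder.wrap(conn.getInputStream(), conn.getContentEncoding())`.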

# Answer 2
I'd use a decent HTML parser like Jsoup. It's then as easy as:
```
String html = Jsoup.connect("http://stackoverflow.com").get().html();
```
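It handles GZIP and chunked responses and character encoding fully transparently. It offers further advantages as well, such as HTML traversal and manipulation with CSS selectors, the way jQuery does it. You only have to fetch the page as a Document rather than as a String.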
```
Document document = Jsoup.connect("http://google.com").get();
```
You really don't want to run basic String methods or even regular expressions on HTML to process it.

See also:
- What are the pros and cons of leading HTML parsers in Java?

# Answer 3

You will most likely need to extract code from a secure web page (https protocol). In the following example, the HTML file is saved to c:\temp\filename.html. Enjoy!

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url </b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html";
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0");

        // try-with-resources closes the reader and the writer even if an exception is thrown.
        // The charset is hard-coded here; replace it with the one the page declares (often UTF-8).
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(con.getInputStream(), "Windows-1252"));
             BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME))) {
            String inputLine;

            // Write each line into the file
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                bw.write(inputLine);
                bw.newLine(); // readLine() strips the line terminator, so add it back
            }
        }
    }
}
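
A hard-coded charset such as "Windows-1252" only works when the page really uses it. One possible way around that, sketched here under the assumption that the server declares the charset in its Content-Type header, is to parse that header and fall back to a default otherwise. The class name `CharsetSniffer` and the method `fromContentType` are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper; not part of the JDK or of the answer above.
final class CharsetSniffer {

    /**
     * Extracts the charset from a Content-Type value such as
     * "text/html; charset=UTF-8". Returns the fallback when the header
     * is missing, has no charset parameter, or names an unknown charset.
     */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(fromContentType("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

With the connection from the example above, the reader could then be built as `new InputStreamReader(con.getInputStream(), CharsetSniffer.fromContentType(con.getContentType(), StandardCharsets.UTF_8))`. Pages that declare their charset only in a `<meta>` tag are not covered by this sketch.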

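# Answer 4

Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give you very much control.

Personally, I'd go with the Apache HTTPClient library.
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components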
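
For comparison, here is roughly how much control the built-in classes do offer: timeouts, redirect handling, a request header, and a status check. This is a sketch rather than a recommendation; the class name `UrlConnectionFetch` is made up, the body is assumed to be UTF-8, and `readAllBytes` needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical example class using only the JDK's built-in HTTP support.
final class UrlConnectionFetch {

    /** Downloads a page as a String, failing on any status other than 200. */
    static String fetch(String urlStr, int timeoutMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(urlStr).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis);   // limit time spent connecting
        conn.setReadTimeout(timeoutMillis);      // limit time spent waiting for data
        conn.setInstanceFollowRedirects(true);
        conn.setRequestProperty("User-Agent", "my agent name");
        try {
            int status = conn.getResponseCode(); // sends the request
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                // Assumption: the page is UTF-8 encoded.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "https://stackoverflow.com";
        System.out.println(fetch(url, 10_000));
    }
}
```

Anything beyond this, such as connection pooling, cookie management, or retry policies, is where a dedicated client library tends to pay off.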

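# Answer 5

Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.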

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class DownloadPage {
    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;

        try {
            url = new URL("http://stackoverflow.com/");
            is = url.openStream();  // throws an IOException
            // decodes with the platform default charset
            br = new BufferedReader(new InputStreamReader(is));

            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}
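
One way to follow the advice about exceptions is to let them propagate to the caller and let try-with-resources do the cleanup. The sketch below also returns the page as a String, which is what the question asked for. The class name `PageReader` and both method names are illustrative, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are chosen for this example only.
final class PageReader {

    /** Reads the whole stream into a String, one '\n' per line, and closes it. */
    static String readPage(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Closing the reader also closes the underlying stream.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Downloads a page; any I/O problem is passed up the call stack. */
    static String download(String url) throws IOException {
        // Assumption: the page is UTF-8 encoded.
        return readPage(URI.create(url).toURL().openStream(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(download("http://stackoverflow.com/"));
    }
}
```

Because `readPage` takes any InputStream, it can be exercised with a `ByteArrayInputStream` in tests and combined with a decompressing stream when the response is compressed.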

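# Answer 6

None of the approaches above download the web page text as it appears in the browser. These days a lot of data is loaded into the browser by scripts in the HTML page. None of the techniques above support scripts; they download only the HTML text. HtmlUnit supports JavaScript. So if you want to download the web page text as it appears in the browser, you should use HtmlUnit.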
