
How do I extract links with Jsoup in Java?

I am using Jsoup to crawl a site and collect results, and I want to run a keyword search. For example, when I crawl http://www.business-standard.com/ for the keywords:

google hyderabad

it should give me this link:

http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html

I wrote the code below, but it does not return the expected results:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {

  public static void main(String[] args) {

    Document doc;
    try {
        doc = Jsoup.connect("http://www.business-standard.com").userAgent("Mozilla").get();
        String title = doc.title();
        System.out.println("title : " + title);

        Elements links = doc.select("a:contains(google)");
        for (Element link : links) {
            System.out.println("\nlink : " + link.attr("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
  }
}

The results are as follows:

title : India News, Latest News Headlines, BSE live, NSE Live, Stock Markets Live, Financial News, Business News & Market Analysis on Indian Economy - Business Standard News

link : /photo-gallery/current-affairs/mumbai-central-turns-into-wi-fi-zone-courtesy-google-power-2574.htm
text : Mumbai Central turns into Wi-Fi zone, courtesy Google power

link : plus.google.com/+businessstandard/posts
text : Google+

Jsoup 1.8.2


1 Answer

  1. # Answer 1

    Try the following url instead:

    http://www.business-standard.com/search?q=<keyword>
    
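The keyword must be URL-encoded before it is appended to the query string (the space in `google hyderabad` would otherwise break the request line). A minimal standalone sketch of that URL construction, using only the JDK:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrl {

    // Builds the search URL; URLEncoder turns the space into a URL-safe "+".
    static String build(String keyword) {
        try {
            return "http://www.business-standard.com/search?q="
                    + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(build("google hyderabad"));
        // http://www.business-standard.com/search?q=google+hyderabad
    }
}
```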

    Sample code

    // additional import needed on top of the ones in the question:
    // import java.net.URLEncoder;
    Document doc;
    try {
        String keyword = "google hyderabad";
        doc = Jsoup //
                .connect("http://www.business-standard.com/search?q=" + URLEncoder.encode(keyword, "UTF-8")) //
                .userAgent("Mozilla") //
                .get();
    
        String title = doc.title();
        System.out.println("title : " + title);
    
        Elements links = doc.select("a:contains(google)");
        for (Element link : links) {
            System.out.println("\nlink : " + link.absUrl("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    

    Output

    The link you are looking for is the second one:

    title : Search
    
    link : http://www.business-standard.com/article/pti-stories/google-to-invest-more-in-india-set-up-new-campus-115121600841_1.html
    text : Google to invest more in India, set up new campus in Hyderabad
    
    link : http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html
    text : Google to get 7.2 acres in Hyderabad IT corridor for its campus
    
    link : http://www.business-standard.com/article/technology/swine-flu-closes-google-hyderabad-office-for-2-days-109071500023_1.html
    text : Swine flu closes Google Hyderabad office for 2 days
    
    link : http://www.business-standard.com/article/pti-stories/facebook-posts-strong-4q-as-company-closes-gap-with-google-116012800081_1.html
    text : Facebook posts strong 4Q as company closes gap with Google
    
    link : http://www.business-standard.com/article/pti-stories/r-day-bsf-camel-contingent-march-on-google-doodle-116012600104_1.html
    text : R-Day: BSF camel contingent marches on Google doodle
    
    link : http://www.business-standard.com/article/international/daimler-ceo-says-apple-google-making-progress-on-car-116012501298_1.html
    text : Daimler CEO says Apple, Google making progress on car
    
    link : https://plus.google.com/+businessstandard/posts
    text : Google+
    

    Discussion

    The sample code above only fetches the first result page. If you need more results, extract the next-page link (#hpcontentbox div.next-colum > a) and crawl it with Jsoup as well.

    You will notice that the url I gave you above accepts some other parameters as well:

    • itemPerPages: self-explanatory (defaults to 19)
    • page: the search result page index (defaults to 1 if not provided)
    • company-code: ?? (can be left empty)

    You can try giving itemPerPages a larger value (100 or more) in this url; that may shorten your crawling time.
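If you iterate over result pages, the URLs can be built directly from the parameters listed above. A sketch, assuming page and itemPerPages behave as described (the site's actual handling of these parameters is not verified here):

```java
public class SearchPages {

    // Builds a page-indexed search URL; itemPerPages and page are the query
    // parameters observed on the site.
    static String pageUrl(String encodedKeyword, int page, int itemPerPages) {
        return "http://www.business-standard.com/search?q=" + encodedKeyword
                + "&itemPerPages=" + itemPerPages
                + "&page=" + page;
    }

    public static void main(String[] args) {
        // Fetch the first three result pages with 100 items each, e.g. by
        // passing each url to Jsoup.connect(url).userAgent("Mozilla").get().
        for (int page = 1; page <= 3; page++) {
            System.out.println(pageUrl("google+hyderabad", page, 100));
        }
    }
}
```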

    The absUrl method is used so that you get absolute URLs instead of relative ones.
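To see what absUrl does, the same resolution can be reproduced with java.net.URI. A minimal sketch; URI.resolve is a stand-in to illustrate the resolution, not Jsoup's actual implementation:

```java
import java.net.URI;

public class AbsUrlDemo {

    // Resolves a relative href against the page's base URI, which is what
    // link.absUrl("href") returns instead of the raw attribute value.
    static String resolve(String baseUri, String relativeHref) {
        return URI.create(baseUri).resolve(relativeHref).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.business-standard.com/search?q=google";
        String rel = "/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html";
        System.out.println(resolve(base, rel));
        // http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html
    }
}
```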