Java: scraping a website with multiple alphabet tabs, each containing multiple pages

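I am scraping a website whose data is arranged alphabetically in A-Z tabs, and each letter tab also spans several pages. How can I extract all the URLs from it?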

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BrandIndexScraper {
    public static void main(String[] args) throws Exception {

        String keyword = "a";
        String url = "http://www.medindia.net/drug-price/brand-index.asp?alpha=" + keyword;

        Document doc = Jsoup.connect(url).get();
        //Elements pages = doc.select("div.pagination a");
        // The second table on the page is the one being scraped for links.
        Element table = doc.select("table").get(1);

        for (Element row : table.select("tr")) {
            for (Element td : row.select("td")) {
                Elements links = td.select("a[href]");
                for (Element link : links) {
                    System.out.println("link : " + link.attr("href"));
                    System.out.println("text : " + link.text());
                }
            }
        }
    }
}
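
One way to cover every letter tab is to build the index URLs up front and then fetch them one by one. The sketch below uses only the standard library and only constructs the URLs. It assumes the site accepts a `page` query parameter next to `alpha`, which the question does not confirm; the class and method names (`IndexUrls`, `pageUrl`, `firstPages`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexUrls {
    // Base URL copied from the question.
    static final String BASE = "http://www.medindia.net/drug-price/brand-index.asp";

    // Assumption: pages within a letter tab are addressed with "&page=N".
    static String pageUrl(char letter, int page) {
        return BASE + "?alpha=" + letter + "&page=" + page;
    }

    // First page of every letter tab, a through z.
    static List<String> firstPages() {
        List<String> urls = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            urls.add(pageUrl(c, 1));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String u : firstPages()) {
            System.out.println(u);
        }
    }
}
```

Each generated URL could then be passed to `Jsoup.connect(...)` in place of the single hard-coded `url`.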

1 answer

  1. # Answer 1

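    So I was able to figure out how to scrape the data from every alphabet tab and from every page inside each tab. The code is below. However, after scraping a few hundred links I get a read timeout error. Is there an efficient way to do this? Could I apply multithreading here?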

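    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class BrandIndexCrawler {
        public static void main(String[] args) throws Exception {

            final int OK = 200;
            String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
            //String keyword = "a";
            for (String keyword : keywords) {
                final String url = "https://www.medindia.net/drug-price/brand-index.asp?alpha=" + keyword;
                // Reset per letter; otherwise paging would continue from the previous letter's state.
                int page = 1;
                int status = OK;
                // Assumption: the site answers with a non-200 status past the last page of a tab.
                while (status == OK) {
                    String currentURL = url + "&page=" + page;
                    Connection.Response response = Jsoup.connect(currentURL)
                            .userAgent("Mozilla/5.0")
                            .timeout(30_000)         // read timeout in ms; illustrative value, tune as needed
                            .ignoreHttpErrors(true)  // expose non-200 via statusCode() instead of throwing
                            .execute();
                    status = response.statusCode();

                    if (status == OK) {
                        Document doc = response.parse();

                        // The second table on the page is the one being scraped for links.
                        Element table = doc.select("table").get(1);

                        for (Element row : table.select("tr")) {
                            for (Element td : row.select("td")) {
                                Elements links = td.select("a[href]");
                                for (Element link : links) {
                                    System.out.println("link : " + link.attr("href"));
                                    System.out.println("text : " + link.text());
                                }
                            }
                        }
                    }
                    page++;
                }
            }
        }
    }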
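
    On the read timeouts and the multithreading question, one possible approach is to retry a failed fetch after a growing pause and to hand each letter tab to a small thread pool. The following is a minimal sketch under stated assumptions, not a tested solution for this site: it uses only the standard library, `LetterCrawler`, `LetterScraper`, `withRetry` and `crawl` are hypothetical names, the per-letter jsoup loop is represented by the `LetterScraper` stand-in, and the retry count (3) and initial delay (500 ms) are illustrative values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class LetterCrawler {

    /** Hypothetical stand-in for the per-letter scraping loop; returns the links found. */
    interface LetterScraper {
        List<String> scrape(String letter) throws Exception;
    }

    /** Runs the task, pausing after each failure and doubling the pause every time. */
    static <T> T withRetry(Callable<T> task, int maxAttempts, long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last; // every attempt failed; rethrow the most recent error
    }

    /** Scrapes each letter on a fixed-size pool; results keep the order of the input letters. */
    static List<String> crawl(List<String> letters, LetterScraper scraper, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String letter : letters) {
                Callable<List<String>> job = () -> withRetry(() -> scraper.scrape(letter), 3, 500L);
                futures.add(pool.submit(job));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // blocks until that letter is done
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake scraper so the sketch runs without network access.
        LetterScraper fake = letter -> Arrays.asList("link-for-" + letter);
        for (String link : crawl(Arrays.asList("a", "b", "c"), fake, 2)) {
            System.out.println(link);
        }
    }
}
```

    In real use the `LetterScraper` would wrap the jsoup paging loop for one letter. Keeping the pool small (two to four threads) is probably wise, since many parallel requests to the same host may make timeouts more likely rather than less.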