Web scraping without knowing the page structure

2 answers

User

Reply #1 · edited 2024-10-01 13:27:42

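With a crawler such as Scrapy (used here only to handle the concurrent downloads), you can write a simple spider like the one below, and Wikipedia is probably a good starting point. This script is a complete example that uses Scrapy, NLTK and Whoosh. It never stops, and it indexes the links with Whoosh so they can be searched later. It is a small Google: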

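# Author: Farsheed Ashouri
# Written for Python 2 and early Scrapy/NLTK releases
# (print statements, urlparse, BaseSpider, nltk.clean_html).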
import os
import sys
import re
## Spider libraries
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from main.items import MainItem
from scrapy.http import Request
from urlparse import urljoin
## indexer libraries
from whoosh.index import create_in, open_dir
from whoosh.fields import *
## html to text conversion module
import nltk

def open_writer():
    if not os.path.isdir("indexdir"):
        os.mkdir("indexdir")
        schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
        ix = create_in("indexdir", schema)
    else:
        ix = open_dir("indexdir")
    return ix.writer()

class Main(BaseSpider):
    name        = "main"
    allowed_domains = ["en.wikipedia.org"]
    start_urls  = ["http://en.wikipedia.org/wiki/Snakes"]

    def parse(self, response):
        writer = open_writer()  ## for indexing
        sel = Selector(response)
        email_validation = re.compile(r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$')
        #general_link_validation = re.compile(r'')
        # Links already seen while parsing this response are kept in this set
        crawledLinks    = set()
        titles = sel.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
        contents = sel.xpath('//body/div[@id="content"]').extract()
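        # Skip pages that lack a title or a content block; otherwise
        # `content` would be undefined when the document is indexed below.
        if not (titles and contents):
            return
        title = titles[0]
        content = contents[0]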
        links   = sel.xpath('//a/@href').extract()


        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            url = urljoin(response.url, link)
            #print url
            ## The part after /wiki/ must not contain ":" (this skips pages such as /wiki/talk:company).
            ## Both http and https are accepted, and the dots in the host name are escaped.
            if url not in crawledLinks and re.match(r'https?://en\.wikipedia\.org/wiki/[^:]+$', url):
                crawledLinks.add(url)
                #print url
                yield Request(url, self.parse)
        item = MainItem()
        item["title"] = title
        print '*'*80
        print 'crawled: %s | it has %s links.' % (title, len(links))
        #print content
        print '*'*80
        item["links"] = list(crawledLinks)
        writer.add_document(title=title, content=nltk.clean_html(content))  ## I save only text from content.
        #print crawledLinks
        writer.commit()
        yield item
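
The link filter in the spider above can be tried on its own, without Scrapy. The following is a minimal Python 3 sketch, not the author's code: the function name `article_links` and the shared `seen` set are assumptions made for illustration, and the pattern accepts both http and https.

```python
import re
from urllib.parse import urljoin

# Same idea as the spider's filter: keep article pages only and reject
# anything whose path after /wiki/ contains ":" (for example "Talk:" pages).
ARTICLE_URL = re.compile(r"https?://en\.wikipedia\.org/wiki/[^:]+$")


def article_links(page_url, hrefs, seen=None):
    """Resolve hrefs against page_url and return the unseen article URLs in order.

    `seen` is a hypothetical shared set; pass the same set on every call
    to avoid returning a URL twice across pages.
    """
    seen = set() if seen is None else seen
    fresh = []
    for href in hrefs:
        url = urljoin(page_url, href)
        if url not in seen and ARTICLE_URL.match(url):
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    hrefs = ["/wiki/Reptile", "/wiki/Talk:Snake", "/wiki/Reptile", "https://example.com/"]
    print(article_links("https://en.wikipedia.org/wiki/Snake", hrefs))
```

In the spider the set is recreated on every call to `parse`, so it only removes duplicates within one page. Sharing one set, as sketched here, would also remove duplicates across pages, although Scrapy's default duplicate request filter already covers that case.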

This is the file for the completed Scrapy example:

User

Reply #2 · edited 2024-10-01 13:27:42

You are basically asking "how do I write a search engine?" That is... not trivial.

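The right way to do this is to use the search API from Google (or Bing, or Yahoo!, or...) and show the top n results. But if you are just working on a personal project to teach yourself some concepts (though I am not sure exactly which ones), here are a few suggestions: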

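Search the text content of the appropriate tags (<p>, <div>, and so on) for the relevant keywords (duh).
Use the relevant keywords to check for the presence of tags that might contain what you are looking for. For example, if you are looking for a list of things, then a page containing <ul> or <ol> or even <table> might be a good candidate.
Build a synonym dictionary and search each page for synonyms of your keywords as well. Limiting yourself to "US" might mean that a page containing only "America" gets an artificially low rank.
Keep a list of words that are not in your keyword set, and rank the pages that contain the most of them higher. Such pages are (arguably) more likely to contain the answer you are looking for.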
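
The first three suggestions above can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not a ranking algorithm from the answer: the `SYNONYMS` table, the `score_page` function, the single-word keywords and the bonus of 5 points for list-like tags are all made up for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical synonym table; a real one would be far larger.
SYNONYMS = {"us": {"us", "usa", "america"}}
TEXT_TAGS = {"p", "div"}            # tags whose text is searched
LIST_TAGS = {"ul", "ol", "table"}   # tags that hint at list-like content


class PageStats(HTMLParser):
    """Collect the words found inside TEXT_TAGS and record every tag seen."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of TEXT_TAGS currently open
        self.words = []
        self.tags_seen = set()

    def handle_starttag(self, tag, attrs):
        self.tags_seen.add(tag)
        if tag in TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def score_page(html, keywords, want_list=False):
    """Count keyword and synonym hits; add a bonus if a list-like tag is present."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    score = 0
    for keyword in keywords:
        key = keyword.lower()
        variants = SYNONYMS.get(key, {key})
        score += sum(1 for word in stats.words if word in variants)
    if want_list and stats.tags_seen & LIST_TAGS:
        score += 5  # arbitrary bonus, chosen only for this sketch
    return score


if __name__ == "__main__":
    page_a = "<div><p>America has many snakes.</p><ul><li>Cobra</li></ul></div>"
    page_b = "<p>The US is large.</p>"
    for page in (page_a, page_b):
        print(score_page(page, ["US", "snakes"], want_list=True))
```

The sketch assumes well-formed markup with closed <p> and <div> tags. Real pages often leave <p> unclosed, so a production version would need a more forgiving parser.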

Good luck (you'll need it)!
