Scraping the internal links of the homepage after logging in

I'll keep this simple. I have a login page. I log in and see the homepage. The homepage has two links, and I want to open both of them. Each link contains two pieces of data, so after logging in I want all four pieces of data from the two homepage links. I can get as far as scraping the links themselves, but I cannot scrape the data inside them. How do I do that? Thanks.

My code (P.S. I wrote this purely by intuition, and I don't know whether it's even possible):

import scrapy


class ClassroomSpider(scrapy.Spider):
    name = 'classroom'
    start_urls = ['http://classroom.dwit.edu.np/login/index.php']
    login_url = 'http://classroom.dwit.edu.np/login/index.php'

    def parse(self, response):
        # code to log into the website
        data = {
            'username': 'mynameisaj',
            'password': 'somerandomvalue',
        }
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)

    def parse_quotes(self, response):
        # links on the homepage
        links = response.xpath('//*[@class="event"]/a/@href').extract()

        for link in links:
            scraped_info = {
                'Link': link,
            }
            yield scraped_info

        # follow each homepage link (extract() returns a list, so iterate
        # over it instead of passing the whole list as a single URL)
        next_page_urls = response.xpath('//*[@class="event"]/a/@href').extract()
        for url in next_page_urls:
            yield scrapy.Request(url=url, callback=self.parse_data)

    def parse_data(self, response):
        # data inside each link from the homepage
        data = response.xpath('//*[@class="no-overflow"]/p/text()').extract()

        for value in data:
            scraped_info1 = {
                'Data': value,
            }
            yield scraped_info1
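
For reference, the usual Scrapy pattern for this kind of two-level crawl is to yield one Request per extracted link and carry the link itself along in request.meta, so each piece of data stays tagged with the page it came from. A minimal, untested sketch assuming the same selectors as the code above:

def parse_quotes(self, response):
    # yield one request per homepage link, remembering which link it was
    for link in response.xpath('//*[@class="event"]/a/@href').extract():
        yield scrapy.Request(url=link, callback=self.parse_data, meta={"link": link})

def parse_data(self, response):
    # each yielded item records both the source link and one piece of data
    for value in response.xpath('//*[@class="no-overflow"]/p/text()').extract():
        yield {"Link": response.meta["link"], "Data": value}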

Update

The HTML element is:

<div id="intro" class="box generalbox boxaligncenter"><div class="no-overflow"><p>1) Write a program to print the area and perimeter of a triangle having sides of 3, 4 and 5 units by creating a class named 'Triangle' without any parameter in its constructor.</p>
<p><br>2) Write a program that would print the information (name, year of joining, salary, address) of three employees by creating a class named 'Employee'. <br>Create properties as needed for Employee class and set values to those properties using constructor with arguments.</p>
<p>The output should be as follows:</p>
<table border="0" style="width: 348px; height: 43px;">
<tbody>
<tr>
<td><strong><span data-mce-mark="1">Name</span></strong></td>
<td><strong><span data-mce-mark="1">Year of joining</span></strong></td>
<td><strong><span data-mce-mark="1">Address</span></strong></td>
</tr>
<tr>
<td><span data-mce-mark="1">Robert</span></td>
<td><span data-mce-mark="1">1994</span></td>
<td><span data-mce-mark="1">64C- WallsStreet</span></td>
</tr>
<tr>
<td><span data-mce-mark="1">Sam</span></td>
<td><span data-mce-mark="1">2000</span></td>
<td>Kathmandu</td>
</tr>
</tbody>
</table>
<p></p>
<p>3) Create a class 'Degree' having a method 'getDegree' that prints "I got a degree". It has two subclasses namely 'Undergraduate' and 'Postgraduate' each having a method with the same name that prints "I am an Undergraduate" and "I am a Postgraduate" respectively. Call the method by creating an object of each of the three classes.</p>
<p>Note: Use separate class with main method</p></div></div>

All it scrapes is the last p element instead of all of them.
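
Part of what you see may come from the XPath itself: p/text() returns only text nodes that are direct children of each <p>, so anything wrapped in child tags such as <span> or <strong> (as in the table above) is skipped. A minimal, untested sketch of pulling the full text of every paragraph instead, assuming the same no-overflow container:

# string(.) concatenates all descendant text of each <p>, so text inside
# child tags like <span> or <strong> is included as well
paragraphs = response.xpath('//*[@class="no-overflow"]/p')
texts = [p.xpath('string(.)').extract_first().strip() for p in paragraphs]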


1 Answer

If you want to combine the output for both links into a single item, you need to use request.meta (untested):

def parse_quotes(self, response):

    # first you need to get ALL the links you want to process
    your_links = response.xpath('//*[@class="event"]/a/@href').extract()

    first_link = your_links.pop(0)

    # and start processing from the very first link, carrying the rest in meta
    yield scrapy.Request(url=first_link, callback=self.parse_data, meta={"links": your_links})

def parse_data(self, response):

    # if we already have some item data from a previous link, continue with it
    item = response.meta.get("item", {})

    # below you need to parse the HTML and update your item; extend a list
    # instead of rebinding the variable, so every <p> text is kept rather
    # than only the last one
    data = response.xpath('//*[@class="no-overflow"]/p/text()').extract()
    item.setdefault("Data", []).extend(data)

    # now we need to check if we need to process other links
    if response.meta["links"]:
        next_link_url = response.meta["links"].pop(0)
        yield scrapy.Request(url=next_link_url, callback=self.parse_data,
                             meta={"links": response.meta["links"], "item": item})
    else:
        # no more links to process, just yield the combined output
        yield item
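
On newer Scrapy versions (1.7+), the same hand-off can be done with cb_kwargs instead of response.meta; the values arrive as keyword arguments of the callback. An equivalent, untested sketch of the answer above:

def parse_quotes(self, response):
    links = response.xpath('//*[@class="event"]/a/@href').extract()
    first_link = links.pop(0)
    # cb_kwargs entries are passed straight to parse_data as named parameters
    yield scrapy.Request(url=first_link, callback=self.parse_data,
                         cb_kwargs={"links": links, "item": {}})

def parse_data(self, response, links, item):
    item.setdefault("Data", []).extend(
        response.xpath('//*[@class="no-overflow"]/p/text()').extract())
    if links:
        yield scrapy.Request(url=links.pop(0), callback=self.parse_data,
                             cb_kwargs={"links": links, "item": item})
    else:
        yield item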
