Scraping the internal links of the homepage after logging in

I'll keep this simple. I have a login page. I log in and see the homepage. The homepage has two links, and I want to open both of them. Each link contains two pieces of data, so after logging in I want all four pieces of data from the two homepage links. I can get as far as scraping the links themselves, but I cannot scrape the data inside them. How do I do that? Thanks.

My code (P.S. I wrote this purely by intuition, and I don't know whether it's even possible):

import scrapy


class ClassroomSpider(scrapy.Spider):
    name = 'classroom'
    start_urls = ['http://classroom.dwit.edu.np/login/index.php']
    login_url = 'http://classroom.dwit.edu.np/login/index.php'

    def parse(self, response):
        # code to log into the website
        data = {
            'username': 'mynameisaj',
            'password': 'somerandomvalue',
        }
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)

    def parse_quotes(self, response):
        # links on the homepage
        links = response.xpath('//*[@class="event"]/a/@href').extract()

        for link in links:
            scraped_info = {
                'Link': link,
            }
            yield scraped_info

        # follow each homepage link (extract() returns a list, so iterate
        # over it instead of passing the whole list as a single URL)
        next_page_urls = response.xpath('//*[@class="event"]/a/@href').extract()
        for url in next_page_urls:
            yield scrapy.Request(url=url, callback=self.parse_data)

    def parse_data(self, response):
        # data inside each link from the homepage
        data = response.xpath('//*[@class="no-overflow"]/p/text()').extract()

        for value in data:
            scraped_info1 = {
                'Data': value,
            }
            yield scraped_info1
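
For reference, the usual Scrapy pattern for this kind of two-level crawl is to yield one Request per extracted link and carry the link itself along in request.meta, so each piece of data stays tagged with the page it came from. A minimal, untested sketch assuming the same selectors as the code above:

def parse_quotes(self, response):
    # yield one request per homepage link, remembering which link it was
    for link in response.xpath('//*[@class="event"]/a/@href').extract():
        yield scrapy.Request(url=link, callback=self.parse_data, meta={"link": link})

def parse_data(self, response):
    # each yielded item records both the source link and one piece of data
    for value in response.xpath('//*[@class="no-overflow"]/p/text()').extract():
        yield {"Link": response.meta["link"], "Data": value}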

Update

The HTML element is:

<div id="intro" class="box generalbox boxaligncenter"><div class="no-overflow"><p>1) Write a program to print the area and perimeter of a triangle having sides of 3, 4 and 5 units by creating a class named 'Triangle' without any parameter in its constructor.</p>
<p><br>2) Write a program that would print the information (name, year of joining, salary, address) of three employees by creating a class named 'Employee'. <br>Create properties as needed for Employee class and set values to those properties using constructor with arguments.</p>
<p>The output should be as follows:</p>
<table border="0" style="width: 348px; height: 43px;">
<tbody>
<tr>
<td><strong><span data-mce-mark="1">Name</span></strong></td>
<td><strong><span data-mce-mark="1">Year of joining</span></strong></td>
<td><strong><span data-mce-mark="1">Address</span></strong></td>
</tr>
<tr>
<td><span data-mce-mark="1">Robert</span></td>
<td><span data-mce-mark="1">1994</span></td>
<td><span data-mce-mark="1">64C- WallsStreet</span></td>
</tr>
<tr>
<td><span data-mce-mark="1">Sam</span></td>
<td><span data-mce-mark="1">2000</span></td>
<td>Kathmandu</td>
</tr>
</tbody>
</table>
<p></p>
<p>3) Create a class 'Degree' having a method 'getDegree' that prints "I got a degree". It has two subclasses namely 'Undergraduate' and 'Postgraduate' each having a method with the same name that prints "I am an Undergraduate" and "I am a Postgraduate" respectively. Call the method by creating an object of each of the three classes.</p>
<p>Note: Use separate class with main method</p></div></div>

All it scrapes is the last p element instead of all of them.
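
Part of what you see may come from the XPath itself: p/text() returns only text nodes that are direct children of each <p>, so anything wrapped in child tags such as <span> or <strong> (as in the table above) is skipped. A minimal, untested sketch of pulling the full text of every paragraph instead, assuming the same no-overflow container:

# string(.) concatenates all descendant text of each <p>, so text inside
# child tags like <span> or <strong> is included as well
paragraphs = response.xpath('//*[@class="no-overflow"]/p')
texts = [p.xpath('string(.)').extract_first().strip() for p in paragraphs]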


1 Answer

If you want to combine the output for both links into a single item, you need to use request.meta (untested):

def parse_quotes(self, response):

    # first you need to get ALL the links you want to process
    your_links = response.xpath('//*[@class="event"]/a/@href').extract()

    first_link = your_links.pop(0)

    # and start processing from the very first link, carrying the rest in meta
    yield scrapy.Request(url=first_link, callback=self.parse_data, meta={"links": your_links})

def parse_data(self, response):

    # if we already have some item data from a previous link, continue with it
    item = response.meta.get("item", {})

    # below you need to parse the HTML and update your item; extend a list
    # instead of rebinding the variable, so every <p> text is kept rather
    # than only the last one
    data = response.xpath('//*[@class="no-overflow"]/p/text()').extract()
    item.setdefault("Data", []).extend(data)

    # now we need to check if we need to process other links
    if response.meta["links"]:
        next_link_url = response.meta["links"].pop(0)
        yield scrapy.Request(url=next_link_url, callback=self.parse_data,
                             meta={"links": response.meta["links"], "item": item})
    else:
        # no more links to process, just yield the combined output
        yield item
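
On newer Scrapy versions (1.7+), the same hand-off can be done with cb_kwargs instead of response.meta; the values arrive as keyword arguments of the callback. An equivalent, untested sketch of the answer above:

def parse_quotes(self, response):
    links = response.xpath('//*[@class="event"]/a/@href').extract()
    first_link = links.pop(0)
    # cb_kwargs entries are passed straight to parse_data as named parameters
    yield scrapy.Request(url=first_link, callback=self.parse_data,
                         cb_kwargs={"links": links, "item": {}})

def parse_data(self, response, links, item):
    item.setdefault("Data", []).extend(
        response.xpath('//*[@class="no-overflow"]/p/text()').extract())
    if links:
        yield scrapy.Request(url=links.pop(0), callback=self.parse_data,
                             cb_kwargs={"links": links, "item": item})
    else:
        yield item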
