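I am using Scrapy to extract information.
Edit: I tried to extract the text this way, but it returns nothing: response.xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div/div[2]/dl[1]/dd/span")
If anyone wants to reproduce this, just copy and paste the code below and run it. Any page will do; I only need to extract this information.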
import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.crawler import CrawlerProcess
import googletrans
# from googletrans import Translator
from translate import Translator


class Myspider(SitemapSpider):
    name = 'spidername'
    sitemap_urls = ['https://www.arabam.com/sitemap/otomobil_1.xml']
    sitemap_rules = [
        ('/otomobil/', 'parse'),
        # ('/category/', 'parse_category'),
    ]

    def parse(self, response):
        for td in response.xpath("/html/body/div[3]/div[6]/div[4]/div/div[2]/table/tbody/tr/td[4]/div/a/@href").extract():
            # / html / body / div[3] / div[6] / div[4] / div / div[2] / table / tbody / tr / td[4] / div / a
            checks = str(td.split("/")[3]).split("-")
            for items in checks:
                if items.isdigit():
                    if int(items) > 2001:
                        url = "https://www.arabam.com/" + td
                        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        ## some other stuff im scraping
        overview1 = response.xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div/div[2]/dl[1]/dd/span")
        print(response)
        print("s" + str(overview1))


process = CrawlerProcess({
})
process.crawl(Myspider)
process.start()  # the script will block here until the crawling is finished
Edit: The expected output is exactly these key-value pairs.
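To make the goal concrete, here is a minimal sketch of collecting label/value pairs from `dl` blocks with a relative path instead of a long absolute one. It uses the standard-library `xml.etree.ElementTree` on a small hand-written snippet, not Scrapy and not the real page. The markup in `SNIPPET`, the `damage-info` class, the `dt` label elements, and the `extract_pairs` helper are all assumptions for illustration; only the `dl/dd/span` nesting comes from the XPath in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the part of the page that holds
# the pairs. The real page structure is not known from the question.
SNIPPET = """
<div class="damage-info">
  <dl><dt>Ön Tampon:</dt><dd><span>Orijinal</span></dd></dl>
  <dl><dt>Arka Tampon:</dt><dd><span>Orijinal</span></dd></dl>
</div>
"""


def extract_pairs(markup):
    """Return {label: value} for every <dl> found anywhere in the markup."""
    root = ET.fromstring(markup)
    pairs = {}
    # root.iter("dl") walks the whole tree, so the result does not depend
    # on how many wrapper <div> elements sit above the <dl> blocks.
    for dl in root.iter("dl"):
        label = dl.findtext("dt", default="").strip().rstrip(":")
        value = dl.findtext("dd/span", default="").strip()
        if label:
            pairs[label] = value
    return pairs


print(extract_pairs(SNIPPET))
```

If the real markup has a similar shape, the same idea (select every `dl`, then read its children relative to it) could be tried with Scrapy selectors. This sketch cannot help if the elements are missing from the HTML that Scrapy downloads.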
Edit: Using the tag suggested in the answer, I get the following:
[......or Kaputu: ', ' Orijinal ', ' ', 'Sol Ön Çamurluk: ', ' Boyanmış ', ' ', 'Ön Tampon: ', ' Orijinal ', ' ', 'Arka Tampon: ', ' Orijinal ', ' ', 'Belirtilmemiş', 'Orijinal', 'Boyalı', 'Değişmiş', ' ', ' ', ' Tramer tutarı yok ', ' ', ' ', ' ', 'ARAÇ BİLGİLERİ', ' ', ' ', 'DONANIM', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', ' ', 'KREDİ', ' ', ' ', 'SPONSORLU BAĞLANTILAR', " googletag.cmd.push(function () { googletag.display('div-gpt-ad-1547030262883-0'); }); ", " googletag.cmd.push(function () { googletag.display('div-gpt-ad-1547030358839-0'); }); "]
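The list above still contains the wanted pairs, mixed with whitespace-only strings and unrelated text. As a rough sketch, the labels ending in a colon can be paired with the token that follows them. The `pair_labels` helper and the `sample` list are hypothetical; `sample` reuses a few tokens copied from the output above, and the pairing rule (a label is any token ending in ":") is an assumption about this data.

```python
def pair_labels(tokens):
    """Pair each 'Label:' token with the next non-empty token."""
    # str.strip() also removes non-breaking spaces ('\xa0'), so tokens that
    # contain only whitespace disappear here.
    cleaned = [t.strip() for t in tokens if t.strip()]
    pairs = {}
    i = 0
    while i < len(cleaned) - 1:
        current, following = cleaned[i], cleaned[i + 1]
        # Treat a token as a label only if the next token is not a label too.
        if current.endswith(":") and not following.endswith(":"):
            pairs[current.rstrip(":").strip()] = following
            i += 2
        else:
            i += 1
    return pairs


# A few tokens taken from the output shown above.
sample = ['Sol Ön Çamurluk: ', ' Boyanmış ', ' ', 'Ön Tampon: ', ' Orijinal ',
          ' ', 'Arka Tampon: ', ' Orijinal ', ' ', 'Belirtilmemiş', '\xa0', ' ']

print(pair_labels(sample))
```

Tokens without a trailing colon, such as the legend entries and the ad script strings, are skipped.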
Edit: I have tried this as well, but still no luck:
element = d.find_element_by_xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div")
d.execute_script("arguments[0].scrollIntoView();", element)
element = d.find_element_by_xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div")
print(element)
overview1 = element.text
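Edit: The element sits in the middle of the page, so it does not scroll into view. Is there a way to scroll to the bottom first and then back to the middle? I tried this code, but it does not work: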
element = d.find_element_by_xpath('/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div') # you can use ANY way to locate element
coordinates = element.location_once_scrolled_into_view # returns dict of X, Y coordinates
d.execute_script('window.scrollTo({}, {});'.format(coordinates['x'], coordinates['y']))
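I wrote the following code with Selenium to test the XPath (I have not used Scrapy before):
This gives the following output:
The output is what you highlighted in the picture, so I suggest you use the following path: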