Please excuse my mistakes; if anything is unclear, please leave a comment.
I am trying to use a regex to extract the data contained in h2 and bold tags from various blogs, where the data starts with a number. But with my regex I only get the first word of each sentence, not the complete heading.
response.css('h2::text').re(r'\d+\.\s*\w+')
I don't know where I went wrong. The desired output is:

[1. Golgappa at Chawla's and Nand's, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 4. Best of Indian Street Food at Masala Chowk, ... and so on]
and
[1. Keema Baati, 2. Pyaaz Kachori, 3. Dal Baati Churma, ... and so on]
What I am getting is:
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/robots.txt> (referer: None)
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/pages/street-food-in-jaipur-1483.html> (referer: None)
['1. Golgappa', '2. Pyaaz', '3. Masala', '4. Best', '5. Kaathi', '6. Pav', '7. Omelette', '8. Chicken', '9. Lassi', '10. Shrikhand', '11. Kulfi', '12. Sweets', '13. Fast', '14. Cold']
2021-08-17 05:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/> (referer: None)
['1. Keema', '2. Pyaaz', '3. Dal', '4. Shrikhand', '5. Ghewar', '6. Mawa', '7. Mirchi', '8. Gatte', '9. Rajasthani', '10. Laal']
2021-08-17 05:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
It would be a great help if you could suggest a regex.
If you want to visit the sites, these are the ones I am scraping:
https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/ and https://www.holidify.com/pages/street-food-in-jaipur-1483.html
Here is my code, in case you want to see it:
import scrapy
import re

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.tasteatlas.com', 'www.lih.travel', 'www.crazymasalafood.com',
                       'www.holidify.com', 'www.jaipurcityblog.com', 'www.trip101.com',
                       'www.adequatetravel.com']
    start_urls = ['https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
                  'https://www.holidify.com/pages/street-food-in-jaipur-1483.html']

    def parse(self, response):
        if response.css('h2::text').re(r'\d+\.\s*\w+'):
            print(response.css('h2::text').re(r'\d+\.\s*\w+'))
        elif response.css('b::text').re(r'\d+\.\s*\w+'):
            print(response.css('b::text').re(r'\d+\.\s*\w+'))
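A likely fix (a sketch, assuming each title runs to the end of its h2 text): `\w+` matches only word characters, so it stops at the first space after the number; replacing it with `.+` keeps matching to the end of the heading. The behaviour can be checked outside Scrapy with plain `re`:

```python
import re

# Sample heading shaped like the ones on the scraped pages
heading = "1. Golgappa at Chawla's and Nand's"

# Original pattern: \w+ stops at the first non-word character
print(re.findall(r'\d+\.\s*\w+', heading))  # ['1. Golgappa']

# Suggested pattern: .+ matches up to the end of the line
print(re.findall(r'\d+\.\s*.+', heading))   # ["1. Golgappa at Chawla's and Nand's"]
```

In the spider this would become `response.css('h2::text').re(r'\d+\.\s*.+')`.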
Below is another way of doing this with scrapy, different from the approach in the question. Unlike Fazlul's answer, it does not separate the text in child nodes from the text in the parent node; this can also be done with the `newspaper` library.
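To illustrate the child-node issue mentioned above (a standalone sketch using only the standard library; the HTML snippet is a hypothetical example, since some blogs split a heading across nested tags, where `h2::text` alone would miss the nested part):

```python
from html.parser import HTMLParser

# Hypothetical heading whose title is partly inside a child <b> tag
html = "<h2>1. Pyaaz Kachori at <b>Rawat Mishthan Bhandar</b></h2>"

class H2Text(HTMLParser):
    """Collects all text inside <h2>, including text in child tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside an <h2>
        self.headings = []
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h2" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.headings.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self.depth:
            self._parts.append(data)

parser = H2Text()
parser.feed(html)
print(parser.headings)   # ['1. Pyaaz Kachori at Rawat Mishthan Bhandar']
```

In Scrapy itself, the usual idiom is to iterate over the `h2` selectors and join all descendant text, e.g. `''.join(sel.css('::text').getall())` for each `sel` in `response.css('h2')`.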