Regular expression to scrape all of the data inside a tag

Please excuse my mistakes, and feel free to add a comment if anything is unclear.

I am trying to use a regular expression to pull the data contained in h2 and bold tags that starts with a number from various blogs, but with the regex below I only get the first word of each title instead of the complete heading.

 response.css('h2::text').re(r'\d+\.\s*\w+')

I don't know where I am going wrong. The expected output is:

    the desired output is: [1. Golgappa at Chawla's and Nand's, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 3. Masala Chai at Gulab Ji Chaiwala, 4. Best of Indian Street Food at Masala Chowk, ... and so on]
    and [1. Keema Baati, 2. Pyaaz Kachori, 3. Dal Baati Churma, ... and so on]

What I am getting instead is:

2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/robots.txt> (referer: None)
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/pages/street-food-in-jaipur-1483.html> (referer: None)
['1. Golgappa', '2. Pyaaz', '3. Masala', '4. Best', '5. Kaathi', '6. Pav', '7. Omelette', '8. Chicken', '9. Lassi', '10. Shrikhand', '11. Kulfi', '12. Sweets', '13. Fast', '14. Cold']
2021-08-17 05:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/> (referer: None)
['1. Keema', '2. Pyaaz', '3. Dal', '4. Shrikhand', '5. Ghewar', '6. Mawa', '7. Mirchi', '8. Gatte', '9. Rajasthani', '10. Laal']
2021-08-17 05:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
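
For what it's worth, a quick check outside scrapy (on a made-up heading string) reproduces the same truncation, so the cut-off seems to come from the regex itself rather than from the selector:

    import re

    # hypothetical sample heading, just to test the regex in isolation
    heading = "1. Golgappa at Chawla's and Nand's"
    print(re.findall(r'\d+\.\s*\w+', heading))   # ['1. Golgappa'] -- \w+ stops at the first space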

It would be a great help if you could suggest a regular expression.

In case you want to visit them, these are the sites I am scraping:

https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/
https://www.holidify.com/pages/street-food-in-jaipur-1483.html

Here is my code, in case you want to see it:

import scrapy
import re

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.tasteatlas.com','www.lih.travel','www.crazymasalafood.com','www.holidify.com','www.jaipurcityblog.com','www.trip101.com','www.adequatetravel.com']

    start_urls = ['https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
                  'https://www.holidify.com/pages/street-food-in-jaipur-1483.html'
                  ]

    def parse(self, response):
        if response.css('h2::text').re(r'\d+\.\s*\w+'):
            print(response.css('h2::text').re(r'\d+\.\s*\w+'))

        elif response.css('b::text').re(r'\d+\.\s*\w+'):
            print(response.css('b::text').re(r'\d+\.\s*\w+'))

2 Answers

Here is another way of doing this with scrapy which, unlike the approach in the question and Fazlul's answer, does not separate the text of child nodes from the text of the parent node:

    def parse(self, response):
        r = re.compile(r'\d+\.')
        # get header texts:
        h2s = [e.xpath('string()').extract_first() for e in response.xpath('//h2')]
        nh2s = list(filter(r.match, h2s))       # get numbered headers
        if nh2s: print(nh2s)
        …
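
The elided part presumably continues with the same handling for the bold tags the question also targets. Here is a minimal sketch of the complete callback under that assumption; the string() XPath returns the concatenated text of a node and all of its descendants, which is why full titles survive even when a heading contains nested tags, whereas ::text only yields the direct text children:

    def parse(self, response):
        r = re.compile(r'\d+\.')
        # string() merges a node's own text with that of all its descendants,
        # so titles split across nested tags come back as one string
        h2s = [e.xpath('string()').extract_first() for e in response.xpath('//h2')]
        nh2s = list(filter(r.match, h2s))          # keep only numbered headers
        if nh2s:
            print(nh2s)
        else:
            # fall back to bold tags, mirroring the elif branch in the question
            bs = [e.xpath('string()').extract_first() for e in response.xpath('//b')]
            nbs = list(filter(r.match, bs))
            if nbs:
                print(nbs)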

This can be done with the newspaper library:

import re
from newspaper import Article
import nltk

urls = ['https://www.jaipurcityblog.com/9-iconic-famous-dishes-of-jaipur-that-you-have-to-try/',
        'https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
        'https://www.lih.travel/famous-foods-in-jaipur/',
        'https://www.holidify.com/pages/street-food-in-jaipur-1483.html']

extracted_data = []
for url in urls:
    site = Article(url)
    site.download()      # fetch the page
    site.parse()         # extract the article body into site.text
    site.nlp()
    data = site.text
    # keep every line that starts with a number followed by a dot
    matches = re.findall(r'\d+\.\s*[a-zA-Z]+.*', data)
    extracted_data.append(matches)
    print(matches)

Output:

['1. Dal Baati Churma', '2. Pyaaz Ki Kachori', '3. Gatte ki Sabji', '4. Mawa Kachori', '5. Kalakand', '6. Lassi', '7. Aam ki Launji', '8. Chokhani Kheer', '9. Mirchi Vada']
['1. Keema Baati', '2. Pyaaz Kachori', '3. Dal Baati Churma', '4. Shrikhand', '5. Ghewar', '6. Mawa Kachori', '7. Mirchi Bada', '8. Gatte Ki Subzi', '9. Rajasthani Thali', '10. Laal Maas']
['1. Rajasthani Thali (Plate) at Chokhi Dhani Village Resort', '2. Laal Maans at Handi', '3. Lassi at Lassiwala', '4. Anokhi Café for Penne Pasta & Cheese Cake', '5. Daal Baluchi at Baluchi Restaurant', '6. Pyaz Kachori at Rawat', '7. Chicken Lollipop at Niro’s', '8. Hibiscus Ice Tea at Tapri', '9. Omelet at Sanjay Omelette', '1981. This special egg eatery of Jaipur also treats some never tried before egg specialties. If you are an egg-fan with a sweet tooth, then this is your place. Slurp the “Egg Rabri” of Sanjay Omelette and feel the heavenly juice of eggs in your mouth. Appreciate the good taste of egg in never before way with just a visit to “Sanjay Omelette”.', '10. Paalak Paneer & Missi Roti at Sharma Dhabha']
["1. Golgappa at Chawla's and Nand's", '2. Pyaaz Kachori at Rawat Mishthan Bhandar', '3. Masala Chai at Gulab Ji Chaiwala', '4. Best of Indian Street Food at Masala Chowk', '5. Kaathi Roll at Al Bake', "6. Pav Bhaji at Pandit's", "7. Omelette at Sanjay's", '8. Chicken Tikka at Sethi Bar-Be-Que', '9. Lassi at Lassiwala', '10. Shrikhand at Falahaar', '11. Kulfi Faluda at Bapu Bazaar', '12. Sweets from Laxmi Mishthan Bhandar (LMB)', "13. Fast Food at Aunty's Cafe", '14. Cold Coffee at Gyan Vihar Dairy (GVD)']
