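<p><strong>Please excuse any mistakes; leave a comment if anything is unclear.</strong></p>
<p>I am trying to use a regular expression to extract the text of <code>h2</code> and bold (<code>b</code>) tags that starts with a number from several blogs, but the regex returns only the first word of each heading instead of the full title.</p>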
<pre><code> response.css('h2::text').re(r'\d+\.\s*\w+')
</code></pre>
<p>I don't know where I went wrong. The expected output is:</p>
<pre><code> the desired output is: [1. Golgappa at Chawla's and Nand's,2. Pyaaz
Kachori at Rawat Mishthan Bhandar,2. Pyaaz Kachori at Rawat Mishthan
Bhandar,4. Best of Indian Street Food at Masala Chowk,........ so on]
and [1. Keema Baati,2. Pyaaz Kachori ,3. Dal Baati Churma...so on]
</code></pre>
<p>What I actually get is:</p>
<pre><code>2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/robots.txt> (referer: None)
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/pages/street-food-in-jaipur-1483.html> (referer: None)
['1. Golgappa', '2. Pyaaz', '3. Masala', '4. Best', '5. Kaathi', '6. Pav', '7. Omelette', '8. Chicken', '9. Lassi', '10. Shrikhand', '11. Kulfi', '12. Sweets', '13. Fast', '14. Cold']
2021-08-17 05:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/> (referer: None)
['1. Keema', '2. Pyaaz', '3. Dal', '4. Shrikhand', '5. Ghewar', '6. Mawa', '7. Mirchi', '8. Gatte', '9. Rajasthani', '10. Laal']
2021-08-17 05:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
</code></pre>
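<p>A minimal sketch of why only the first word comes back, using the standard <code>re</code> module on sample strings copied from the expected output (no Scrapy involved). <code>\w+</code> matches one run of word characters and stops at the first space. The wider pattern <code>\d+\.\s*.+</code> is only an assumption about what is wanted, not a confirmed fix for these pages.</p>

```python
import re

# Sample heading texts copied from the expected output in the question;
# the last entry is a made-up non-numbered heading for contrast.
headings = [
    "1. Golgappa at Chawla's and Nand's",
    "2. Pyaaz Kachori at Rawat Mishthan Bhandar",
    "Not a numbered heading",
]

narrow = re.compile(r'\d+\.\s*\w+')  # pattern used in the question
wide = re.compile(r'\d+\.\s*.+')     # hypothetical alternative: run to end of line

# Collect every match of each pattern across all sample strings.
narrow_hits = [m.group() for h in headings for m in narrow.finditer(h)]
wide_hits = [m.group() for h in headings for m in wide.finditer(h)]

print(narrow_hits)  # ['1. Golgappa', '2. Pyaaz']
print(wide_hits)
```

<p>Since <code>.</code> does not match a newline by default, the wider pattern stops at the end of the line it starts on.</p>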
<p>A suggested regular expression would be a great help.</p>
<p>In case you want to look at the pages, these are the sites I am scraping:</p>
<p><a href="https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/" rel="nofollow noreferrer">https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/</a> and
<a href="https://www.holidify.com/pages/street-food-in-jaipur-1483.html" rel="nofollow noreferrer">https://www.holidify.com/pages/street-food-in-jaipur-1483.html</a></p>
<p>Here is my code, in case you want to see it:</p>
<pre><code>import scrapy
import re


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.tasteatlas.com','www.lih.travel','www.crazymasalafood.com','www.holidify.com','www.jaipurcityblog.com','www.trip101.com','www.adequatetravel.com']
    start_urls = ['https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
                  'https://www.holidify.com/pages/street-food-in-jaipur-1483.html'
                  ]

    def parse(self, response):
        # Try numbered h2 headings first, then fall back to numbered bold text.
        if response.css('h2::text').re(r'\d+\.\s*\w+'):
            print(response.css('h2::text').re(r'\d+\.\s*\w+'))
        elif response.css('b::text').re(r'\d+\.\s*\w+'):
            print(response.css('b::text').re(r'\d+\.\s*\w+'))
</code></pre>
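<p>The <code>h2</code>-then-<code>b</code> fallback in <code>parse</code> could be sketched as a plain function that works on lists of strings, so each selector is evaluated only once. The names <code>NUMBERED</code>, <code>numbered_titles</code>, and <code>pick_titles</code> are hypothetical, and the pattern (a number, a dot, then the rest of the line) is an assumption about the intent. In the spider, the inputs would presumably come from <code>response.css('h2::text').getall()</code> and <code>response.css('b::text').getall()</code>.</p>

```python
import re

# Hypothetical pattern: optional leading whitespace, "N.", then the rest of
# the line, with surrounding whitespace left out of the captured group.
NUMBERED = re.compile(r'^\s*(\d+\.\s*\S.*?)\s*$')


def numbered_titles(texts):
    """Return the texts that look like '1. Some title', stripped."""
    titles = []
    for text in texts:
        match = NUMBERED.match(text)
        if match:
            titles.append(match.group(1))
    return titles


def pick_titles(h2_texts, b_texts):
    """Prefer numbered h2 texts; fall back to numbered b texts."""
    return numbered_titles(h2_texts) or numbered_titles(b_texts)


# Usage with sample strings taken from the question's expected output.
print(pick_titles(["About Jaipur"], [" 1. Keema Baati", "2. Pyaaz Kachori "]))
```

<p>Note that <code>::text</code> selects only the direct text nodes of an element, so a heading whose text is wrapped in nested tags may need a different selector; that depends on each page's markup and is not verified here.</p>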