<p><strong>Complete code:</strong></p>
<pre><code>import requests
from bs4 import BeautifulSoup

urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
        'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
        'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in urls:
    page = requests.get(url)
    # Override the encoding guessed from the HTTP headers (see below).
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')
    # The lyrics live in the element with id="content_h".
    div = soup.select_one('#content_h')
    # Turn each &lt;br&gt; into a newline so the line breaks survive in .text.
    for e in div.find_all('br'):
        e.replace_with('\n')
    lyrics = div.text
    print(lyrics)
</code></pre>
<p>Note that the wrong encoding is sometimes used, which garbles the text:</p>
<blockquote>
<p>I may be crazy donât mind me</p>
</blockquote>
<p>That is why I set it manually: <code>page.encoding = 'utf-8'</code>. The relevant passage from the <a href="http://docs.python-requests.org/en/master/api/#requests.Response.text" rel="nofollow noreferrer">requests docs</a>:</p>
<blockquote>
<p>The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.</p>
</blockquote>
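<p>The garbled output above can be reproduced offline. This is a rough sketch that assumes the page body is UTF-8 but gets decoded as ISO-8859-1, which is the usual cause of this kind of mojibake. The sample string and the <code>decode_body</code> helper are made up for illustration and are not part of requests.</p>

```python
def decode_body(raw: bytes, encoding: str) -> str:
    """Hypothetical helper: decode raw response bytes with a given encoding."""
    return raw.decode(encoding)

# Sample lyric line with a typographic apostrophe (U+2019).
original = "I may be crazy don\u2019t mind me"

# What the server would send if the page is UTF-8 encoded.
raw = original.encode("utf-8")

# Decoding with the wrong encoding splits the 3-byte apostrophe into
# three separate characters, the first of which is "â".
garbled = decode_body(raw, "iso-8859-1")

# Decoding as UTF-8 (the effect of page.encoding = 'utf-8') restores the text.
fixed = decode_body(raw, "utf-8")

print(repr(garbled))
print(repr(fixed))
```

<p>If you do not know the encoding in advance, <code>page.apparent_encoding</code> gives a guess based on the body content, though it can be wrong on short or ambiguous input.</p>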