Removing an extraneous div tag from BeautifulSoup output

Posted 2024-10-01 00:28:26


I'm trying to scrape text from a website, but I can't figure out how to remove an extraneous div tag. The code looks like this:

import requests
from bs4 import BeautifulSoup

team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    for e in soup.find_all('br'):
        e.replace_with('\n')

    lyrics = soup.find(class_='dn')

    print(lyrics)

This prints the lyrics, but they are still wrapped in the enclosing div element.

I want to remove the div tag.


2 answers

Complete code:

import requests
from bs4 import BeautifulSoup

urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
        'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
        'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in urls:
    page = requests.get(url)
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')

    div = soup.select_one('#content_h')

    for e in div.find_all('br'):
        e.replace_with('\n')

    lyrics = div.text
    print(lyrics)
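
To see offline why this works, here is a minimal sketch on a made-up snippet. The markup is a hypothetical stand-in for the real page: only the id content_h and the class dn come from this thread, and putting both on the same element is my assumption. Printing the tag gives the markup including the div, while get_text() (the method behind .text) gives only the text inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lyrics page (not the site's real markup).
html = '<div id="content_h" class="dn">First line<br>Second line</div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("#content_h")

# Turn each <br> into a real newline, as in the answer above.
for br in div.find_all("br"):
    br.replace_with("\n")

as_markup = str(div)      # what print(div) shows: still wrapped in <div ...>
as_text = div.get_text()  # same as div.text: only the text inside the div

print(as_markup)
print(as_text)
```

If the selector matches nothing, select_one returns None, so a real script may want to check for that before calling find_all.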

Note that the wrong encoding is sometimes used:

I may be crazy donât mind me

That is why I set it manually: page.encoding = 'utf-8'. The relevant passage from the requests docs:

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
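
The garbled line above can be reproduced without any network access. This is a rough sketch, assuming the page is served as UTF-8, that the lyric used a curly apostrophe (U+2019), and that the text was decoded as ISO-8859-1 (Latin-1) instead. The sample string is my reconstruction, not taken from the site.

```python
# Assumed original text, with a curly apostrophe (U+2019).
original = "I may be crazy don\u2019t mind me"

# The bytes a UTF-8 server would send.
raw_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec yields mojibake.
wrong = raw_bytes.decode("latin-1")

# Decoding them as UTF-8 recovers the text, which is what
# setting page.encoding = 'utf-8' achieves for page.text.
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(right)
```

The three UTF-8 bytes of the apostrophe become an "â" followed by two invisible control characters, which matches the "donât" seen above.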

You can use a regular expression:

import requests
import re

from bs4 import BeautifulSoup

team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']

for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    for e in soup.find_all('br'):
        e.replace_with('\n')

    lyrics = soup.find(class_='dn')
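    # Match '<', then as few characters as possible, then '>' (i.e. one tag).
    # lyrics.text is already tag-free, so this is only a safety net.
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', lyrics.text)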

    print(cleantext)

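This removes everything from each < through the next >, using the special characters described in the Python documentation:

"
. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

? Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
"

From https://docs.python.org/3/library/re.html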
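
Here is a small sketch of what the pattern does when applied to raw markup. The markup string is made up for illustration. In '<.*?>' the ? follows *, which makes the repetition non-greedy, so each match stops at the first > it reaches. Without it, a single match runs to the last > on the line.

```python
import re

# Made-up markup for illustration only.
markup = '<div class="dn">I may be crazy</div>'

# Non-greedy: each match is one tag, so only the tags are removed.
non_greedy = re.sub(r"<.*?>", "", markup)

# Greedy: one match spans from the first '<' to the last '>',
# so the text between the tags is removed as well.
greedy = re.sub(r"<.*>", "", markup)

print(repr(non_greedy))
print(repr(greedy))
```

A regex like this is fine for simple, well-formed snippets. For arbitrary HTML, letting BeautifulSoup extract the text is the more robust choice.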
