How to extract text with Python: pull out each link's URL and text, plus the text that follows the link after the br

<html> <body> GOVERNOR: <a href="http://governor.alabama.gov/"> Robert Bentley (R)* </a> - Ex-Morgan County Commissioner & State Correctional Officer <a href="http://www.facebook.com/stacy.george.3139"> Stacy George (R) </a> - Ex-Morgan County Commissioner & State Correctional Officer Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate <a href="http://www.bassforbama.com/"> Kevin Bass (D) </a> - Businessman & Ex-Pro Baseball Player <a href="http://www.parkergriffithforcongress.com/"> Parker Griffith (D) </a> - Ex-Congressman, Ex-State Sen., Physician & Ex-Republican </body> </html>

> Robert Bentley (R)* http://governor.alabama.gov/
> Stacy George (R) http://www.facebook.com/stacy.george.3139 - Ex-Morgan County Commissioner & State Correctional Officer
> Kevin Bass (D) http://www.bassforbama.com/ - Businessman & Ex-Pro Baseball Player
> Parker Griffith (D) http://www.parkergriffithforcongress.com/ - Ex-Congressman, Ex-State Sen., Physician & Ex-Republican

1 answer

User

Reply #1 · Posted 2024-10-01 05:04:22

Grab all the text nodes that follow each link, stopping at the next link:

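from itertools import takewhile
from bs4 import BeautifulSoup, NavigableString

# Assumption: `html` is a string holding the markup from the question
soup = BeautifulSoup(html, "html.parser")

def not_link(node):
    # Text nodes have no tag name, so default to None instead of raising
    return getattr(node, 'name', None) not in ('a', 'strong')

for link in soup.find_all("a"):
    print('Link contents:')
    text = link.text.strip()
    # Walk the siblings after this link until the next <a> or <strong>
    for sibling in takewhile(not_link, link.next_siblings):
        if isinstance(sibling, NavigableString):
            text += str(sibling).strip()
        else:
            text += sibling.text.strip()
    print(text)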

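For each link, this prints the link text followed directly by the text that comes after it, up to the next link.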
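The question also asks for each link's URL, which the loop above does not print. As a rough sketch of one way to collect all three pieces, the example below swaps BeautifulSoup for the standard library's `html.parser`, so it is an alternative technique rather than the answer's own method. The names `LinkCollector` and `extract_links` and the sample markup are made up for illustration. It assumes that any text between one link and the next belongs to the earlier link, so an entry with no link of its own (such as Bob Starkey in the question) ends up appended to the previous entry's text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the name, URL and trailing text of every <a> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = []     # one dict per link, in document order
        self.in_link = False  # True while the parser is inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Each link opens a new record; href is None if the attribute is absent.
            self.records.append(
                {"name": "", "url": dict(attrs).get("href"), "text": ""}
            )
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if not self.records:
            return  # text before the first link (e.g. "GOVERNOR:") is skipped
        key = "name" if self.in_link else "text"
        self.records[-1][key] += data


def extract_links(html):
    """Return a list of (name, url, trailing_text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        (r["name"].strip(), r["url"], " ".join(r["text"].split()))
        for r in parser.records
    ]


# Hypothetical sample, shortened from the markup in the question.
sample = (
    '<html><body>GOVERNOR: '
    '<a href="http://www.bassforbama.com/"> Kevin Bass (D) </a>'
    ' - Businessman &amp; Ex-Pro Baseball Player<br>'
    '<a href="http://www.parkergriffithforcongress.com/"> Parker Griffith (D) </a>'
    ' - Ex-Congressman</body></html>'
)

for name, url, text in extract_links(sample):
    print(name, url, text)
```

Tags other than `a`, including `br`, are ignored, so the sketch works whether or not the line breaks survive in the markup.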
