Scraping the SpeechesUSA.com website

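import urllib2
from cookielib import CookieJar
from bs4 import BeautifulSoup

SPEECH_SOURCE = 'http://www.speeches-usa.com/'
PARSER_TYPE = 'html.parser'  # assumed value; not shown in the original question


def get_speeches():
    # Open the page with a cookie-aware opener
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(SPEECH_SOURCE)
    soup = BeautifulSoup(p.read(), PARSER_TYPE)
    # Collect every link that carries the ListText class
    info = soup.find_all('a', class_='ListText')
    elements = []
    for element in info:
        elements.append(element)
    # Print the first five matches
    for i in range(0, min(len(elements), 5)):
        print elements[i]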

<a class="ListText" href="Transcripts/john_adams-inaugural.html">John Adams - Inaugural Address<br/> </a> 0
<a class="ListText" href="Transcripts/samuel_adams-independence.html">Samuel Adams - American Independence<br/> </a> 1
<a class="ListText" href="Transcripts/spiro_agnew-networknews.html">Spiro Agnew - Television News Coverage<br/> </a> 2
<a class="ListText" href="Transcripts/susan_b_anthony-vote.html">Susan B. Anthony - Women's Right to Vote</a> 3
<a class="ListText" href="Transcripts/spiro_agnew-networknews.html"></a> 4

3 Answers

User

#1 · Edited 2024-10-03 02:37:39

Try this:

for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

User

#2 · Edited 2024-10-03 02:37:39

The following should give you the full URLs:

import urllib2
from bs4 import BeautifulSoup  # find_all() and class_ are bs4 APIs
import urlparse


def get_speeches(input_url):
    p = urllib2.urlopen(input_url)
    soup = BeautifulSoup(p, 'html.parser')
    info = soup.find_all('a', class_='ListText')

    for element in info:
        print urlparse.urljoin(input_url, element['href'])

SOURCE_URL = 'http://speeches-usa.com'
get_speeches(SOURCE_URL)
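
For reference, here is a small sketch of how urljoin resolves the relative href values against the page URL. It uses the Python 3 location of the function (urllib.parse; in Python 2 it is urlparse.urljoin, as above). The hrefs are copied from the question's sample output rather than fetched live, and absolutize is just an illustrative helper name.

```python
from urllib.parse import urljoin

BASE_URL = 'http://speeches-usa.com'

# Relative hrefs as they appear in the question's sample output.
hrefs = [
    'Transcripts/john_adams-inaugural.html',
    'Transcripts/samuel_adams-independence.html',
]


def absolutize(base, links):
    """Resolve each relative link against the base URL."""
    return [urljoin(base, link) for link in links]


full_urls = absolutize(BASE_URL, hrefs)
for url in full_urls:
    print(url)
```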

element.get_text() does exactly what it says: it gets the element's text. If you need an attribute, use square brackets, as in element['href'].

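EDIT: As the comment below points out, this misses some elements, because not all of the links have the ListText class. The code below finds every link, checks whether 'Transcripts' appears in its href (I'm assuming the links to transcripts are what you want), and if so appends it to a list. The list may contain duplicates, so set() is used to print only the unique entries.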

import urllib2
from bs4 import BeautifulSoup  # find_all() is a bs4 API
import urlparse


def get_speeches(input_url):
    p = urllib2.urlopen(url=input_url)
    soup = BeautifulSoup(p, 'html.parser')
    info = soup.find_all('a', href=True)

    all_transcripts = list()

    for element in info:
        if 'Transcripts' in element['href']:
            all_transcripts.append(urlparse.urljoin(input_url, element['href']))

    for transcript_url in set(all_transcripts):
        print transcript_url

SOURCE_URL = 'http://speeches-usa.com'
get_speeches(SOURCE_URL)
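
One thing to be aware of with set() is that it does not keep the links in page order. If order matters, a possible variation is to deduplicate with dict.fromkeys, which keeps the first occurrence of each URL. The sketch below is Python 3 and works on a hard-coded list of hrefs instead of a live request. The transcript hrefs come from the question's output, the about.html entry is made up for illustration, and unique_transcripts is a hypothetical name, not part of the answer's code.

```python
from urllib.parse import urljoin


def unique_transcripts(base_url, hrefs):
    """Return absolute transcript URLs, deduplicated, in first-seen order."""
    matches = [urljoin(base_url, h) for h in hrefs if 'Transcripts' in h]
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(matches))


sample_hrefs = [
    'Transcripts/john_adams-inaugural.html',
    'Transcripts/spiro_agnew-networknews.html',
    'about.html',  # hypothetical non-transcript link, for illustration only
    'Transcripts/spiro_agnew-networknews.html',  # duplicate, as in the question
]

result = unique_transcripts('http://speeches-usa.com', sample_hrefs)
for url in result:
    print(url)
```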

User

#3 · Edited 2024-10-03 02:37:39

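import re  # needed for re.compile below

import bs4, requests
r = requests.get('http://speeches-usa.com/')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Only links inside the main table that have text and point at an .html page
a_tags = soup.find('table', width="925").find_all('a', text=True, href=re.compile(r'\.html'))
for a in a_tags:
    link = a.get('href')
    # Drop the line break and indentation that the page puts inside link text
    text = a.get_text(strip=True).replace('\n        ', '')
    print(link, text, sep="\t\t")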

Output:

Transcripts/susan_b_anthony-vote.html       Susan B. Anthony - Women'sRight to Vote
Transcripts/albert_beveridge-question.html      Albert J. Beveridge - ThePhillipine Question
Transcripts/william_jennings_bryan-cross.html       William Jennings Bryan - Crossof Gold
Transcripts/william_jennings_bryan-19002.html       William Jennings Bryan - 1900Democratic Presidential Acceptance
Transcripts/tony_blair-irish.html       Tony Blair - Addressto Irish Parliament
Transcripts/napolean_bonaparte-farewell.html        Napolean Bonaparte - Farewell to the Old Guard
Transcripts/sarah_brady-1996dnc.html        Sarah Brady - 1996DNC Keynote address
Transcripts/pat_buchanan-citadel.html       Pat Buchannan - Arepublic not an Empire
Transcripts/edmund_burke.html       Edumund Burke - Thedeath of Marie Antoinette
Transcripts/barbara_bush-1992rnc.html       Barbara Bush - 1992RNC Speech
Transcripts/barbara_bush-wellesley.html     Barbara Bush - WelleslyCollege
Transcripts/george_bush-somalia.html        George Bush - Conditionsin Somalia
Transcripts/george_bush-1991sou.html        George Bush - 1991State of the Union
Transcripts/george_bush-saudi.html      George Bush - Defenseof Saudi Arabia
Transcripts/george_w_bush-knoxville.html        George W. Bush - Anew approach
Transcripts/stokeley_carmichael-going.html      Stokley Carmichael - BlackPower
Transcripts/stokeley_carmichael-weaint.html     Stokley Carmichael - "Weain't goin'"
Transcripts/jimmy_carter-energy.html        Jimmy Carter - EnergyCrisis
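
The fused words in this output (for example "Women'sRight") seem to come from the .replace() call, which removes the line break and indentation inside the link text without leaving a space behind. A minimal sketch of one alternative is to collapse every run of whitespace to a single space. Here normalize_space is a hypothetical helper name, and the sample string assumes the link text is wrapped across two lines the way the output suggests.

```python
def normalize_space(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) to one space."""
    return ' '.join(text.split())


# Assumed shape of the raw link text: wrapped across two lines with indentation.
raw = "Susan B. Anthony - Women's\n        Right to Vote"

cleaned = normalize_space(raw)
print(cleaned)
```

In the loop above, this would stand in for the .replace(...) call, for instance text = normalize_space(a.get_text()).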
