Scraping the SpeechesUSA.com website

Posted 2024-10-03 02:37:39


I am trying to scrape the title links from speeches-usa.com. Here is my Python code:

import urllib2
from cookielib import CookieJar
from bs4 import BeautifulSoup

SPEECH_SOURCE = 'http://www.speeches-usa.com/'
PARSER_TYPE = 'html.parser'  # assumed value; not defined in the original post

def get_speeches():
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(SPEECH_SOURCE)
    soup = BeautifulSoup(p.read(), PARSER_TYPE)
    info = soup.find_all('a', class_='ListText')
    elements = []
    for element in info:
        elements.append(element)
    # Print at most the first five matches
    for i in xrange(0, min(len(elements), 5)):
        print elements[i]

(1) I am not sure what to pass to soup.find_all() to get the links. I tried elements.append(element.get_text()), but that produces the following output, which drops the links:

John Adams - Inaugural
        Address

Samuel Adams - American
        Independence

Spiro Agnew - Television
        News Coverage

Susan B. Anthony - Women's
        Right to Vote

(2) The results also seem incomplete; for example, Jane Adams is missing from the output below:

<a class="ListText" href="Transcripts/john_adams-inaugural.html">John Adams - Inaugural
        Address<br/>
</a>
0
<a class="ListText" href="Transcripts/samuel_adams-independence.html">Samuel Adams - American
        Independence<br/>
</a>
1
<a class="ListText" href="Transcripts/spiro_agnew-networknews.html">Spiro Agnew - Television
        News Coverage<br/>
</a>
2
<a class="ListText" href="Transcripts/susan_b_anthony-vote.html">Susan B. Anthony - Women's
        Right to Vote</a>
3
<a class="ListText" href="Transcripts/spiro_agnew-networknews.html"></a>
4

Any help and guidance would be greatly appreciated.


3 answers

Try this:

for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

The following should give you the full URLs:

import urllib2
from bs4 import BeautifulSoup
import urlparse


def get_speeches(input_url):
    p = urllib2.urlopen(input_url)
    soup = BeautifulSoup(p, 'html.parser')
    info = soup.find_all('a', class_='ListText')

    for element in info:
        print urlparse.urljoin(input_url, element['href'])

SOURCE_URL = 'http://speeches-usa.com'
get_speeches(SOURCE_URL)

element.get_text() does exactly what its name says: it returns the element's text. If you need an attribute, use square brackets, as in element['href'].
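
As a minimal illustration of that difference, the sketch below pulls both the href attribute and the visible text from each link in a small hand-written HTML fragment (modelled on the markup in the question, not fetched from the site). It is written for Python 3 and uses the standard library's html.parser rather than BeautifulSoup so that it runs without third-party packages; the class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open, if any
        self._chunks = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Collapse the line break and indentation inside the label.
            text = ' '.join(''.join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


SAMPLE_HTML = '''
<a class="ListText" href="Transcripts/john_adams-inaugural.html">John Adams - Inaugural
        Address<br/>
</a>
<a class="ListText" href="Transcripts/susan_b_anthony-vote.html">Susan B. Anthony - Women's
        Right to Vote</a>
'''

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
for href, text in collector.links:
    print(href, text, sep='\t')
```

The attribute carries the link target and the text carries the label, so keeping only the text is what made the links disappear in the question.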

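EDIT: A comment below points out that this misses some elements, because not every link has the ListText class. The code below finds all links, checks whether 'Transcripts' appears in each href (I am assuming the transcript links are the ones you want), and if so appends it to a list. The list may contain duplicates, so set() is used to print only the unique entries.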

import urllib2
from bs4 import BeautifulSoup
import urlparse


def get_speeches(input_url):
    p = urllib2.urlopen(url=input_url)
    soup = BeautifulSoup(p, 'html.parser')
    info = soup.find_all('a', href=True)

    all_transcripts = list()

    for element in info:
        if 'Transcripts' in element['href']:
            all_transcripts.append(urlparse.urljoin(input_url, element['href']))

    for transcript_url in set(all_transcripts):
        print transcript_url

SOURCE_URL = 'http://speeches-usa.com'
get_speeches(SOURCE_URL)
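
One side effect of set() is that the URLs may no longer print in page order. A possible variant, sketched here for Python 3 (where urljoin lives in urllib.parse), keeps the first occurrence of each link in its original position. The helper name unique_transcript_urls is hypothetical, and the sample hrefs are illustrative: the Transcripts entries follow the pattern shown in the question, while 'index.html' is made up to show a link being filtered out.

```python
from urllib.parse import urljoin


def unique_transcript_urls(base_url, hrefs):
    """Return absolute transcript URLs, de-duplicated, in first-seen order."""
    absolute = [urljoin(base_url, h) for h in hrefs if 'Transcripts' in h]
    # dict keys keep insertion order (Python 3.7+), so this drops repeats
    # without reordering the list the way set() can.
    return list(dict.fromkeys(absolute))


hrefs = [
    'Transcripts/john_adams-inaugural.html',
    'Transcripts/spiro_agnew-networknews.html',
    'index.html',                                # hypothetical non-transcript link
    'Transcripts/spiro_agnew-networknews.html',  # duplicate, as seen in the question
]
for url in unique_transcript_urls('http://speeches-usa.com', hrefs):
    print(url)
```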

import re

import bs4, requests

r = requests.get('http://speeches-usa.com/')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Links inside the width-925 table whose href contains '.html'
a_tags = soup.find('table', width="925").find_all('a', text=True, href=re.compile(r'\.html'))
for a in a_tags:
    link = a.get('href')
    text = a.get_text(strip=True).replace('\n        ', '')
    print(link, text, sep="\t\t")

Output:

Transcripts/susan_b_anthony-vote.html       Susan B. Anthony - Women'sRight to Vote
Transcripts/albert_beveridge-question.html      Albert J. Beveridge - ThePhillipine Question
Transcripts/william_jennings_bryan-cross.html       William Jennings Bryan - Crossof Gold
Transcripts/william_jennings_bryan-19002.html       William Jennings Bryan - 1900Democratic Presidential Acceptance
Transcripts/tony_blair-irish.html       Tony Blair - Addressto Irish Parliament
Transcripts/napolean_bonaparte-farewell.html        Napolean Bonaparte - Farewell to the Old Guard
Transcripts/sarah_brady-1996dnc.html        Sarah Brady - 1996DNC Keynote address
Transcripts/pat_buchanan-citadel.html       Pat Buchannan - Arepublic not an Empire
Transcripts/edmund_burke.html       Edumund Burke - Thedeath of Marie Antoinette
Transcripts/barbara_bush-1992rnc.html       Barbara Bush - 1992RNC Speech
Transcripts/barbara_bush-wellesley.html     Barbara Bush - WelleslyCollege
Transcripts/george_bush-somalia.html        George Bush - Conditionsin Somalia
Transcripts/george_bush-1991sou.html        George Bush - 1991State of the Union
Transcripts/george_bush-saudi.html      George Bush - Defenseof Saudi Arabia
Transcripts/george_w_bush-knoxville.html        George W. Bush - Anew approach
Transcripts/stokeley_carmichael-going.html      Stokley Carmichael - BlackPower
Transcripts/stokeley_carmichael-weaint.html     Stokley Carmichael - "Weain't goin'"
Transcripts/jimmy_carter-energy.html        Jimmy Carter - EnergyCrisis
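
Several titles in this output run two words together ("Women'sRight", "Crossof Gold") because replace('\n        ', '') deletes the line break and indentation without leaving a space behind. One way to avoid that, sketched here under the assumption that the raw link text contains a newline followed by indentation as in the question's markup, is to split on any whitespace and rejoin with single spaces. The function name normalize_title is hypothetical.

```python
def normalize_title(raw_text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return ' '.join(raw_text.split())


# Raw link text as it appears in the question's markup
raw = "Susan B. Anthony - Women's\n        Right to Vote"
print(normalize_title(raw))
```

This also does not depend on the exact number of indentation spaces, which the hard-coded replace() string does.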
