Trouble scraping tags and a data list that contains links

Posted 2024-06-02 10:38:45


Here is a sample of the HTML I am scraping with Python/BeautifulSoup:

<dl>
<dd>
    <strong>
        <a name="45790" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45790">Monthly 18000rmb ESL teachers for Shanghai Webi centers</a>
    </strong>
    <br>
    Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
</dd>

<dd></dd>
<dd></dd>
<dd></dd>
</dl>

I can scrape the <a href>, but despite trying several different loops, I still cannot get the text that follows the <br>.

Here is my program:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.eslcafe.com/jobs/china/').read()

soup = bs.BeautifulSoup(sauce, 'html.parser')

dl = soup.dl

ads = []

for words in dl.find_all('a'):
    links = words.get('href')
    link_text = words.text
    link_text = link_text.lower()

    if 'university' in link_text:
        ads.append([links, link_text])

    if 'universities' in link_text:
        ads.append([links, link_text])

    if 'college' in link_text:
        ads.append([links, link_text])

    if 'colleges' in link_text:
        ads.append([links, link_text])

for ad in ads:
    for job in ad:
        print(job)
        print("")             

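There is also a problem with duplicates being added to the list when the text contains more than one of my search terms, but I can deal with that later.

I think what I want is a list of lists, each containing link, link_text, and date_text: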
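One possible way to avoid the duplicates is to test all the search terms in a single condition, so that each link is appended at most once. The sketch below is only an illustration: the `KEYWORDS` tuple, the `matches_keyword` helper, and the sample titles are hypothetical names and data, not part of the original program.

```python
# Hypothetical search terms, taken from the four separate `if` checks above.
KEYWORDS = ("university", "universities", "college", "colleges")


def matches_keyword(link_text):
    """Return True if any search term occurs in the (lower-cased) text."""
    lowered = link_text.lower()
    return any(word in lowered for word in KEYWORDS)


# Made-up titles, used only to show that a title matching several
# terms is still added a single time.
titles = [
    "University and College Teachers Needed",
    "Kindergarten ESL Teacher",
    "Colleges in Shanghai hiring now",
]

ads = [title for title in titles if matches_keyword(title)]
print(ads)
```

Because `any()` stops at the first match, a title that contains both "university" and "college" produces one entry rather than two.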

ads = [[link, link_text, date_text], [link, link_text, date_text]]

Right now, I can only get the link and the link text.

Any suggestions?


2 Answers
In [31]: for dd in soup.find_all('dd'):
    ...:     link = dd.a.get('href')
    ...:     link_text = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings  # keep only the last text node
    ...:     print(link, link_text, dd_text, sep='\n')

Output:

http://www.eslcafe.com/jobs/china/index.cgi?read=45391
Teach English in Shenyang, China: Great salary, Support, and Structured program
Greenheart Travel   Thursday, 9 February 2017, at 1:05 p.m.

dd_text is the last text node of the dd tag, so I use *_ to collect all the text nodes that come before it.

Edit:

In [20]: for dd in soup.find_all('dd'):
    ...:     
    ...:     d = {} # store data in a dict
    ...:     d['link'] = dd.a.get('href')
    ...:     d['link_text'] = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
    ...:     d['date_text'] = dd_text
    ...:     print(d)

Output:

{'date_text': 'EnglishTeacherChina.com   Sunday, 12 February 2017, at 1:45 '
              'p.m.',
 'link': 'http://www.eslcafe.com/jobs/china/index.cgi?read=45426',
 'link_text': '❤ ❤ ❤ Teach English In China 12,000-20,000 RMB/month - Adults '
              'or Kids - Free Housing & Airfare - Free TEFL TESOL '
              'Certification - Where You Want - YOUR NEEDS ARE OUR TOP '
              'PRIORITY ❤ ❤ ❤'}
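If the list-of-lists shape from the question is preferred over a dict, the same loop can be adapted. The following is a rough sketch rather than a definitive version: it runs against a trimmed copy of the sample markup instead of the live page, and the `dd.a is None` guard is an added assumption to skip empty `<dd>` tags such as the ones in the sample.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question.
html = """
<dl>
<dd>
    <strong>
        <a name="45790" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45790">Monthly 18000rmb ESL teachers for Shanghai Webi centers</a>
    </strong>
    <br>
    Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
</dd>
<dd></dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

ads = []
for dd in soup.find_all("dd"):
    if dd.a is None:
        # Empty <dd> with no link: nothing to record.
        continue
    strings = list(dd.stripped_strings)
    date_text = strings[-1]  # last text node inside this <dd>
    ads.append([dd.a.get("href"), dd.a.get_text(strip=True), date_text])

print(ads)
```

Each inner list then follows the `[link, link_text, date_text]` layout described in the question.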

You can use contents:

import bs4
soup = bs4.BeautifulSoup('<dl> .... </dl>') # your markup  
print(soup.br.contents[0])

This gives:

Webi English Shanghai   Tuesday, 7 March 2017, at 2:17 p.m.
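Whether `soup.br.contents[0]` works depends on how the parser treats `<br>`. Recent Beautiful Soup releases usually parse it as an empty (void) element, in which case `contents` is an empty list and the text is the tag's next sibling instead. A hedged sketch that covers both cases, using a shortened copy of the sample markup (the fallback logic is an assumption, not part of the answer above):

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample <dd> from the question.
html = (
    '<dd><strong><a href="http://www.eslcafe.com/jobs/china/index.cgi?read=45790">'
    "Monthly 18000rmb ESL teachers for Shanghai Webi centers</a></strong>"
    "<br>Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.</dd>"
)

soup = BeautifulSoup(html, "html.parser")
br = soup.br

# If the parser put the text inside <br>, read it from there;
# otherwise take the node that immediately follows the tag.
node = br.contents[0] if br.contents else br.next_sibling
date_text = str(node).strip()

print(date_text)
```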
