(bs4) Trying to distinguish between different containers in an HTML page

Posted 2024-05-18 16:17:41


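I have a web page from the Houses of Parliament. It contains information about MPs' declared interests, and I would like to store all MP interests for a project I am considering.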

root = 'https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm'

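root is an example page. I would like my output to be a dictionary, since the interests fall under different subheadings and the entries under each subheading can be a list.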

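Problem: if you look at the page, the first interest (Employment and earnings) is not wrapped in a container. Instead, the heading is a <p xmlns="http://www.w3.org/1999/xhtml"> tag that is not connected to the text below it. I could call soup.find_all('p', {'xmlns': 'http://www.w3.org/1999/xhtml'}), but that returns the headings of the expenses and a few other headings, such as her name, and not the text below them. This makes it hard to iterate over the headings and store the information.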

What is the best way to iterate over the page and store each heading together with the information under it?


1 answer
User
#1 · posted 2024-05-18 16:17:41

Something like this might work:

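import urllib.request
from bs4 import BeautifulSoup

url = "https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm"
page = urllib.request.urlopen(url)
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'lxml')

ret = {}
key = None   # current heading; stays None until the first heading is seen
value = ""   # text collected for the current heading

for i in soup.find_all('p'):
    if i.find('strong'):
        # A <p> that contains <strong> is treated as a heading.
        # Store the previous section (nothing to store on the first pass).
        if key is not None:
            ret[key] = value
            value = ""
        key = i.text
    else:
        # Any other <p> is body text. Text that appears before the first
        # heading is carried into the first section's value.
        value = value + " " + i.text

# store the last section
if key is not None:
    ret[key] = value

for x in ret:
    print(x)
    print(ret[x])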

Output

4. Visits outside the UK
Name of donor: (1) Stop Aids (2) Aids Alliance Address of donor: (1) Grayston Centre, 28 Charles St, London N1 6HT (2) Preece House, 91-101 Davigdor Rd, Hove BN3 1RE Amount of donation (or estimate of the probable value): for myself and a member of staff, flights £2,784, accommodation £380.52, other travel costs £172, per diems £183; total £3,519.52. These costs were divided equally between both donors. Destination of visit: Uganda Date of visit: 11-14 November 2015 Purpose of visit: to visit the different organisations and charities (development) in regards to AIDS and HIV. (Registered 09 December 2015)Name of donor: Muslim Charities Forum Address of donor: 6 Whitehorse Mews, 37 Westminster Bridge Road, London SE1 7QD Amount of donation (or estimate of the probable value): for a member of staff and myself, return flights to Nairobi £5,170; one night's accommodation in Hargeisa £107.57; one night's accommodation in Borama £36.21; total £5,313.78 Destination of visit: Somaliland  Date of visit: 7-10 April 2016 Purpose of visit: to visit the different refugee camps and charities (development) in regards to the severe drought in Somaliland.  (Registered 18 May 2016)Name of donor: British-Swiss Chamber of Commerce     Address of donor: Bleicherweg, 128002, Zurich, Switzerland Amount of donation (or estimate of the probable value): flights £200.14; one night's accommodation £177, train fare Geneva to Zurich £110; total £487.14 Destination of visit: Geneva and Zurich, Switzerland Date of visit: 28-29 April 2016 Purpose of visit: to participate in a public panel discussion in Geneva in front of British-Swiss Chamber of Commerce, its members and guests. (Registered 18 May 2016) 
2. (b) Any other support not included in Category 2(a)
Name of donor: Ann Pettifor Address of donor: private Amount of donation or nature and value if donation in kind: £1,651.07 towards rent of an office for my mayoral campaign  Date received: 28 August 2015 Date accepted: 30 September 2015 Donor status: individual (Registered 08 October 2015)
1. Employment and earnings
Fees received for co-presenting BBC’s ‘This Week’ TV programme.  Address: BBC Broadcasting House, Portland Place, London W1A 1AA. (Registered 04 November 2013)14 May 2015, received £700. Hours: 3 hrs. (Registered 03 June 2015)4 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)18 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)16 July 2015, received £700. Hours: 3 hrs. (Registered 07 August 2015)8 January 2016, received £700 for an appearance on 17 December 2015. Hours: 3 hrs. (Registered 14 January 2016)28 July 2015, received £4,000 for taking part in Grant Thornton’s panel at the JLA/FD Intelligence Post-election event. Address: JLA, 14 Berners Street, London W1T 3LJ. Hours: 5 hrs. (Registered 07 August 2015)23rd October 2015, received £1,500 for co-presenting BBC’s "Have I Got News for You" TV programme. Address: Hat Trick Productions, 33 Oval Road Camden, London NW1 7EA. Hours: 5 hrs. (Registered 26 October 2015)10 October 2015, received £1,400 for taking part in a talk at the New Wolsey Theatre in Ipswich. Address: Clive Conway Productions, 32 Grove St, Oxford OX2 7JT. Hours: 5 hrs.  (Registered 26 October 2015)21 March 2016, received £4,000 via Speakers Corner (London) Ltd, Unit 31, Highbury Studios, 10 Hornsey Street, London N7 8EL, from Thompson Reuters, Canary Wharf, London E14 5EP, for speaking and consulting on a panel. Hours: 10 hrs. (Registered 06 April 2016)
Abbott, Ms Diane (Hackney North and Stoke Newington)


House of Commons



Session 2016-17

Publications on the internet
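
Since the question asks for a dictionary whose entries can be lists, a variation on the same idea is sketched below. It is only a sketch under assumptions: SAMPLE_HTML is a hypothetical, heavily simplified stand-in for the real page (so it runs without a network connection), group_by_heading is an illustrative name, and it uses the built-in 'html.parser' instead of 'lxml'. Like the code above, it assumes that a <p> containing <strong> is a heading and that every other <p> belongs to the heading before it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the register page (not the real markup).
SAMPLE_HTML = """
<p>House of Commons</p>
<p><strong>1. Employment and earnings</strong></p>
<p>14 May 2015, received £700. Hours: 3 hrs.</p>
<p>4 June 2015, received £700. Hours: 3 hrs.</p>
<p><strong>4. Visits outside the UK</strong></p>
<p>Destination of visit: Uganda</p>
"""


def group_by_heading(soup):
    """Map each heading paragraph to the list of paragraphs that follow it."""
    sections = {}
    current = None  # list for the current heading; None before the first heading
    for p in soup.find_all('p'):
        text = p.get_text(strip=True)
        if not text:
            continue  # skip empty paragraphs
        if p.find('strong'):
            # assumption: a <p> containing <strong> starts a new section
            current = sections.setdefault(text, [])
        elif current is not None:
            # body paragraph: one list entry per <p>;
            # paragraphs before the first heading are skipped
            current.append(text)
    return sections


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sections = group_by_heading(soup)
for heading, entries in sections.items():
    print(heading)
    for entry in entries:
        print('  -', entry)
```

Keeping one list entry per <p> avoids the run-together text in the output above, where one registration ends and the next begins without a separator. Unlike the code above, this version drops any text that appears before the first heading.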
