Scraping table elements from a web page with Python

import urllib2
from bs4 import BeautifulSoup

htmla = urllib2.urlopen('http://www.basketball-reference.com/teams/CHO/2017.html')
bsObja = BeautifulSoup(htmla, "html.parser")
tables = bsObja.find_all("table")

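3 answers

User

Answer 1 · edited 2024-09-30 08:34:17

The data sits inside a Comment object, and in BeautifulSoup a Comment is just a special type of NavigableString. What you need to do is:

1. Find the string that contains the information
2. Use BeautifulSoup to convert that string into a BS object
3. Extract the data from the BS object

Code: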

import re
table_string = soup.find(string=re.compile('div_team_misc'))

This returns the string that contains the table's HTML code.

table = BeautifulSoup(table_string, "html.parser")

Construct a BS object from the string, then extract the data from that object:

for tr in table.find_all('tr', class_=False):
    s = [td.string for td in tr('td')]
    print(s)

Output:

['17', '13', '2.17', '-0.51', '1.66', '106.9', '104.7', '96.5', '.300', '.319', '.493', '10.9', '20.5', '.228', '.501', '11.6', '79.6', '.148', 'Spectrum Center', '269,471']
['10', '9', '8', '24', '10', '17', '5', '15', '4', '11', '22', '1', '27', '5', '12', '28', '3', '1', None, '15']
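
To try the three steps without touching the network, here is a minimal sketch that runs them against a made-up HTML fragment. `SAMPLE_HTML` and its cell values are assumptions that only imitate the structure described above (a table wrapped in an HTML comment); the real page may differ.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment imitating the page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_team_misc">
<div class="placeholder"></div>
<!--
<div id="div_team_misc">
<table id="team_misc">
<tr class="over_header"><th>Team</th></tr>
<tr><th>Team</th><td>17</td><td>13</td></tr>
<tr><th>Lg Rank</th><td>10</td><td>9</td></tr>
</table>
</div>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: find the string (here a Comment) that contains the table markup.
table_string = soup.find(string=re.compile("div_team_misc"))

# Step 2: parse that string as its own document.
table = BeautifulSoup(table_string, "html.parser")

# Step 3: keep only rows without a class attribute and read their <td> cells.
rows = [[td.string for td in tr("td")] for tr in table.find_all("tr", class_=False)]
print(rows)
```

The `class_=False` filter skips the header row here because that row carries a class, which mirrors the loop in the answer.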

More about comments:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string

A Comment object is just a special type of NavigableString. BS extracts the string from it for us, so we don't need to change or replace any HTML.

comment
# 'Hey, buddy. Want to buy a used parser?'

Based on this, we can extract the comment with plain BS instead of re:

table_string = soup.find(id="all_team_misc").contents[-2]

If you want to find all of the table strings, you can do the following:

from bs4 import Comment
tables = soup.find_all(string=lambda text: isinstance(text, Comment) and str(text).startswith('  \n'))
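
As a self-contained illustration of collecting every hidden table, the sketch below filters Comment nodes and parses each one. It is only a sketch under assumptions: `SAMPLE_HTML` is invented, `is_table_comment` is a hypothetical helper, and it tests for the substring `<table` instead of the leading-whitespace check used above.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: one ordinary comment and two comments that hide tables.
SAMPLE_HTML = """
<div><!-- just a note --></div>
<div><!--
<table id="a"><tr><td>1</td></tr></table>
--></div>
<div><!--
<table id="b"><tr><td>2</td></tr></table>
<table id="c"><tr><td>3</td></tr></table>
--></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def is_table_comment(text):
    """Hypothetical filter: True for Comment nodes that contain table markup."""
    return isinstance(text, Comment) and "<table" in text


hidden_tables = []
for comment in soup.find_all(string=is_table_comment):
    # Each comment body is parsed as its own small document.
    hidden_tables.extend(BeautifulSoup(comment, "html.parser").find_all("table"))

table_ids = [t["id"] for t in hidden_tables]
print(table_ids)
```

Checking for `<table` is less dependent on the exact whitespace inside the comment, but it would also match a comment that merely mentions the tag, so the filter may need tightening for a real page.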

User

Answer 2 · edited 2024-09-30 08:34:17

I think you want:

tables = bsObja.findAll("table")

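User

Answer 3 · edited 2024-09-30 08:34:17

This page hides all of its tables inside comments. JavaScript uses those comments to display the tables, and possibly sorts or filters them before display.

Every comment comes right after a <div class='placeholder'>, so you can use that div to locate the comment, take all of the text from it, and parse that text with BS.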

#!/usr/bin/env python3

#import urllib.request
import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.basketball-reference.com/teams/CHO/2017.html'

#html = urllib.request.urlopen(url)
html = requests.get(url).text

soup = BS(html, 'html.parser')

placeholders = soup.find_all('div', {'class': 'placeholder'})

total_tables = 0

for x in placeholders:
    # get elements after placeholder and join in one string
    comment = ''.join(x.next_siblings)

    # parse comment
    soup_comment = BS(comment, 'html.parser')

    # search table in comment
    tables = soup_comment.find_all('table')

    # ... do something with table ...

    #print(tables)

    total_tables += len(tables)

print('total tables:', total_tables)

This way I found 11 tables hidden in comments.
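
The `# ... do something with table ...` step is left open in the script. One possible way to fill it, shown here only as a sketch, is to turn each table into a list of dicts keyed by header text. `table_to_records` is a hypothetical helper, `SAMPLE_TABLE` is invented data, and the code assumes the first `<tr>` of the table is the header row, which may not hold for every table on the real page.

```python
from bs4 import BeautifulSoup

# Made-up table standing in for one parsed out of a comment.
SAMPLE_TABLE = """
<table id="roster">
<thead><tr><th>Player</th><th>Pos</th></tr></thead>
<tbody>
<tr><td>Player A</td><td>PG</td></tr>
<tr><td>Player B</td><td>C</td></tr>
</tbody>
</table>
"""


def table_to_records(table):
    """Return one dict per data row, keyed by the text of the header cells.

    Assumes the first <tr> in the table is the header row.
    """
    headers = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
    records = []
    for tr in table.find_all("tr")[1:]:
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if cells:  # skip empty spacer rows
            records.append(dict(zip(headers, cells)))
    return records


table = BeautifulSoup(SAMPLE_TABLE, "html.parser").find("table")
records = table_to_records(table)
print(records)
```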
