I need to scrape data between <br> tags using bs4 - Q&A

<blockquote> ICM Partners<br> 730 Fifth Avenue<br> New York, NY 10019<br> (212) 556-5600<br> (Gelfman Schneider)<br> <a href="http://www.icmtalent.com" target="_blank">http://www.icmtalent.com</a> </blockquote>

<blockquote> The Agency<br> 24 Pottery Lane<br> Holland Park<br> London W11 4LZ<br> <a href="http://theagency.co.uk" target="_blank">http://theagency.co.uk</a> </blockquote>

1 answer

User
#1 · posted 2024-09-30 18:13:57

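Extracting this kind of data can be very error-prone, and any approach needs to be tested on a larger data set.
One possible approach:
Use .stripped_strings to split the whole entry into a list of candidate lines.
Use a regular expression to look for a line that contains a telephone number. If none is found, let the address run up to, but not including, the last line.
Build the entry by assuming that the first item is the company name, that the following items up to address_end form the address, and that a telephone item follows only if one was found.
Assume the last item is the website.
For example: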
from bs4 import BeautifulSoup
import re

# A field is treated as a telephone number if it consists only of
# digits, brackets, spaces and hyphens (five characters or more).
re_tel = re.compile(r'[0-9() -]{5,}$')

# The <br> tags separate the lines of each entry.
html = """
<blockquote>
ICM Partners<br>
730 Fifth Avenue<br>
New York, NY 10019<br>
(212) 556-5600<br>
(Gelfman Schneider)<br>
<a href="http://www.icmtalent.com" target="_blank">http://www.icmtalent.com</a>
</blockquote>
<blockquote>
The Agency<br>
24 Pottery Lane<br>
Holland Park<br>
London W11 4LZ<br>
<a href="http://theagency.co.uk" target="_blank">http://theagency.co.uk</a>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

for blockquote in soup.find_all('blockquote'):
    fields = list(blockquote.stripped_strings)
    tel = ''
    address_end = -1    # default: the address is everything except the last line

    for index, field in enumerate(fields):
        if re_tel.match(field):
            tel = field
            address_end = index    # the address stops just before the telephone line
            break

    # name, address, telephone (may be empty), website
    fields = [fields[0], ', '.join(fields[1:address_end]), tel, fields[-1]]
    print(fields)
For your two examples, this gives:
['ICM Partners', '730 Fifth Avenue, New York, NY 10019', '(212) 556-5600', 'http://www.icmtalent.com']
['The Agency', '24 Pottery Lane, Holland Park, London W11 4LZ', '', 'http://theagency.co.uk']
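To see why the telephone pattern picks the right line here, and where it could go wrong, it helps to run it against individual fields. The following is a minimal sketch that reuses the answer's pattern unchanged; the bare postcode at the end of the list is a hypothetical field, not part of the original data, added only to show a false positive.

```python
import re

# Same pattern as in the answer: five or more digits, brackets, spaces
# or hyphens, running to the end of the field.
re_tel = re.compile(r'[0-9() -]{5,}$')

fields = [
    'ICM Partners',
    '730 Fifth Avenue',      # starts with digits, but "Fifth" ends the run too early
    'New York, NY 10019',    # match() anchors at the start, so "N" fails at once
    '(212) 556-5600',
    '10019',                 # hypothetical: a postcode on a line of its own
]

# Keep only the fields that the pattern accepts as telephone numbers.
matches = [field for field in fields if re_tel.match(field)]
print(matches)  # ['(212) 556-5600', '10019']
```

The postcode is accepted because the pattern only checks which characters appear, not how many of them are digits. That is one of the cases a larger data set would probably expose.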
This will undoubtedly need refining once it is tested on a larger data set.
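As one possible direction for that refinement, the sketch below works on the list produced by `list(blockquote.stripped_strings)` and returns a labelled record. It is not part of the original answer: the names `looks_like_phone` and `build_entry`, the seven-digit threshold, and the rule that a website must start with `http://` or `https://` are all assumptions made for illustration.

```python
import re

# Hypothetical, stricter test than the original pattern: the whole field must
# consist of phone punctuation and contain at least seven digits.
RE_PHONE_CHARS = re.compile(r'[0-9() +.-]+')


def looks_like_phone(field):
    digits = sum(ch.isdigit() for ch in field)
    return RE_PHONE_CHARS.fullmatch(field) is not None and digits >= 7


def build_entry(fields):
    """Turn the stripped strings of one <blockquote> into a labelled record."""
    fields = list(fields)
    entry = {'name': '', 'address': '', 'tel': '', 'website': ''}
    if not fields:
        return entry

    # Assumption: only treat the last field as the website if it looks like a URL.
    if fields[-1].startswith(('http://', 'https://')):
        entry['website'] = fields.pop()
    if not fields:
        return entry

    entry['name'] = fields[0]
    rest = fields[1:]
    address = rest
    for index, field in enumerate(rest):
        if looks_like_phone(field):
            entry['tel'] = field
            address = rest[:index]    # everything before the telephone line
            break
    entry['address'] = ', '.join(address)
    return entry


first = build_entry(['ICM Partners', '730 Fifth Avenue', 'New York, NY 10019',
                     '(212) 556-5600', '(Gelfman Schneider)',
                     'http://www.icmtalent.com'])
# Hypothetical variant of the second entry with the link missing.
no_site = build_entry(['The Agency', '24 Pottery Lane', 'Holland Park',
                       'London W11 4LZ'])
print(first)
print(no_site)
```

With the original list-based code, an entry without a link would have its last address line reported as the website; here that line stays in the address and `website` is left empty.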
