How to extract information from this using Beautiful Soup in Python

Posted 2024-09-27 00:17:13


<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>

I need "Uploaded 10-29 18:50, Size 4.36 GiB" and "NLUPPER002" in two separate arrays. How can I do that?

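Edit:

This is one part of an HTML page that contains many font tags with different values. I need a general solution, using Beautiful Soup if possible. Otherwise, as suggested, I will look into regex.

Edit 2:

I have a doubt about this. If we use "class" as a key to search a soup, won't it clash with the Python keyword class and throw an error?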


2 Answers

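The expression needed to find the elements you are interested in depends on how unique those elements are compared with the other elements in the document. Without knowing the context of the elements, it is hard to help.

Are the elements you are interested in the only font elements in the document that have the class detDesc?

If so, here is a solution using lxml: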

import lxml.html as lh

html = '''
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
'''

tree = lh.fromstring(html)

results = []

# iterate over all elements in the document that have a class of "detDesc"
for el in tree.xpath("//font[@class='detDesc']"):

    # extract text from the font element
    first = el.text

    # extract text from the first <a> within the font element
    second = el.xpath("a")[0].text

    results.append((first, second))

print(results)

Result:

^{pr2}$
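# BeautifulSoup 3 on Python 2, which the u'...' output below reflects;
# with Beautiful Soup 4 the import is `from bs4 import BeautifulSoup`.
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(your_data)  # your_data: the HTML string to parse
uploaded = []
link_data = []
for f in soup.findAll("font", {"class": "detDesc"}):
    uploaded.append(f.contents[0])     # text node before the <a> tag
    link_data.append(f.a.contents[0])  # text inside the <a> tag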

For example, with the following data:

^{pr2}$

Running the code above gives:

>>> print uploaded
[u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ']
>>> print link_data
[u'NLUPPER002', u'NLUPPER003']

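To get the text in exactly the format mentioned in the question, you can post-process the lists or parse the data inside the loop itself. For example: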

>>> [",".join(x.split(",")[:2]).replace("&nbsp;", " ") for x in uploaded]
[u'Uploaded 10-29 18:50, Size 4.36 GiB', u'Uploaded 10-26 19:23, Size 1.16 GiB']

Also, if you are a fan of list comprehensions, the solution can be written as a one-liner:

output = [(f.contents[0], f.a.contents[0]) for f in soup.findAll("font", {"class":"detDesc"})]

This gives you:

>>> output  # list of tuples
[(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'NLUPPER002'), (u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ', u'NLUPPER003')]

>>> uploaded, link_data = zip(*output)  # split into two separate lists
>>> uploaded
(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ')
>>> link_data
(u'NLUPPER002', u'NLUPPER003')
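
For readers on Python 3 with Beautiful Soup 4, the same idea might look roughly like the sketch below. It makes several assumptions that are not part of the answers above: the package is imported as bs4, the built-in "html.parser" backend is used, and the second font element in the sample HTML is hypothetical (its href and title are modelled on the output shown earlier). Beautiful Soup 4 decodes &nbsp; into the non-breaking space character U+00A0, so the clean-up step replaces "\xa0" rather than the literal "&nbsp;". It also touches on the question's Edit 2: "class" used as a dictionary key is just a string, so it does not collide with the Python keyword, and Beautiful Soup 4 additionally accepts the keyword argument class_.

```python
from bs4 import BeautifulSoup

# Sample input. The first <font> element comes from the question; the second
# is a hypothetical element added only for illustration.
html = """
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
<font class="detDesc">Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER003/" title="Browse NLUPPER003">NLUPPER003</a></font>
"""

soup = BeautifulSoup(html, "html.parser")

uploaded = []
link_data = []

# "class" is only a string key here, so the Python keyword is not involved.
for font in soup.find_all("font", attrs={"class": "detDesc"}):
    # The first child of <font> is the text node that precedes the <a> tag.
    raw = str(font.contents[0])
    # Keep the first two comma-separated fields and turn the decoded
    # non-breaking spaces (U+00A0) into ordinary spaces.
    uploaded.append(",".join(raw.split(",")[:2]).replace("\xa0", " "))
    # Text inside the nested <a> tag.
    link_data.append(font.a.get_text())

# Equivalent search using the class_ keyword argument of Beautiful Soup 4.
same = soup.find_all("font", class_="detDesc")

print(uploaded)   # ['Uploaded 10-29 18:50, Size 4.36 GiB', 'Uploaded 10-26 19:23, Size 1.16 GiB']
print(link_data)  # ['NLUPPER002', 'NLUPPER003']
```

Restricting the search to font tags matters here because the nested a tags carry the same detDesc class; searching by class alone would return those as well.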
