BeautifulSoup in Python

Published 2024-10-03 21:27:53

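I am new to Python. I am a long-time Stack Overflow user, but this is my first time posting a question. I am trying to extract data from a website using BeautifulSoup. The data I want to extract is the "Listed in" and "Tagged in" data.

I can extract the paragraphs into a list, but I cannot get at the actual data. The goal is to extract:
Listed in: Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women
Tagged in: Makeup, Beauty, Nail polish

Could you tell me how to do this?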

import requests
from bs4 import BeautifulSoup

l1 = []
url = 'http://boxes.mysubscriptionaddiction.com/box/julep-maven'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for item in soup.find_all('p'):
    l1.append(item.contents)
search = '\nListed in:\n'
for a in l1:
    if a[0] in ('\nTagged in:\n', '\nListed in:\n'):
        print(a)

2 answers
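import re

import requests
from bs4 import BeautifulSoup

url = 'http://boxes.mysubscriptionaddiction.com/box/julep-maven'
plain_text = requests.get(url).text
soup = BeautifulSoup(plain_text, 'html.parser')

# Find the text nodes that contain the "Listed in:" label.
context = soup(string=re.compile(r'Listed in:'))

for item in context:
    listed_in = item.parent                       # element holding "Listed in:"
    tagged_in = listed_in.find_next_siblings()[0]  # next element, "Tagged in:"
    # Print inside the loop so nothing is referenced when there is no match.
    print(listed_in.text.strip('\n').replace('\n', ''))
    print(tagged_in.text.strip('\n').replace('\n', ''))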

This shows everything on a single line:

Listed in:Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women, Tagged in: Makeup, Beauty, Nail polish

Hope this helps.

Since you are already using lxml, why not use it more directly (lxml is generally considered faster than BeautifulSoup):

import requests
from lxml import html

url = 'http://boxes.mysubscriptionaddiction.com/box/julep-maven'
source_code = requests.get(url)
tree = html.fromstring(source_code.content)  # parses the html
paras = tree.xpath('//div[@class="box-information"]/p')  # gets the para elements

# This loop prints the desired para elements' text.
for ele in paras[1:]:
    print(ele.text_content())

Output:

Listed in:
Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women


Tagged in:
Makeup, Beauty, Nail polish
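
If the individual values are wanted rather than the raw text, the printed blocks can be split further. Below is a minimal sketch using only the standard library, assuming each block has the "Label:" followed by comma-separated values shape shown in the output above. `parse_labeled_list` and `raw_blocks` are hypothetical names introduced for illustration, and the strings are copied from that output rather than fetched from the site.

```python
def parse_labeled_list(block):
    """Split text shaped like 'Label: a, b, c' into (label, [a, b, c])."""
    # Everything before the first colon is the label, the rest is the values.
    label, _, values = block.partition(":")
    items = [v.strip() for v in values.split(",") if v.strip()]
    return label.strip(), items


# Hypothetical input: the text of the two paragraphs, as printed above.
raw_blocks = [
    "\nListed in:\nNail Polish Subscription Boxes, "
    "Subscription Boxes for Beauty Products, Subscription Boxes for Women\n",
    "\nTagged in:\nMakeup, Beauty, Nail polish\n",
]

result = dict(parse_labeled_list(block) for block in raw_blocks)
print(result["Listed in"])
print(result["Tagged in"])
```

In the lxml answer, the same helper could be applied to each `ele.text_content()` value instead of the hard-coded strings.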

Note: the site is protected by a captcha, so you may need to copy the source HTML as a string from your browser's developer tools and use it in tree = html.fromstring(copied_string) for this code to work.
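
The workaround in the note could look roughly like the sketch below. It assumes lxml is installed. The markup in `copied_string` is a hypothetical stand-in for what would be pasted from the browser (the real page structure may differ), and `extract_paragraphs` is a name made up for this example.

```python
from lxml import html

# Hypothetical stand-in for HTML copied from the browser's developer tools.
# Only the div class and the <p> layout follow the XPath used in the answer.
copied_string = """
<html><body>
<div class="box-information">
  <p>Placeholder first paragraph (skipped, as in the answer)</p>
  <p>
Listed in:
Nail Polish Subscription Boxes, Subscription Boxes for Women
  </p>
  <p>
Tagged in:
Makeup, Beauty, Nail polish
  </p>
</div>
</body></html>
"""


def extract_paragraphs(source):
    """Return the text of every <p> after the first, one line per paragraph."""
    tree = html.fromstring(source)  # parse the pasted string, no HTTP request
    paras = tree.xpath('//div[@class="box-information"]/p')
    # Collapse runs of whitespace so each paragraph becomes a single line.
    return [" ".join(p.text_content().split()) for p in paras[1:]]


for line in extract_paragraphs(copied_string):
    print(line)
```

Because nothing is fetched over the network, the captcha never comes into play. The trade-off is that the HTML has to be copied again whenever the page changes.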
