Unable to extract data from the pantip.com website

import requests
import re
from bs4 import BeautifulSoup

# specify the url
url = 'https://pantip.com/topic/38372443'

# Split Topic number
topic_number = re.split('https://pantip.com/topic/', url)
topic_number = topic_number[1]

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Capture title
elementTag_title = soup.find(id = 'topic-'+ topic_number)
title = str(elementTag_title.find_all(class_ = 'display-post-title')[0].string)

# Capture post story
resultSet_post = elementTag_title.find_all(class_ = 'display-post-story')[0]
post = resultSet_post.contents[1].text.strip()

<div id="comments-jsrender">
  <div class="loadmore-bar loadmore-bar-paging">
    <a href="javascript:void(0)">
      <span class="icon-expand-left"><small>▼</small></span>
      <span class="focus-txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span>
      <span class="icon-expand-right"><small>▼</small></span>
    </a>
  </div>
</div>

1 answer

User

Answer 1 · posted 2024-09-29 19:32:22

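You are having trouble locating the rest of the posts because the site fills them in with dynamic JavaScript, so they are not in the HTML that requests downloads. To work around this, you can use Selenium. For Firefox, get the correct driver from https://github.com/mozilla/geckodriver/releases and add it to your system PATH. Selenium will load the page in a real browser, and you will then have full access to all the elements you see in your screenshot.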

Once that is set up, you can use the following to print the text of each post:

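from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://pantip.com/topic/38372443'
driver = webdriver.Firefox()  # requires geckodriver on PATH
try:
    driver.get(url)
    # Wait up to 10 s (an arbitrary choice) for the first JavaScript-rendered comment
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[id^='comment-']")))
    soup = BeautifulSoup(driver.page_source, 'lxml')  # needs the lxml package
finally:
    driver.quit()  # close the browser even if loading fails

# Comment containers have ids starting with "comment-"
for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    text = div.text.strip()
    if len(text) > 1:
        print(text)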
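
To check whether a given HTML string already contains rendered comments or only the loading placeholder from the question, a small probe like the one below may help. It is a minimal sketch using only the standard library's html.parser rather than BeautifulSoup. The names CommentProbe and comments_rendered are hypothetical, and rendered_html is made-up sample markup, not actual pantip.com output. The only assumption taken from the answer is that comment containers are div elements whose id starts with "comment-".

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Counts <div> start tags whose id starts with 'comment-' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.comment_divs = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; value may be None
        element_id = dict(attrs).get("id") or ""
        if element_id.startswith("comment-"):
            self.comment_divs += 1


def comments_rendered(html):
    """Return True if the markup holds at least one 'comment-' div."""
    probe = CommentProbe()
    probe.feed(html)
    probe.close()
    return probe.comment_divs > 0


# Placeholder markup as in the question (trimmed): no comments yet
static_html = (
    '<div id="comments-jsrender">'
    '<div class="loadmore-bar loadmore-bar-paging"></div>'
    '</div>'
)
# Made-up example of what rendered markup might look like
rendered_html = (
    '<div id="comments-jsrender">'
    '<div id="comment-1">First reply</div>'
    '</div>'
)

print(comments_rendered(static_html))    # False
print(comments_rendered(rendered_html))  # True
```

Running comments_rendered on page.content.decode() from requests and on driver.page_source from Selenium should show the difference between the two approaches, assuming the site still renders comments this way.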
