I'm new to Python and have had good success with the HTML parser so far, but I'm stuck on how to paginate through the reviews at the bottom of the page so the scraper works across the whole site.

The URL is in the PasteBin code; for privacy reasons I've left it out of this thread.

Any help is much appreciated.
# Reviews Scrape
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'EXAMPLE.COM'
# opening connection and grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# HTML Parsing
page_soup = soup(page_html, "html.parser")
# Grabs each review
reviews = page_soup.findAll("div",{"class":"jdgm-rev jdgm-divider-top"})
filename = "compreviews.csv"
f = open(filename, "w")
headers = "Score,Title,Content\n"
f.write(headers)
# Look up each field within the review container and strip whitespace
for container in reviews:
    # the numeric rating is stored in the data-score attribute
    score = container.findAll("span", {"data-score": True})
    user_score = score[0]["data-score"]
    title_review = container.findAll("b", {"class": "jdgm-rev__title"})
    user_title = title_review[0].text.strip()
    content_review = container.findAll("div", {"class": "jdgm-rev__body"})
    user_content = content_review[0].text.strip()
    print("user_score: " + user_score)
    print("user_title: " + user_title)
    print("user_content: " + user_content)
    # note: a comma inside the title or body will break this naive CSV row
    f.write(user_score + "," + user_title + "," + user_content + "\n")
f.close()
The page performs an XHR GET request with a query string to fetch the results. That query string has parameters for the number of reviews per page and the page number. You can make one initial request with the maximum of 31 reviews per page, extract the HTML from the returned JSON and read the page count from it, then write a loop that runs through all the pages to collect the results, finally writing a dataframe to csv.
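A minimal sketch of that approach, assuming a hypothetical endpoint URL, query-parameter names (`page`, `per_page`), and JSON keys (`html`, `total_pages`) — the real ones must be taken from the XHR request visible in the browser's Network tab:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

PER_PAGE = 31  # maximum reviews per page for this widget
# Hypothetical endpoint -- substitute the real URL from the Network tab.
ENDPOINT = "https://EXAMPLE.COM/reviews"

def parse_reviews(html):
    """Pull score/title/body out of one page of review HTML."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    for container in page.find_all("div", {"class": "jdgm-rev"}):
        score = container.find("span", {"data-score": True})
        title = container.find("b", {"class": "jdgm-rev__title"})
        body = container.find("div", {"class": "jdgm-rev__body"})
        rows.append({
            "Score": score["data-score"] if score else "",
            "Title": title.get_text(strip=True) if title else "",
            "Content": body.get_text(strip=True) if body else "",
        })
    return rows

def scrape_all_pages():
    """Fetch page 1, read the page count from its JSON, then loop the rest."""
    first = requests.get(ENDPOINT, params={"page": 1, "per_page": PER_PAGE}).json()
    rows = parse_reviews(first["html"])  # 'html' key is an assumption
    for page_no in range(2, first["total_pages"] + 1):
        data = requests.get(ENDPOINT,
                            params={"page": page_no, "per_page": PER_PAGE}).json()
        rows.extend(parse_reviews(data["html"]))
    return rows

# pandas quotes fields for you, so commas in a review body are safe:
# pd.DataFrame(scrape_all_pages()).to_csv("compreviews.csv", index=False)
```

Using a dataframe for the final write also avoids the comma problem in the hand-rolled `f.write` CSV above, since `to_csv` quotes fields that contain the delimiter.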