Why does BeautifulSoup truncate the results in some cases but not in others?

import json

import requests
from bs4 import BeautifulSoup

s = requests.session()

def get_csrf_tokens():
    # Scrape both CSRF tokens out of the LinkedIn landing page.
    url = "https://www.linkedin.com/"
    req = s.get(url).text
    csrf_token = req.split('name="csrfToken" value=')[1].split('" id="')[0]
    login_csrf_token = req.split('name="loginCsrfParam" value="')[1].split('" id="')[0]
    return csrf_token, login_csrf_token

def login(username, password):
    url = "https://www.linkedin.com/uas/login-submit"
    csrfToken, loginCsrfParam = get_csrf_tokens()
    data = {
        'session_key': username,
        'session_password': password,
        'csrfToken': csrfToken,
        'loginCsrfParam': loginCsrfParam
    }
    req = s.post(url, data=data)
    print("success")

login(USERNAME, PASSWORD)

def get_all_json(company_link):
    r = s.get(company_link)
    html = r.content
    soup = BeautifulSoup(html)
    # r.content is bytes, so the dump is written in binary mode.
    html_file = open("html_file.html", 'wb')
    html_file.write(html)
    html_file.close()
    Json_stuff = soup.find('code', id="voltron_srp_main-content")
    print(Json_stuff)
    return remove_tags(Json_stuff)

def remove_tags(p):
    # Slice off the opening <code ...><!-- and the closing --></code>.
    p = str(p)
    return p[62:-10]

def list_of_employes():
    jsons = get_all_json('https://www.linkedin.com/vsearch/p?f_CC=2409087')
    print(jsons)
    loaded_json = json.loads(jsons.replace(r'\u002d', '-'))
    employes = loaded_json['content']['page']['voltron_unified_search_json']['search']['results']
    return employes

def get_employee_link(employes):
    profiles = []
    for employee in employes:
        print(employee['person']['link_nprofile_view_3'])
        profiles.append(employee['person']['link_nprofile_view_3'])
    return profiles, len(profiles)

print(get_employee_link(list_of_employes()))

2 answers

User

#1 · edited 2024-09-27 21:34:32

This happens because the results are paginated. You need to fetch every page listed in the JSON data at:

data['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']

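pages is a list. For company 2409087 it holds one entry per results page, and each entry carries a pageURL key.

This is basically a list of URLs that you need to go through to fetch the data.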
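
To make the shape of that list concrete, here is a minimal sketch of walking the key path quoted above and collecting the page URLs. The sample dictionary, its example.com URLs, and the helper name get_page_urls are made up for illustration; the real LinkedIn payload has many more fields.

```python
# Hypothetical sample that mirrors only the key path quoted above.
sample = {
    "content": {"page": {"voltron_unified_search_json": {"search": {
        "baseData": {"resultPagination": {"pages": [
            {"pageURL": "https://example.com/search?page_num=1"},
            {"pageURL": "https://example.com/search?page_num=2"},
            {"pageURL": "https://example.com/search?page_num=3"},
        ]}},
    }}}}
}

# Key path to the pagination list, as given in the answer.
PAGES_PATH = (
    "content", "page", "voltron_unified_search_json", "search",
    "baseData", "resultPagination", "pages",
)


def get_page_urls(data):
    """Return the pageURL of every entry in the pagination list."""
    node = data
    for key in PAGES_PATH:
        node = node[key]  # raises KeyError if the layout differs
    return [page["pageURL"] for page in node]


for page_url in get_page_urls(sample):
    print(page_url)
```

Letting a missing key raise KeyError is deliberate here: if the page layout changes, a loud failure is easier to notice than a silently short result list.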

Here is what you need to do (the login code is omitted):

def get_results(json_code):
    return json_code['content']['page']['voltron_unified_search_json']['search']['results']

# s is the logged-in requests session from the question.
url = "https://www.linkedin.com/vsearch/p?f_CC=2409087"
soup = BeautifulSoup(s.get(url).text)

# The JSON is the first child of the <code> element.
code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)

# Results from the first page.
results = get_results(json_code)

pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
# The first entry is the page already fetched, so start from the second.
for page in pages[1:]:
    soup = BeautifulSoup(s.get(page['pageURL']).text)
    code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
    json_code = json.loads(code)
    results += get_results(json_code)

print(len(results))

For https://www.linkedin.com/vsearch/p?f_CC=2409087 it prints 25, which is exactly what you see in the browser.

User

#2 · edited 2024-09-27 21:34:32

It turned out that the default BeautifulSoup parser was the problem. I switched it to html5lib.

Install it from the console:

pip install html5lib

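Then change the parser that is selected when the soup object is first created, that is, pass 'html5lib' as the second argument to BeautifulSoup.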
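
The original snippet for this step did not survive, so what follows is only a sketch of what choosing the parser explicitly can look like, not the answerer's exact code. The sample HTML, its contents, and the helper name extract_embedded_json are assumptions for illustration; the only thing taken from the thread is the code element's id and the fact that the JSON is its first child. It needs the bs4 package, plus html5lib for the default argument.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the LinkedIn page: the JSON sits inside an HTML
# comment within the <code> element.
SAMPLE_HTML = (
    '<html><body>'
    '<code id="voltron_srp_main-content" style="display:none;">'
    '<!--{"content": {"page": {"voltron_unified_search_json": '
    '{"search": {"results": [1, 2, 3]}}}}}-->'
    '</code></body></html>'
)


def extract_embedded_json(html, parser="html5lib"):
    """Parse html with the named parser and decode the embedded JSON."""
    # Naming the parser explicitly keeps the result from depending on
    # whichever parser happens to be installed.
    soup = BeautifulSoup(html, parser)
    code = soup.find("code", id="voltron_srp_main-content")
    if code is None or not code.contents:
        raise ValueError("embedded JSON block not found")
    # contents[0] is the comment node that holds the JSON text.
    return json.loads(code.contents[0])


data = extract_embedded_json(SAMPLE_HTML, parser="html.parser")
print(len(data["content"]["page"]["voltron_unified_search_json"]["search"]["results"]))
```

Passing parser="html.parser" or parser="html5lib" to the same helper makes it easy to compare how many results each parser recovers from a saved copy of the page.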

This is documented in the BeautifulSoup docs.
