How to log in to my WSJ account from a Python script (OAuth 2.0)

Posted 2024-09-27 00:20:43


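I am a paying subscriber to The Wall Street Journal, and I want to collect some articles for an NLP project. I am fairly new to this, and I have spent several days trying to work out how to use OAuth 2.0, without success.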

The resources I have used so far are:

Scrape articles form wsj by requests, CURL and BeautifulSoup

How to log on to my wsj account from linux terminal (using curl, oauth2.0)

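I tried the code from the second link above, but I keep getting this error:

Traceback (most recent call last):
  line 71, in <module>
    print("connected user : " + username_search.group(1))
AttributeError: 'NoneType' object has no attribute 'group'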
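The AttributeError above means that re.search returned None, i.e. the "firstName" pattern was not found in the page body, so .group(1) was called on None. A minimal sketch of guarding that call is shown here; the helper name first_group and the two sample strings are hypothetical and only illustrate the behaviour, they are not real WSJ responses.

```python
import re


def first_group(pattern, text, flags=0):
    """Return the first capture group of pattern in text, or None if there is no match."""
    match = re.search(pattern, text, flags)
    # re.search returns None when nothing matches; calling .group on None
    # raises the AttributeError shown in the traceback.
    return match.group(1) if match else None


# Hypothetical sample bodies: one that contains the field, one that does not.
logged_in_body = '{"firstName": "Alice", "lastName": "Smith"}'
logged_out_body = "<html><body>Sign In</body></html>"

pattern = r'"firstName":\s*"(\w+)",'

print(first_group(pattern, logged_in_body, re.IGNORECASE))   # Alice
print(first_group(pattern, logged_out_body, re.IGNORECASE))  # None
```

A None result here would suggest that the article page was served to an anonymous session, which points back to one of the earlier login steps rather than to the final print.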

For some reason, the https://sso.accounts.dowjones.com/usernamepassword/login link does not seem to work.
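One way to narrow down what "does not seem to work" means is to look at the status code and redirect chain of the response before parsing it. The sketch below is an assumption about how one might do that: describe_response is a hypothetical helper, and it only reads the status_code, url and history attributes that a requests response object exposes. The stand-in object is made up so that the example runs without network access.

```python
from types import SimpleNamespace


def describe_response(resp):
    """Summarise a response object that has status_code, url and history attributes."""
    # Each entry in history is an earlier response in the redirect chain.
    hops = [(h.status_code, h.url) for h in resp.history]
    return {
        "status": resp.status_code,
        "final_url": resp.url,
        "redirects": hops,
        "ok": 200 <= resp.status_code < 300,
    }


# Stand-in object for illustration only; in the script, pass the value
# returned by session.post(...) instead.
fake = SimpleNamespace(status_code=404, url="https://example.invalid/login", history=[])
print(describe_response(fake))
```

Printing this summary after each session.get and session.post call would show which step first returns something other than a 2xx status.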

Here is my code (identical to the code from the second link):

import requests
from bs4 import BeautifulSoup
import re
import base64
import json

username="*********"
password="********"
base_url="https://accounts.wsj.com"

session = requests.Session()
r = session.get("{}/login".format(base_url))
soup = BeautifulSoup(r.text, "html.parser")
jscript = [
    t.get("src")
    for t in soup.find_all("script")
    if t.get("src") is not None and "app-min" in t.get("src")
][0]

# Raw string so that \. and \( reach the regex engine unchanged
credentials_search = re.search(r"Base64\.decode\('(.*)'", r.text, re.IGNORECASE)
base64_decoded = base64.b64decode(credentials_search.group(1))
credentials = json.loads(base64_decoded)

print("client_id : {}".format(credentials["clientID"]))
print("state     : {}".format(credentials["internalOptions"]["state"]))
print("nonce     : {}".format(credentials["internalOptions"]["nonce"]))
print("scope     : {}".format(credentials["internalOptions"]["scope"]))

r = session.get("{}{}".format(base_url, jscript))

connection_search = re.search(r'connection:\s*"(\w+)"', r.text, re.IGNORECASE)
connection = connection_search.group(1)
print("Testing here..........")
print(connection)

r = session.post(
    'https://sso.accounts.dowjones.com/usernamepassword/login',
    data = {
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": credentials["clientID"],
        "state": credentials["internalOptions"]["state"],
        "nonce": credentials["internalOptions"]["nonce"],
        "scope": credentials["internalOptions"]["scope"],
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login"
    })
soup = BeautifulSoup(r.text, "html.parser")

login_result = dict([
    (t.get("name"), t.get("value"))
    for t in soup.find_all('input')
    if t.get("name") is not None
])

r = session.post(
    'https://sso.accounts.dowjones.com/login/callback',
    data = login_result)

# Check the connected user by looking for "firstName" in an article page
r = session.get("https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y")
username_search = re.search(r'"firstName":\s*"(\w+)",', r.text, re.IGNORECASE)
print("connected user : " + username_search.group(1))
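The script above posts whatever input fields it finds to /login/callback without checking that the first POST actually returned a form. The following sketch collects the same name/value pairs with the standard library so that an empty result can be detected before continuing. The class and function names, the sample HTML and the field name "token" are all made up for illustration, and treating an empty dict as a failed login step is an assumption, not something the source confirms.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from <input> tags, like the login_result dict."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Inputs without a name (e.g. a bare submit button) are skipped.
        if name is not None:
            self.fields[name] = attributes.get("value")


def collect_inputs(html):
    """Return a dict of named <input> fields found in html."""
    parser = InputCollector()
    parser.feed(html)
    return parser.fields


# Hypothetical pages: one with a hidden field, one with no form at all.
sample_form = '<form><input type="hidden" name="token" value="abc"><input type="submit"></form>'
error_page = "<html><body>Wrong email or password.</body></html>"

print(collect_inputs(sample_form))
print(collect_inputs(error_page))
```

If the dict built from the real response is empty, the second POST has nothing meaningful to send, so the check for a connected user would be expected to fail.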

I also used the same resources as the earlier question: https://developer.dowjones.com/site/global/develop/authentication/index.gsp#2-exchanging-the-authorization-code-for-authn-tokens-98

https://oauth.net/2/

Any help is appreciated. Thanks.

