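I am a paying subscriber to The Wall Street Journal, and I want to collect some articles for an NLP project. I'm fairly new to this, and I have spent several days trying to figure out how to use OAuth 2.0, without any luck.
So far, my resources are: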
Scrape articles form wsj by requests, CURL and BeautifulSoup
How to log on to my wsj account from linux terminal (using curl, oauth2.0)
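I tried the code from the second link above, but I keep getting this error:
Traceback (most recent call last):
  line 71, in <module>
    print("connected user : " + username_search.group(1))
AttributeError: 'NoneType' object has no attribute 'group'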
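For context on the error itself: `re.search` returns `None` when the pattern is not found, so calling `.group(1)` on the result raises exactly this `AttributeError`. A minimal sketch of a guarded lookup follows. The helper name `first_group` and both sample page bodies are hypothetical and only illustrate the failure mode; they are not taken from the real WSJ response.

```python
import re


def first_group(pattern, text, label):
    """Return group(1) of the first match, or raise a descriptive error."""
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        # re.search found nothing, so there is no match object to call .group() on.
        raise ValueError("no match for {!r} in the response body".format(label))
    return match.group(1)


# Hypothetical page bodies, for illustration only.
logged_in_page = '{"firstName": "Alice", "lastName": "Doe"}'
logged_out_page = "<html><body>Sign In</body></html>"

# Same pattern as the firstName lookup in the script, written as a raw string.
pattern = r'"firstName":\s*"(\w+)",'

print(first_group(pattern, logged_in_page, "firstName"))
try:
    first_group(pattern, logged_out_page, "firstName")
except ValueError as exc:
    print("lookup failed:", exc)
```

If the guarded lookup fails on the real response, that would suggest the article page was served without the logged-in user data, which points back at the login steps rather than at the regular expression.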
For some reason, the https://sso.accounts.dowjones.com/usernamepassword/login link does not seem to work.
Here is my code (the same as the code in the second link):
import requests
from bs4 import BeautifulSoup
import re
import base64
import json

username = "*********"
password = "********"
base_url = "https://accounts.wsj.com"

session = requests.Session()
r = session.get("{}/login".format(base_url))
soup = BeautifulSoup(r.text, "html.parser")

# Find the path of the "app-min" script referenced by the login page
jscript = [
    t.get("src")
    for t in soup.find_all("script")
    if t.get("src") is not None and "app-min" in t.get("src")
][0]

# The login page embeds its configuration as a Base64-encoded JSON string
credentials_search = re.search(r"Base64\.decode\('(.*)'", r.text, re.IGNORECASE)
base64_decoded = base64.b64decode(credentials_search.group(1))
credentials = json.loads(base64_decoded)

print("client_id : {}".format(credentials["clientID"]))
print("state : {}".format(credentials["internalOptions"]["state"]))
print("nonce : {}".format(credentials["internalOptions"]["nonce"]))
print("scope : {}".format(credentials["internalOptions"]["scope"]))

# Read the connection name from the script
r = session.get("{}{}".format(base_url, jscript))
connection_search = re.search(r'connection:\s*"(\w+)"', r.text, re.IGNORECASE)
connection = connection_search.group(1)
print("Testing here..........")
print(connection)

# Submit the username and password
r = session.post(
    "https://sso.accounts.dowjones.com/usernamepassword/login",
    data={
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": credentials["clientID"],
        "state": credentials["internalOptions"]["state"],
        "nonce": credentials["internalOptions"]["nonce"],
        "scope": credentials["internalOptions"]["scope"],
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login",
    },
)

# Collect the named input fields from the returned form and post them to the callback
soup = BeautifulSoup(r.text, "html.parser")
login_result = dict([
    (t.get("name"), t.get("value"))
    for t in soup.find_all("input")
    if t.get("name") is not None
])
r = session.post(
    "https://sso.accounts.dowjones.com/login/callback",
    data=login_result,
)

# Check the connected user
r = session.get("https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y")
username_search = re.search(r'"firstName":\s*"(\w+)",', r.text, re.IGNORECASE)
print("connected user : " + username_search.group(1))
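The script assumes the login page still embeds its settings as a Base64-encoded JSON string inside a `Base64.decode('...')` call. A small sketch of that decoding step, run against a made-up page so it works offline, might look like the following. The function name `extract_embedded_config` and the sample values (`demo-client`, `s1`, `n1`) are hypothetical, and the pattern uses `[^']*` instead of the greedy `.*` from the script so the match stops at the first closing quote.

```python
import base64
import json
import re


def extract_embedded_config(html):
    """Return the decoded JSON config embedded in the page, or None if absent."""
    match = re.search(r"Base64\.decode\('([^']*)'", html, re.IGNORECASE)
    if match is None:
        # The page layout differs from what the script expects.
        return None
    return json.loads(base64.b64decode(match.group(1)))


# Build a hypothetical page that mimics the assumed layout.
sample = {
    "clientID": "demo-client",
    "internalOptions": {"state": "s1", "nonce": "n1", "scope": "openid"},
}
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
page = "<script>var config = JSON.parse(Base64.decode('{}'));</script>".format(encoded)

config = extract_embedded_config(page)
print(config["clientID"])
print(config["internalOptions"]["state"])
print(extract_embedded_config("<html></html>"))
```

Returning `None` instead of failing on `.group(1)` makes it easier to tell which step of the flow no longer matches the live page.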
I have also used the same resource as the earlier questions: https://developer.dowjones.com/site/global/develop/authentication/index.gsp#2-exchanging-the-authorization-code-for-authn-tokens-98
Any help is appreciated. Thanks.