psaw Python package - PyPI

A pushshift.io API wrapper for searching public comments and submissions on reddit.com

Detailed project description for psaw

Installation

    pip install psaw

Currently, only Python 3 is supported.

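Description

A minimalist wrapper for searching public reddit comments/submissions via the pushshift.io API.

Pushshift is an extremely useful resource, but the API is poorly documented. As such, this API wrapper is currently designed to make it easy to pass along any search parameter the user wants to try.

Although it does not necessarily reflect the current state of the API, you should familiarize yourself with the Pushshift API documentation to better understand which search arguments are likely to work.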

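Features

Handles rate limiting and exponential backoff, subject to maximum-retry and maximum-backoff limits. A minimum rate limit of 1 request per second is used by default, per consultation with Pushshift's maintainer, /u/Stuck_in_the_matrix.
Handles paging of results. Returns all historical results for a given query by default.
Optionally handles incorporation of praw to fetch objects after getting ids from pushshift.
If not using praw, returns results as comment and submission objects whose API is similar to the corresponding praw objects. Additionally, result objects have a .d_ attribute that offers dict access to the associated data attributes.
Optionally adds a created attribute, which converts a comment's or submission's created_utc timestamp to the user's local time. (This may raise exceptions for users with certain timezone settings.)
Simple interface for passing query arguments to the API. The API is sparsely documented, so it is often fruitful to just try an argument and see whether it works.
A stop_condition argument that makes it simple to stop yielding results given arbitrary user-defined criteria.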
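The retry behaviour described in the first feature can be illustrated with a minimal sketch. This is not psaw's implementation: the function name `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable only so the demo runs without actually waiting.

```python
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0,
                      max_backoff=16.0, sleep=time.sleep):
    """Call func(), retrying failures with exponentially growing delays.

    Hypothetical helper for illustration only; not part of psaw.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(min(delay, max_backoff))  # never wait longer than the cap
            delay *= 2  # double the delay for the next attempt


# Demo: a callable that fails three times, then succeeds.
attempts = []
delays = []


def flaky():
    attempts.append(1)
    if len(attempts) < 4:
        raise RuntimeError("simulated failure")
    return "ok"


# Record the requested delays instead of sleeping.
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Capping the delay keeps a long outage from producing unbounded waits, while the retry limit ensures the caller eventually sees the error.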

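Warnings

Using non-default sort may result in unexpected behavior.
The default behavior is to continuously hit the pushshift API. If a query is taking longer than expected to return results, psaw may be pulling more data than you want, or it may be caught in some kind of loop.
I strongly recommend prototyping queries by printing to stdout to ensure you are getting the desired behavior.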
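One way to prototype safely is to cap how many items you pull from a generator while printing them. The sketch below uses `itertools.islice` on a stand-in generator; `preview` and `fake_results` are hypothetical names, and in real use the generator would come from a psaw search method instead.

```python
from itertools import islice


def preview(gen, n=5):
    """Print and return at most n items, leaving the rest of gen unconsumed."""
    items = list(islice(gen, n))
    for item in items:
        print(item)
    return items


def fake_results():
    # Stand-in for a search generator that would otherwise never stop.
    i = 0
    while True:
        yield {"id": i, "author": f"user{i}"}
        i += 1


sample = preview(fake_results(), n=3)
```

Because `islice` stops after `n` items, the underlying generator is never asked for more data than the preview needs.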

Demo usage

    from psaw import PushshiftAPI

    api = PushshiftAPI()

Or, to use pushshift search to fetch ids and then use praw to fetch objects:

    import praw
    from psaw import PushshiftAPI

    r = praw.Reddit(...)
    api = PushshiftAPI(r)

100 most recent submissions

    # The `search_comments` and `search_submissions` methods return generator objects
    gen = api.search_submissions(limit=100)
    results = list(gen)

First 10 submissions to /r/politics in 2017, filtering results to the url/author/title/subreddit fields.

The created_utc field is added automatically (it is used for paging).

    import datetime as dt

    start_epoch = int(dt.datetime(2017, 1, 1).timestamp())

    list(api.search_submissions(after=start_epoch,
                                subreddit='politics',
                                filter=['url', 'author', 'title', 'subreddit'],
                                limit=10))
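Note that a naive `datetime` is interpreted in the machine's local timezone when `.timestamp()` is called, so the epoch value above can differ between machines. If a fixed UTC boundary is wanted, a small helper such as the following could be used; `utc_epoch` is a hypothetical name, not part of psaw.

```python
import datetime as dt


def utc_epoch(year, month=1, day=1):
    """Return the Unix timestamp for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


start_epoch = utc_epoch(2017)
end_epoch = utc_epoch(2018)
print(start_epoch, end_epoch)  # 1483228800 1514764800
```

The resulting `start_epoch` can be passed as the `after` argument in the same way as in the demo.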

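Trying a search argument that doesn't actually work

According to the pushshift.io API documentation, we should be able to search submissions by url, but (at the time of this writing) this doesn't actually work in practice. The API should still respect the limit argument and possibly other supported arguments, but there are no guarantees. If you find that an argument you have passed is not supported by the API, the best approach is to remove it from the query and modify your API call to use only supported arguments, which reduces the risk of unexpected behavior.

    url = 'http://www.politico.com/story/2017/02/mike-flynn-russia-ties-investigation-235272'
    url_results = list(api.search_submissions(url=url, limit=500))

    len(url_results), any(r.url == url for r in url_results)
    # 500, False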
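When the API silently ignores an argument like this, one possible workaround is to drop the argument and filter the returned objects locally. The sketch below uses a stand-in `Submission` namedtuple and a hypothetical `filter_by_url` helper; real results would be psaw or praw objects exposing a `url` attribute, as in the demo.

```python
from collections import namedtuple

# Hypothetical stand-in for a result object with a `url` attribute.
Submission = namedtuple("Submission", ["url", "title"])


def filter_by_url(results, url):
    """Keep only results whose url attribute matches exactly."""
    return [r for r in results if getattr(r, "url", None) == url]


results = [
    Submission("http://example.com/a", "A"),
    Submission("http://example.com/b", "B"),
    Submission("http://example.com/a", "A again"),
]

matches = filter_by_url(results, "http://example.com/a")
print(len(matches), [m.title for m in matches])
```

Local filtering only sees what the query returned, so the query itself still needs supported arguments (such as a subreddit or time range) to narrow the search.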

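All AskReddit comments containing the text "OP"

Use the q parameter to search text. Omitting the limit parameter performs a full historical search. Requests are performed in batches of the size specified by the max_results_per_request parameter (default=500). Omitting the max_response_cache test in the demo below will return all results. Otherwise, this demo performs two API requests returning 500 comments each. Alternatively, the generator can be queried later for additional results.

    gen = api.search_comments(q='OP', subreddit='askreddit')

    max_response_cache = 1000
    cache = []

    for c in gen:
        cache.append(c)

        # Omit this test to actually return all results. Wouldn't recommend it though: could take a while, but you do you.
        if len(cache) >= max_response_cache:
            break

    # If you really want to: pick up where we left off to get the rest of the results.
    if False:
        for c in gen:
            cache.append(c)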

Using the `aggs` argument to summarize search results

When an aggs parameter is provided to a search method, the first result yielded by the generator contains the aggs result.

    api = PushshiftAPI()
    gen = api.search_comments(author='nasa', aggs='subreddit')
    next(gen)
    #  {'subreddit': [
    #    {'doc_count': 300, 'key': 'IAmA'},
    #    {'doc_count': 6, 'key': 'space'},
    #    {'doc_count': 1, 'key': 'ExposurePorn'},
    #    {'doc_count': 1, 'key': 'Mars'},
    #    {'doc_count': 1, 'key': 'OldSchoolCool'},
    #    {'doc_count': 1, 'key': 'news'},
    #    {'doc_count': 1, 'key': 'pics'},
    #    {'doc_count': 1, 'key': 'reddit.com'}]}

    len(list(gen))
    # 312
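The aggs result is a list of key/doc_count buckets, which can be awkward to query directly. As a sketch, the buckets from the output above can be folded into a `collections.Counter`; `buckets_to_counter` is a hypothetical helper, and the data is copied from the demo output rather than fetched live.

```python
from collections import Counter

# Aggregation result copied from the demo output above.
aggs = {'subreddit': [
    {'doc_count': 300, 'key': 'IAmA'},
    {'doc_count': 6, 'key': 'space'},
    {'doc_count': 1, 'key': 'ExposurePorn'},
    {'doc_count': 1, 'key': 'Mars'},
    {'doc_count': 1, 'key': 'OldSchoolCool'},
    {'doc_count': 1, 'key': 'news'},
    {'doc_count': 1, 'key': 'pics'},
    {'doc_count': 1, 'key': 'reddit.com'}]}


def buckets_to_counter(buckets):
    """Turn a list of {'key': ..., 'doc_count': ...} buckets into a Counter."""
    return Counter({b['key']: b['doc_count'] for b in buckets})


counts = buckets_to_counter(aggs['subreddit'])
print(counts.most_common(2))   # [('IAmA', 300), ('space', 6)]
print(sum(counts.values()))    # 312
```

The bucket totals sum to 312, which agrees with the number of comments the generator yields in the demo.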

Using the `redditor_subreddit_activity` convenience method

If you want to profile a redditor's activity as in the aggs example, the redditor_subreddit_activity method provides a simple shorthand: it profiles a user by the subreddits in which they are active, counts comments and submissions separately in a single call, and returns Counter objects for commenting and posting activity, respectively.

    api = PushshiftAPI()
    result = api.redditor_subreddit_activity('nasa')
    result
    # {'comment':
    #    Counter({
    #      'ExposurePorn': 1,
    #      'IAmA': 300,
    #      'Mars': 1,
    #      'OldSchoolCool': 1,
    #      'news': 1,
    #      'pics': 1,
    #      'reddit.com': 1,
    #      'space': 6}),
    #  'submission':
    #    Counter({
    #      'IAmA': 3,
    #      'ISS': 1,
    #      'Mars': 1,
    #      'space': 3,
    #      'u_nasa': 86})}
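Because both values are `Counter` objects, standard Counter operations apply to them. The following sketch rebuilds the result from the output shown above (no API call is made) and combines the two counters into an overall activity ranking.

```python
from collections import Counter

# Result copied from the demo output above.
result = {
    'comment': Counter({
        'ExposurePorn': 1, 'IAmA': 300, 'Mars': 1, 'OldSchoolCool': 1,
        'news': 1, 'pics': 1, 'reddit.com': 1, 'space': 6}),
    'submission': Counter({
        'IAmA': 3, 'ISS': 1, 'Mars': 1, 'space': 3, 'u_nasa': 86}),
}

# Adding Counters sums the counts per subreddit.
total = result['comment'] + result['submission']
print(total.most_common(3))  # [('IAmA', 303), ('u_nasa', 86), ('space', 9)]

# Subreddits where the user has submitted but never commented.
submit_only = set(result['submission']) - set(result['comment'])
print(sorted(submit_only))   # ['ISS', 'u_nasa']
```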

Using the `stop_condition` argument to get the most recent submission by a bot account

    gen = api.search_submissions(stop_condition=lambda x: 'bot' in x.author)

    for subm in gen:
        pass

    print(subm.author)
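The demo relies on the last item yielded being the one that satisfied the condition. A minimal sketch of that pattern follows, assuming the matching item is yielded before iteration stops, as the demo implies. `yield_until` and the `Submission` stand-in are hypothetical and do not reflect psaw's internals.

```python
from collections import namedtuple

# Hypothetical stand-in for a submission object with an `author` attribute.
Submission = namedtuple("Submission", ["author"])


def yield_until(items, stop_condition=lambda x: False):
    """Yield items in order, stopping right after the first match."""
    for item in items:
        yield item
        if stop_condition(item):
            return  # the matching item has already been yielded


feed = [
    Submission("alice"),
    Submission("bob"),
    Submission("helper_bot"),
    Submission("carol"),
]

seen = list(yield_until(feed, stop_condition=lambda x: "bot" in x.author))
print(seen[-1].author)  # helper_bot
```

With the default condition, which always returns False, the generator simply yields every item.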

License

psaw's source is provided under the Simplified BSD License.

psaw 0.0.7
