How can my Postgres query perform faster? Can I use Python to provide faster iteration?



This is a two-part question. If you are reading this, thank you for your time!
<ol>
<li>Is there a way to make my query faster?
I previously asked a question <a href="https://stackoverflow.com/questions/46225548/how-to-iterate-through-postgresql-jsonb-array-values-for-purposes-of-matching-wi">here</a> and eventually solved it myself.
However, the query I devised to produce the desired results is very slow (25+ minutes) when run against a database of more than 40,000 records.
The query serves its purpose, but I am hoping one of you can tell me how to make it run at a more acceptable speed.
My query:
<pre><code>with dupe as (
    select
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *, jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by record_id, json_document
    order by last_name
)
select *
from dupe da
where (
    select count(*) from dupe db where db.record_id = da.record_id
) > 1;
</code></pre>
And some sample data:
Row 1:
<pre><code>{
  "Firstname": "Bobb",
  "Lastname": "Smith",
  "Identifiers": [
    { "Content": "123", "RecordID": "123", "SystemID": "Test", "LastUpdated": "2017-09-12T02:23:30.817Z" },
    { "Content": "abc", "RecordID": "abc", "SystemID": "Test", "LastUpdated": "2017-09-13T10:10:21.598Z" },
    { "Content": "def", "RecordID": "def", "SystemID": "Test", "LastUpdated": "2017-09-13T10:10:21.598Z" }
  ]
}
</code></pre>
Row 2:
<pre><code>{
  "Firstname": "Bob",
  "Lastname": "Smith",
  "Identifiers": [
    { "Content": "abc", "RecordID": "abc", "SystemID": "Test", "LastUpdated": "2017-09-13T10:10:26.020Z" }
  ]
}
</code></pre></li>
<li>If I bring the results of my query, or part of them, into a Python environment where I can manipulate them with Pandas, how can I iterate over the results of the query (or sub-query) to reach the same end result as my original query?
Is there an easier way, using Python, to iterate through my un-nested json array the way Postgres does?
For example, after running this query:
<pre><code>select
    json_document->'Firstname'->0->'Content' as first_name,
    json_document->'Lastname'->0->'Content' as last_name,
    identifiers->'RecordID' as record_id
from (
    select *, jsonb_array_elements(json_document->'Identifiers') as identifiers
    from staging
) sub
order by last_name;
</code></pre>
How can I take the results of that query with Python/Pandas and do something like this:
<pre><code>da = datasets[query_results]  # to equal my dupe da query
db = datasets[query_results]  # to equal my dupe db query
</code></pre>
and then perform the equivalent of
<pre><code>select *
from dupe da
where (
    select count(*) from dupe db where db.record_id = da.record_id
) > 1;
</code></pre>
in Python?</li>
</ol>
I apologize if I have not provided enough information here. I am a Python novice. Any help is greatly appreciated! Thanks!


1 Answer

Anonymous, 1 day ago


Consider reading the raw, unqueried values of the Postgres json column type and using pandas <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html" rel="nofollow noreferrer"><code>json_normalize</code></a> to bind them into a flat data frame. From there, use pandas <a href="http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html" rel="nofollow noreferrer"><code>drop_duplicates</code></a>.
To demonstrate, the code below parses one json document into a three-row data frame, one row for each corresponding Identifiers record:
<pre><code>import json
import pandas as pd

json_str = '''
{
  "Firstname": "Bobb",
  "Lastname": "Smith",
  "Identifiers": [
    { "Content": "123", "RecordID": "123", "SystemID": "Test", "LastUpdated": "2017-09-12T02:23:30.817Z" },
    { "Content": "abc", "RecordID": "abc", "SystemID": "Test", "LastUpdated": "2017-09-13T10:10:21.598Z" },
    { "Content": "def", "RecordID": "def", "SystemID": "Test", "LastUpdated": "2017-09-13T10:10:21.598Z" }
  ]
}
'''

data = json.loads(json_str)

# pd.json_normalize is the top-level name since pandas 1.0;
# older releases exposed the same function as pd.io.json.json_normalize.
df = pd.json_normalize(data, 'Identifiers', ['Firstname', 'Lastname'])
print(df)
# Output as originally reported (column order may differ between pandas versions):
#   Content               LastUpdated RecordID SystemID Lastname Firstname
# 0     123  2017-09-12T02:23:30.817Z      123     Test    Smith      Bobb
# 1     abc  2017-09-13T10:10:21.598Z      abc     Test    Smith      Bobb
# 2     def  2017-09-13T10:10:21.598Z      def     Test    Smith      Bobb
</code></pre>
<hr/>
For your database, consider connecting with a DB-API driver such as <code>psycopg2</code>, or with <a href="http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html" rel="nofollow noreferrer">sqlAlchemy</a>, and parse each json value as a string accordingly. Admittedly, there may be other ways to handle json, as shown in the <a href="http://initd.org/psycopg/docs/extras.html#additional-data-types" rel="nofollow noreferrer">psycopg2 docs</a>, but the code below receives the data as text and parses it on the Python side:
<pre><code>import json
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# Cast to text so each row arrives as a JSON string to parse in Python
cur.execute("SELECT json_document::text FROM staging;")

# One output row per entry in each document's Identifiers array
df = pd.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
                       'Identifiers', ['Firstname', 'Lastname'])

# Keep only the first row seen for each RecordID
df = df.drop_duplicates(['RecordID'])

cur.close()
conn.close()
</code></pre>
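Note that <code>drop_duplicates</code> keeps one row per RecordID, whereas the query in the question returns every row whose record_id occurs more than once. A minimal sketch of that second behaviour is shown here, assuming the documents have already been parsed into Python dicts shaped like the sample data. The function name `find_shared_record_ids` and the trimmed-down `rows` list are hypothetical and only for illustration. It uses `pd.json_normalize` (the top-level name in pandas 1.0 and later) together with `DataFrame.duplicated(keep=False)`.

```python
import pandas as pd

# The two sample rows from the question, trimmed to the fields used here.
rows = [
    {"Firstname": "Bobb", "Lastname": "Smith",
     "Identifiers": [{"RecordID": "123"}, {"RecordID": "abc"}, {"RecordID": "def"}]},
    {"Firstname": "Bob", "Lastname": "Smith",
     "Identifiers": [{"RecordID": "abc"}]},
]


def find_shared_record_ids(documents):
    """Return every flattened row whose RecordID occurs more than once."""
    # One row per entry of each document's Identifiers array
    flat = pd.json_normalize(documents, "Identifiers", ["Firstname", "Lastname"])
    # keep=False flags all members of a duplicate group, not just the repeats,
    # which plays the role of the "count(*) > 1" filter in the SQL
    mask = flat.duplicated(subset="RecordID", keep=False)
    # mergesort is stable, so ties on Lastname keep their original order
    return flat[mask].sort_values("Lastname", kind="mergesort").reset_index(drop=True)


dupes = find_shared_record_ids(rows)
print(dupes)
```

With the sample data, only the two rows carrying RecordID "abc" remain. One difference from the SQL to be aware of: the query groups by record_id and json_document, so a RecordID listed twice inside a single document is collapsed there, while this sketch would flag it as a duplicate.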
