A more efficient way of matching values of different keys of JSON objects to find Twitter "conversations"

Posted 2024-10-05 10:41:57


I have a large JSON file of Twitter data in the following format (I have removed many Tweet attributes for readability):

{u'user': {
    u'id': 377881302, 
    u'name': u'kenji '}, 
 u'text': u'@mochan_s2 \u4ffa\u306f\u3001',        
 u'in_reply_to_status_id_str': u'288112458336456704',  
 u'id_str':  u'288120868188602368'
}

{u'user': {
    u'id': 377881302, 
    u'name': ... }
...
}
...

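Each JSON file contains tens of thousands of tweets.

I am trying to build a corpus of Twitter conversations from this data, in which all tweets from the same thread are stored together.

To do this, I need to match every Tweet that has an in_reply_to_status_id with the original Tweet whose id_str has that value.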

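Using jq, I extracted all the id_str and in_reply_to_status_id_str values and then took the intersection of the two sets, i.e. all values that appear in the data both as an id_str and as an in_reply_to_status_id_str.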
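For illustration, the same intersection can be computed in Python in a single pass over the data. This is a minimal sketch and not the jq command actually used; the function name `ids_in_both_roles` is made up here, and it assumes one JSON object per line with string fields `id_str` and `in_reply_to_status_id_str`.

```python
import json


def ids_in_both_roles(lines):
    """Return ids that occur both as a tweet id and as a reply target.

    `lines` is any iterable of strings, one JSON object per line
    (an assumption about the file layout).
    """
    tweet_ids = set()
    reply_targets = set()
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            # skip lines that are not valid JSON
            continue
        if not isinstance(tweet, dict):
            continue
        if tweet.get('id_str'):
            tweet_ids.add(tweet['id_str'])
        if tweet.get('in_reply_to_status_id_str'):
            reply_targets.add(tweet['in_reply_to_status_id_str'])
    # set intersection: ids seen in both roles
    return tweet_ids & reply_targets
```

Because the result is a set, testing whether a given id belongs to it takes constant time on average.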

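The Python script below goes through all of these numbers and, for each one, searches the JSON file for the objects that contain it and collects them into a list: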

import gzip
import json

# Iterate over file of ids that appear both as tweet ids and reply to tweet ids
for idee in open('matching_numbers.files', 'r'):
    # remove quotes and convert to integer
    idee = int(idee[1:-2]) 
    # make list into which all Tweets from a conversation will go
    convo = [] 
    #iterate over all tweets
    with gzip.open('tweet_data.json.gz') as inputs:
        for line_no, line in enumerate(inputs):
            # use try, except to ignore incorrectly formatted lines 
            try: 
                tweet = json.loads(line) 
                if idee == tweet['id'] or idee == tweet['in_reply_to_status_id']:
                    # if we have a match, add this to the conversation list
                    convo.append(tweet)
                    print(convo)
            except (ValueError, KeyError, TypeError):
                # skip lines that are not valid JSON or lack the expected keys
                continue

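This does produce lists of related tweets, but it is extremely slow, uses a lot of CPU, and has led to angry emails from the administrator of the server where the JSON files are stored.

The matching_numbers file contains about 9,000 id numbers, so the script iterates over the 1.8 GB file 9,000 times, which adds up to roughly 16 TB of data read.

Is there a more efficient way to match these numbers?

I am new to programming in general and only know a little Python. I cannot see any logical way to avoid going through the whole file for each number. But maybe there is another way?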
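One possible direction, given as a minimal sketch under stated assumptions rather than a tested solution for this dataset: read the file once, index every tweet by `id_str`, record which tweet each reply points to, and then group tweets by the earliest ancestor that is present in the data. The names `group_conversations` and `find_root` are hypothetical. The sketch assumes one JSON object per line, string fields `id_str` and `in_reply_to_status_id_str`, and that the parsed tweets fit in memory.

```python
import json
from collections import defaultdict


def group_conversations(lines):
    """Group tweets into threads with a single pass over `lines`."""
    by_id = {}      # id_str -> tweet
    parent_of = {}  # id_str -> id_str of the tweet it replies to
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            # skip lines that are not valid JSON
            continue
        if not isinstance(tweet, dict) or 'id_str' not in tweet:
            continue
        tweet_id = tweet['id_str']
        by_id[tweet_id] = tweet
        parent = tweet.get('in_reply_to_status_id_str')
        if parent:
            parent_of[tweet_id] = parent

    def find_root(tweet_id):
        # Walk up the reply chain while the parent is present in the data.
        seen = set()
        while (tweet_id in parent_of
               and parent_of[tweet_id] in by_id
               and tweet_id not in seen):
            seen.add(tweet_id)
            tweet_id = parent_of[tweet_id]
        return tweet_id

    threads = defaultdict(list)
    for tweet_id, tweet in by_id.items():
        threads[find_root(tweet_id)].append(tweet)
    # Keep only threads with at least two tweets.
    return [tweets for tweets in threads.values() if len(tweets) > 1]


# Possible usage on Python 3 (file name taken from the script above):
# import gzip
# with gzip.open('tweet_data.json.gz', 'rt') as inputs:
#     conversations = group_conversations(inputs)
```

With this approach the file is read once instead of about 9,000 times, at the cost of holding the tweets in memory. If memory is tight, a variant could keep only the id fields in the first pass and fetch the full tweets in a second pass.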

