How do I limit regex results?



I'm trying to extract tweets from a huge JSON file. My regex is generating too much data, and I can't for the life of me figure out how to limit it. The regex finds what it is meant to find, but it also matches too much.

The regex I'm using is as follows (probably more complicated than it needs to be, but that's not what I'm interested in):

<pre><code>(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"
</code></pre>

Here is a truncated line from the JSON file that generates too much data:

<pre><code>{"contributors": null, "truncated": false, "text": "RT @BelloPromotions: Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica #musicanu\u2026", "is_quote_status": false, "in_reply_to_status_id": null, "id": 1099558111000506369, "favorite_count": 0, "entities": {"symbols": [], "user_mentions": [{"id": 943461023293542400, "indices": [3, 19], "id_str": "943461023293542400", "screen_name": "BelloPromotions", "name": "Bello Promotions \ud83d\udcc8\ud83d\udcb0"}, {"id": 729572008909000704, "indices": [60, 71], "id_str": "729572008909000704", "screen_name": "MykeTowers", "name": "Towers Myke"}, {"id": 775866464, "indices": [92, 99], "id_str": "775866464", "screen_name": "mariah", "name": "Kenzie peretti"}], "hashtags": [{"indices": [72, 83], "text": "myketowers"}, {"indices": [84, 91], "text": "mariah"}, {"indices": [100, 114], "text": "Desaparecemos"}, {"indices": [115, 121], "text": "music"}, {"indices": [122, 129], "text": "musica"}], "urls": []}, "retweeted": false, "coordinates": null, "source": "<a href=\"http://twitter-dummy-auth.herokuapp.com/\" rel=\"nofollow\">Music Twr Suggesting</a>", "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 18, "id_str": "1099558111000506369", "favorited": false, "retweeted_status": {"contributors": null, "truncated": true, "text": "Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]", .......
</code></pre>

In the example above, my regex prints out "myketowers" and then the second instance of the tweet (the original tweet, which comes after "retweeted_status"). All I want is the tweet.

Below is the Python code I'm running. It doesn't throw any errors and it does exactly what I ask of it, just too much of it:

<pre><code>import re
import codecs

err_occur = []
pattern = re.compile(r'(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"')
input_filename = 'music_fixed.json'
tweets = open("tweets_380k.txt", "w")

try:
    with codecs.open('music_fixed.json', encoding='utf8') as in_file:
        for line in in_file:
            matches = pattern.findall(line)
            if matches:
                for match in matches:
                    err_occur.append(match)
except FileNotFoundError:
    print("Input file %r not found." % input_filename)

for tagged in err_occur:
    tweets.write(str(tagged)+"\n")
</code></pre>

As stated above, the expected output of the regex for the posted JSON line is:

<pre><code>Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]
</code></pre>

What actually ends up in my text file is:

<pre><code>myketowers
Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]
</code></pre>
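A note on why the extra match shows up, with one possible way to restrict it. When `[^R][^T]` rejects the top-level `"RT ..."` text, the lazy `.*?` after `"truncated": ` is still allowed to keep expanding, so it runs forward to the next `, "text": "` it can find, which here is the one inside the `hashtags` list (`"myketowers"`). The sketch below shows two alternatives under stated assumptions: parsing each line with the standard `json` module (assuming every line of the file is one complete JSON object), and a tightened pattern (assuming the `contributors` and `truncated` values never contain a comma). The names `original_text`, `TIGHT_PATTERN`, and the shortened `sample` line are made up for illustration and do not come from the question.

```python
import json
import re

# Tightened variant of the question's pattern (one possible fix, not the only one):
# [^,]* cannot run past the end of a field, (?!RT @) rejects only retweet text,
# and (?:[^"\\]|\\.)* reads up to the closing quote while skipping escapes like \".
TIGHT_PATTERN = re.compile(
    r'"contributors": [^,]*, "truncated": [^,]*, "text": "((?!RT @)(?:[^"\\]|\\.)*)"'
)


def original_text(line):
    """Return the text of the original tweet on one JSON line, or None."""
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return None  # skip lines that are not valid JSON
    if not isinstance(tweet, dict):
        return None
    # For a retweet, the original tweet is nested under "retweeted_status".
    source = tweet.get("retweeted_status") or tweet
    return source.get("text") if isinstance(source, dict) else None


# Hypothetical sample, shortened from the line in the question.
sample = json.dumps({
    "contributors": None,
    "truncated": False,
    "text": "RT @BelloPromotions: Myke Towers Ft. Mariah - Desaparecemos\n@myketowers\u2026",
    "entities": {"hashtags": [{"indices": [72, 83], "text": "myketowers"}]},
    "retweeted_status": {
        "contributors": None,
        "truncated": True,
        "text": "Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #music\u2026 [link]",
    },
})

# json.loads decodes \n and \u2026; json.dumps re-escapes them so that
# each tweet still occupies a single line when written to a text file.
print(json.dumps(original_text(sample)))
print(TIGHT_PATTERN.findall(sample))
```

Parsing the JSON is probably the more robust of the two, since it does not depend on key order or on the exact spacing between fields. The regex variant keeps the escaped form of the text (`\n`, `\u2026`) exactly as it appears in the file.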
