How to limit regex results?

{"contributors": null, "truncated": false, "text": "RT @BelloPromotions: Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica #musicanu\u2026", "is_quote_status": false, "in_reply_to_status_id": null, "id": 1099558111000506369, "favorite_count": 0, "entities": {"symbols": [], "user_mentions": [{"id": 943461023293542400, "indices": [3, 19], "id_str": "943461023293542400", "screen_name": "BelloPromotions", "name": "Bello Promotions \ud83d\udcc8\ud83d\udcb0"}, {"id": 729572008909000704, "indices": [60, 71], "id_str": "729572008909000704", "screen_name": "MykeTowers", "name": "Towers Myke"}, {"id": 775866464, "indices": [92, 99], "id_str": "775866464", "screen_name": "mariah", "name": "Kenzie peretti"}], "hashtags": [{"indices": [72, 83], "text": "myketowers"}, {"indices": [84, 91], "text": "mariah"}, {"indices": [100, 114], "text": "Desaparecemos"}, {"indices": [115, 121], "text": "music"}, {"indices": [122, 129], "text": "musica"}], "urls": []}, "retweeted": false, "coordinates": null, "source": "<a href=\"http://twitter-dummy-auth.herokuapp.com/\" rel=\"nofollow\">Music Twr Suggesting</a>", "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 18, "id_str": "1099558111000506369", "favorited": false, "retweeted_status": {"contributors": null, "truncated": true, "text": "Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]", .......

import re
import codecs

err_occur = []
pattern = re.compile(r'(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"')
input_filename = 'music_fixed.json'
tweets = open("tweets_380k.txt", "w")
try:
    with codecs.open(input_filename, encoding='utf8') as in_file:
        for line in in_file:
            matches = pattern.findall(line)
            if matches:
                for match in matches:
                    err_occur.append(match)
except FileNotFoundError:
    print("Input file %r not found." % input_filename)
for tagged in err_occur:
    tweets.write(str(tagged)+"\n")

2 Answers

User

#1 · Edited 2024-10-03 21:25:28

As others have said in the comments, you should probably use a JSON parser and take the field you need from the parsed result.
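A minimal sketch of the parser-based approach follows. It assumes the input holds one JSON object per line (the question's code reads the file line by line) and that a retweet can be recognized by its text starting with "RT"; the function name original_tweets and the sample lines are made up for illustration. Note that it only looks at the top-level "text" field, not at nested ones such as the one inside "retweeted_status".

```python
import json


def original_tweets(lines):
    """Yield the top-level text of each tweet that does not start with 'RT'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict):
            continue  # skip JSON values that are not objects
        text = tweet.get("text", "")
        if not text.startswith("RT"):
            yield text


# Hypothetical, shortened sample lines modelled on the question's data.
sample = [
    '{"contributors": null, "truncated": false, "text": "RT @BelloPromotions: Myke Towers"}',
    '{"contributors": null, "truncated": false, "text": "Myke Towers Ft. Mariah - Desaparecemos"}',
]
result = list(original_tweets(sample))
print(result)
```

Because the file is consumed one line at a time, this does not need to hold the whole input in memory.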

However, if your input isn't JSON (or pulling all of it into memory at once isn't feasible), then you should make a few adjustments to the regex.

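First (again, as others have already pointed out), .*? is "non-greedy" only in the sense that it prefers the shortest match; if a match exists at all, it will still find it, however far it has to stretch. I guess you can fix this with

(?:[^"\\]*\\.)*[^"\\]*

which matches only a string body that contains no unescaped double quotes.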
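As a small illustration of what "no unescaped double quotes" buys you, the sketch below compares a lazy .*? with an escape-aware string body on a made-up line. The names NAIVE and ESCAPE_AWARE and the sample line are assumptions for the example, and the fragment is written in the form (?:[^"\\]*\\.)*[^"\\]* so that a string may also begin with an escape or contain consecutive escapes.

```python
import re

# Lazy match: stops at the first double quote, escaped or not.
NAIVE = re.compile(r'"text": "(.*?)"')

# Escape-aware match: runs of ordinary characters interleaved with
# backslash escapes, so an escaped quote does not end the string.
ESCAPE_AWARE = re.compile(r'"text": "((?:[^"\\]*\\.)*[^"\\]*)"')

# Hypothetical input with escaped quotes inside the value.
line = r'{"text": "say \"hi\" now", "id": 1}'

naive = NAIVE.search(line).group(1)
aware = ESCAPE_AWARE.search(line).group(1)

print(naive)  # say \
print(aware)  # say \"hi\" now
```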

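Second, I guess you wanted [^R][^T] to skip matches that begin with RT; but that's not what it means. It requires a character that is not R followed by a character that is not T, so it also refuses to match AT or Re!

In Python (and PCRE-compatible regex in general), the way to say "must not match" is a negative lookahead, (?!RT).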
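The difference is easy to see on a few made-up strings (the sample list and variable names below are only for illustration):

```python
import re

# Two character classes: first char must not be R AND second must not be T.
char_class = re.compile(r'^[^R][^T]')

# Negative lookahead: only the exact two-character prefix "RT" is rejected.
lookahead = re.compile(r'^(?!RT)')

samples = ["RT @user: hi", "AT the show", "Ready to go", "Myke Towers"]

char_class_ok = [s for s in samples if char_class.match(s)]
lookahead_ok = [s for s in samples if lookahead.match(s)]

print(char_class_ok)  # ['Myke Towers']
print(lookahead_ok)   # ['AT the show', 'Ready to go', 'Myke Towers']
```

The character-class version wrongly drops "AT the show" and "Ready to go", while the lookahead drops only the string that really starts with RT.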

Putting these together, try

pattern = re.compile(r'(?:"contributors": "(?:[^"\\]*\\.)*[^"\\]*",'
    r' "truncated": "(?:[^"\\]*\\.)*[^"\\]*",'
    r' "text": ")((?!RT)(?:[^"\\]*\\.)*[^"\\]*)"')

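Please understand that I had to guess or read between the lines in several places here. If you can update your question to explain exactly what your data looks like and how you want the logic to work, this can probably be improved, or at least tweaked to do what you actually want.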
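In the sample shown in the question, "contributors" and "truncated" hold bare literals (null, false) rather than quoted strings. Under the assumption that this is always the case, one possible variant is sketched below; the name STRING_BODY and the shortened sample line are made up for the example, and the \w+ placeholders are a guess at the data, not something the question confirms.

```python
import re

# Body of a JSON string: ordinary characters interleaved with backslash escapes.
STRING_BODY = r'(?:[^"\\]*\\.)*[^"\\]*'

# Assumption: "contributors" and "truncated" are bare words (null, true, false).
pattern = re.compile(
    r'"contributors": \w+, "truncated": \w+, '
    r'"text": "((?!RT)' + STRING_BODY + r')"'
)

# Hypothetical, shortened line modelled on the question's sample.
line = (
    '{"contributors": null, "truncated": false, '
    '"text": "RT @BelloPromotions: Myke Towers", '
    '"entities": {"hashtags": [{"indices": [72, 83], "text": "myketowers"}]}, '
    '"retweeted_status": {"contributors": null, "truncated": true, '
    '"text": "Myke Towers Ft. Mariah - Desaparecemos\\n@myketowers"}}'
)

matches = pattern.findall(line)
print(matches)
```

Here the retweet text is rejected by the lookahead, the hashtag text is never reached because nothing in the pattern can stretch across other fields, and only the text inside "retweeted_status" is captured (with its \n escape left as written in the input).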

User

#2 · Edited 2024-10-03 21:25:28

How to limit regex results?

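Before I answer the question briefly, I should clarify why the current expression produces an unwanted result: in the subexpression (?:"contributors": .*?, "truncated": .*?, "text": "), the final .*?, although it is not greedy, matches all of this input: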

false, "text": "RT @BelloPromotions: Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica #musicanu\u2026", "is_quote_status": false, "in_reply_to_status_id": null, "id": 1099558111000506369, "favorite_count": 0, "entities": {"symbols": [], "user_mentions": [{"id": 943461023293542400, "indices": [3, 19], "id_str": "943461023293542400", "screen_name": "BelloPromotions", "name": "Bello Promotions \ud83d\udcc8\ud83d\udcb0"}, {"id": 729572008909000704, "indices": [60, 71], "id_str": "729572008909000704", "screen_name": "MykeTowers", "name": "Towers Myke"}, {"id": 775866464, "indices": [92, 99], "id_str": "775866464", "screen_name": "mariah", "name": "Kenzie peretti"}], "hashtags": [{"indices": [72, 83]

That is, it matches everything between the first "truncated": and the next , "text": that is not followed by "RT…", which is the one just before the unwanted "myketowers".

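So, to stop the expression from matching all of that input, we can't simply allow any character (.) between "truncated": and , "text":, but only the characters that make up the possible values false and true, or, for simplicity, only word characters (\w). It is therefore sufficient to change the above subexpression to (?:"contributors": .*?, "truncated": \w*, "text": ").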
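The effect of that one change can be checked on a shortened, made-up line modelled on the question's sample (the names original and fixed and the sample itself are only for illustration):

```python
import re

# The question's pattern: the second .*? may stretch across other fields.
original = re.compile(
    r'(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"'
)

# Same pattern with the value of "truncated" limited to word characters.
fixed = re.compile(
    r'(?:"contributors": .*?, "truncated": \w*, "text": ")([^R][^T].*?)"'
)

# Hypothetical, shortened line: a retweet with one hashtag entity.
line = (
    '{"contributors": null, "truncated": false, '
    '"text": "RT @BelloPromotions: Myke Towers", '
    '"entities": {"hashtags": [{"indices": [72, 83], "text": "myketowers"}]}, '
    '"retweeted_status": {"contributors": null, "truncated": true, '
    '"text": "Myke Towers Ft. Mariah"}}'
)

print(original.findall(line))  # ['myketowers', 'Myke Towers Ft. Mariah']
print(fixed.findall(line))     # ['Myke Towers Ft. Mariah']
```

With \w*, the hashtag text is no longer reachable from "truncated":, so only the nested tweet text remains. The [^R][^T] test is left as in the question here, so texts such as "AT…" or "Re…" would still be skipped.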
