Extracting a URL from a JSON string in Python with re.match() or split()

Published 2024-10-06 06:58:29


With the following Python code I extract one specific part of a JSON file (a list inside a list, or part of a dictionary):

import json
import urllib

f = open('json-test-file-for-insta-url-snippet.json')
data = json.load(f)

print(json.dumps(data["event"]["attachments"][0]["text"]))

This is the result I get:

"\u201cUNLIMITED LIVE\u201d world tour moved to 2021!\nDue to the Covid-19 pandemic and the subsequent regulations and concert restrictions, the world tour, originally planned for the autumn of 2020, could not take place. \n\"\u201eI was very much looking forward to our tour in autumn 2020 all over the world, so I\u2019m deeply sorry that these concerts had to be rescheduled due to the Covid-19 pandemic. I\u2019m very happy that we have already found new dates for our tour in autumn 2021, because I cannot wait to return to get back on stage and to play for you guys. Take care of yourselves \u2013 I hope to see you all happy and healthy again very, very soon!\u201d \nAll your tickets remain valid for the new dates! Please find them below: \n\nKAZ Almaty - Sep 11, 2021\nRUS Yekaterinburg - Sep 14, 2021\nRUS Kazan, Sep 16, 2021\nRUS Voronezh - Sep 18, 2021\nRUS Krasnodar - Sep 20, 2021\nRUS Moscow - Sep 22, 2021\nRUS St. Petersburg - Sep 24, 2021\nUKR Kharkiv - Sep 26 2021\nUKR Odessa - Sep 28, 2021\nUKR Kiev - Sep 30, 2021\nITA Bolzano - Oct 13, 2021\nITA Bologna - Oct 15, 2021\nITA Genoa - Oct 16, 2021\nITA Milano - Oct 17, 2021\nITA Conegliano Veneto - Oct 19, 2021\nBG Sofia - Oct 24, 2021\nRO Bucharest - Oct 26, 2021\nRO Cluj - Oct 29, 2021  #davidgarrett #tour2021 #unlimited #live #postponed\n*Score* -2.57x | *Likes* 338 (-830) | *Comments* 13 (-46)\n_Posted on Tuesday, August 18 at 9:59 AM CEST <https://www.instagram.com/p/CEBew-xHwhJ/|(Instagram)>_\n_Received via Viral Alert_"

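Now I want to extract the Instagram URL near the end of that text. How can I do this in Python? Is a regular expression the only option, or is there a smarter way? I have read a lot on Stack Overflow, but nothing I found worked for me. Please help.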


3 Answers

You can use the following regular expression to extract the Instagram link from the text:

<(.+)\|\(Instagram\)>

It matches any text wrapped in <...|(Instagram)> and stores it in a capture group.

You can use it like this:

import re

INSTA_LINK_RE = re.compile(r'<(.+)\|\(Instagram\)>')

match = INSTA_LINK_RE.search(json.dumps(data["event"]["attachments"][0]["text"]))

if match:
    url = match[1]  # gets the first capturing group
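
As a self-contained sketch of the same idea, the snippet below runs the search on a shortened copy of the text from the question. The names SAMPLE_TEXT and extract_insta_url are made up for illustration, and the pattern is a slightly stricter variant of the one above (it uses [^|>]+ instead of .+ so the match cannot run past the link); treat it as one possible way to write it, not the only one.

```python
import re

# Shortened sample of the attachment text from the question (for illustration only).
SAMPLE_TEXT = (
    "RO Cluj - Oct 29, 2021  #davidgarrett #tour2021\n"
    "_Posted on Tuesday, August 18 at 9:59 AM CEST "
    "<https://www.instagram.com/p/CEBew-xHwhJ/|(Instagram)>_\n"
    "_Received via Viral Alert_"
)

# [^|>]+ stops at the first "|" or ">", so the capture ends where the URL ends.
INSTA_LINK_RE = re.compile(r"<([^|>]+)\|\(Instagram\)>")


def extract_insta_url(text):
    """Return the URL wrapped in <...|(Instagram)>, or None if there is none."""
    match = INSTA_LINK_RE.search(text)
    return match.group(1) if match else None


print(extract_insta_url(SAMPLE_TEXT))
```

Searching the text value directly, rather than the output of json.dumps(), avoids matching against escaped characters such as \n and \u201c.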

If you only want the shortcode, use this regex:

<https://www.instagram.com/p/(.+)/\|\(Instagram\)> 
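
A minimal sketch of the shortcode variant follows. The helper name extract_shortcode and the sample string are assumptions for illustration. The dots in the host name are escaped here so they match only a literal ".", and [^/|>]+ keeps the capture from running past the trailing slash.

```python
import re

# Captures only the shortcode between "/p/" and the next "/".
SHORTCODE_RE = re.compile(
    r"<https://www\.instagram\.com/p/([^/|>]+)/\|\(Instagram\)>"
)


def extract_shortcode(text):
    """Return the Instagram shortcode from a <...|(Instagram)> link, or None."""
    match = SHORTCODE_RE.search(text)
    return match.group(1) if match else None


# Sample fragment taken from the text in the question.
sample = "at 9:59 AM CEST <https://www.instagram.com/p/CEBew-xHwhJ/|(Instagram)>_"
print(extract_shortcode(sample))
```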

This approach works when you have a str object and search it with a str regex.

If the text is a bytes object, you need to decode it first:

# JSON files are normally encoded with UTF-8.
# On Python 3, json.dumps() already returns str, so this step applies only when you hold bytes.
json.dumps(data["event"]["attachments"][0]["text"]).decode('utf8')

... or use a bytes regex:

# note the `b` prefix for the regex pattern
INSTA_LINK_RE = re.compile(br'<(.+)\|\(Instagram\)>')
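
To illustrate the str/bytes distinction, here is a small sketch. The variable raw is a hypothetical bytes value (for example, something read from a file opened in binary mode), not data from the question. It shows that a bytes pattern returns bytes groups, and that searching a str with a bytes pattern raises TypeError.

```python
import re

# Bytes pattern (note the b prefix); it can only search bytes-like input.
INSTA_LINK_BYTES_RE = re.compile(br"<(.+)\|\(Instagram\)>")

# Hypothetical raw bytes, e.g. read from a file opened in binary mode.
raw = b"_Posted on Tuesday <https://www.instagram.com/p/CEBew-xHwhJ/|(Instagram)>_"

match = INSTA_LINK_BYTES_RE.search(raw)
url_bytes = match.group(1) if match else None

# The captured group is bytes; decode it to get a str URL.
url = url_bytes.decode("utf-8") if url_bytes is not None else None
print(url)

# Mixing a bytes pattern with a str subject is an error.
mixed_error = None
try:
    INSTA_LINK_BYTES_RE.search("a str value")
except TypeError as exc:
    mixed_error = exc
    print("mixing str and bytes fails:", exc)
```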

To get a dict containing str objects directly, you can also pass the encoding to the open function:

f = open('json-test-file-for-insta-url-snippet.json', encoding='utf-8')
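
A short sketch of loading the file this way is shown below. Because the question's JSON file is not available here, the example first writes a tiny stand-in file to a temporary path; the payload contents are an assumption that only mimics the nesting used in the question. The with statement also makes sure the file is closed after loading.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the question's JSON file, mimicking its nesting.
payload = {
    "event": {
        "attachments": [
            {"text": "\u201cUNLIMITED LIVE\u201d "
                     "<https://www.instagram.com/p/CEBew-xHwhJ/|(Instagram)>"}
        ]
    }
}

# Write the stand-in file to a temporary path.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", encoding="utf-8", delete=False
) as tmp:
    json.dump(payload, tmp)
    path = tmp.name

try:
    # Opening in text mode with an explicit encoding yields str values.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
finally:
    os.remove(path)

text = data["event"]["attachments"][0]["text"]
print(type(text).__name__)
```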

See the Python documentation for more information.

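import json

text = json.dumps(data["event"]["attachments"][0]["text"])
link = None
# Split the text at commas and look at each piece in turn.
for part in text.split(','):
    start = part.find('https:')
    if start != -1:
        # Keep everything from "https:" up to the "|" that ends the link.
        link = part[start:].split('|')[0]

First I split the data into a list, then walk through the list until I find a piece that contains https: (the link), then split that piece again at the end of the link and take the first element of the result.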
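
For comparison, the same split-based idea can be sketched with str.partition and str.rpartition, which avoids both the comma split and any fixed offsets. The function name extract_url_with_split and the marker default are assumptions for illustration; this relies on the link always being followed by the literal "|(Instagram)>".

```python
def extract_url_with_split(text, marker="|(Instagram)>"):
    """Return the URL that precedes marker and follows the nearest "<", or None."""
    # head is everything before the marker; sep is empty if the marker is absent.
    head, sep, _ = text.partition(marker)
    if not sep:
        return None
    # The URL is whatever follows the last "<" in head.
    _, lt, url = head.rpartition("<")
    return url if lt else None


# Sample fragment taken from the text in the question.
sample = (
    "_Posted on Tuesday, August 18 at 9:59 AM CEST "
    "<https://www.instagram.com/p/CEBew-xHwhJ/|(Instagram)>_"
)
print(extract_url_with_split(sample))
```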

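Since the result is a string, a regular expression is the smartest approach (it takes time to learn, but it is a very powerful tool). However, you could also use a module called instaloader. I don't know what you are working on, but instaloader is really helpful for Instagram.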
