如何在Python中使用正则表达式从重定向URL提取URL?

2024-07-01 08:35:27 发布

您现在位置:Python中文网/ 问答频道 /正文

我需要从下面的test_string中获取实际的URL

测试字符串(部分显示):

An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020

部分测试字符串的所需输出

https://www.sciencedirect.com/science/article/pii/S0010218020301346

我一直试图通过将下面给出的MWE应用于许多字符串来获得这个结果,但它只给出一个URL

MWE

from urlparse import urlparse, parse_qs
import re
from re import search

test_string = '''
Production, Properties, and Applications of ALPHA-Terpineol
<http://scholar.google.com/scholar_url?url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&hl=en&sa=X&d=12771069332921982368&scisig=AAGBfm1tFjLUm7GV1DRnuYCzvR4uGWq9Cg&nossl=1&oi=scholaralrt&hist=v2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>

A Sales, L de Oliveira Felipe, JL Bicas
Abstract ALPHA-Terpineol (CAS No. 98-55-5) is a tertiary monoterpenoid 
alcohol widely
and commonly used in the flavors and fragrances industry for its sensory 
properties.
It is present in different natural sources, but its production is mostly 
based on ...
Save 
<http://scholar.google.com/citations?update_op=email_library_add&info=oB2z7uTzO7EJ&citsig=AMD79ooAAAAAYLfmix3sQyUWnFrHeKYZxuK31qlqlbCh&hl=en> 
    Twitter 
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=tw&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g> 
    Facebook 
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=fb&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g> 

An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Butanol/diesel blend is considered as a very promising alternative fuel
with
agreeable combustion and emission performance in engines. This paper
intends to
further investigate its autoignition characteristics with the combination
of a heated =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3DE=
27Gd756Qj4J&citsig=3DAMD79ooAAAAAYImDxwWCwd5S5xIogWp9RTavFRMtTDgS&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>

Using NMR spectroscopy to investigate the role played by copper in prion
diseases.
<http://scholar.google.com/scholar_url?url=3Dhttps://europepmc.org/article/=
med/32328835&hl=3Den&sa=3DX&d=3D16122276072657817806&scisig=3DAAGBfm1AE6Kyl=
jWO1k0f7oBnKFClEzhTMg&nossl=3D1&oi=3Dscholaralrt&hist=3Dv2Y_3P0AAAAJ:179499=
55323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
RA Alsiary, M Alghrably, A Saoudi, S Al-Ghamdi=E2=80=A6 - =E2=80=A6 and of =
the Italian
Society of =E2=80=A6, 2020
Prion diseases are a group of rare neurodegenerative disorders that develop
as a
result of the conformational conversion of normal prion protein (PrPC) to
the disease-
associated isoform (PrPSc). The mechanism that actually causes disease
remains =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3Dz=
pCMKavUvd8J&citsig=3DAMD79ooAAAAAYImDx3r4gltEWBAkhl0g2POsXB9Qn4Lk&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>


'''

regex = re.compile('(http://scholar.*?)&')
url_all = regex.findall(test_string)
citation_url = []
for i in url_all:
    if search('scholar.google.com',i):
        qs = parse_qs(urlparse(i).query).values()
        if search('http',str(qs[0])):
            citation_url.append(qs[0])
print citation_url

当前输出

https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf

所需输出

https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
https://www.sciencedirect.com/science/article/pii/S0010218020301346
https://europepmc.org/article/med/3232883

如何在Python中处理URL文本的等号换行和提取重定向URL


Tags: andofthehttpscomhttpurlpdf
2条回答

可以使用字符类匹配问号或符号[&?]。查看示例数据,对于url=部分,可以添加可选的换行符和可选的等号,并进行相应的调整

有些URL以3D开头,您可以使用非捕获组(?:3D)?使该部分成为可选部分

然后在组1中捕获匹配的http,然后匹配除&之外的所有字符

\bhttp://scholar\.google\.com.*?[&?]\n?u=?\n?r\n?l\n?=(?:3D)?(http[^&]+)

Regex demo

看到这个正则表达式模式了吗?我想它可能有助于提取重定向uri

(http:\/\/scholar[\w.\/=&?]*)[?]?u[=]?rl=([\w\:.\/\-=]+)

也可以在这里看到这个例子https://regex101.com/r/dmkF3h/3

相关问题 更多 >

    热门问题