How to remove "https://" and the string after .com from a URL in Python

Posted 2024-09-30 06:14:50

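I need to merge two DataFrames using the URL as the key. However, the URLs contain extra strings: in df1 I have https://www.mcdonalds.com/us/en-us.html, while in df2 I have https://www.mcdonalds.com

I need to strip /us/en-us.html (everything after .com) and https:// from the URL so that I can merge the two DataFrames on it. A simplified example is below. How can I solve this?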

df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your-location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}

df1['url']==df2['url']
Out[7]: False

Thanks

3 answers

Use urllib.parse.urlparse and isolate the hostname:

from urllib.parse import urlparse

urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'

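You can use urlparse, as others have suggested, or urlsplit. However, neither handles www.cemexusa.com, because it has no scheme. So if you don't need the scheme in your key, you could use something like this: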

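from urllib.parse import urlsplit

def to_key(url):
    # Prepend a scheme when it is missing so that urlsplit can find the host
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)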
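
To see why the scheme check matters, here is a minimal sketch using only the two sample URLs from the question. Without a scheme, the whole string is treated as a path and no hostname is found:

```python
from urllib.parse import urlsplit

# Without a scheme, the whole string is parsed as a path, so there is no hostname.
bare = urlsplit("www.cemexusa.com")
print(bare.hostname)   # None
print(bare.path)       # www.cemexusa.com

# Once a scheme is prepended, the host is recognised.
fixed = urlsplit("https://www.cemexusa.com")
print(fixed.hostname)  # www.cemexusa.com
```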

Here is a complete working example:

import pandas as pd
import io

from urllib.parse import urlsplit

df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")

df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")

df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)

def to_key(url):
    # Prepend a scheme when it is missing so that urlsplit can find the host
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)

joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))

# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)

The output of print(joined) will be:

  Description                Key  Last Update
0   Junk Food  www.mcdonalds.com         2021
1       Cemex   www.cemexusa.com         2020

There may be other special cases that this answer does not handle. Depending on your data, you may also need to deal with an omitted www:

urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com

urlsplit("https://www.realpython.com").hostname  # also a valid URL
# www.realpython.com
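
One possible way to handle that case is to drop a leading "www." from the hostname before using it as the key. The sketch below is an assumption about what "handling" could mean for your data; the helper name to_key_no_www is hypothetical and not part of the original answer:

```python
from urllib.parse import urlsplit

def to_key_no_www(url):  # hypothetical helper, for illustration only
    # Prepend a scheme when it is missing so that urlsplit can find the host
    if "://" not in url:
        url = f"https://{url}"
    host = urlsplit(url).hostname or ""
    # .hostname is already lower-cased; strip a single leading "www." label
    if host.startswith("www."):
        host = host[len("www."):]
    return host

print(to_key_no_www("https://realpython.com/pandas-merge-join-and-concat"))  # realpython.com
print(to_key_no_www("https://www.realpython.com"))                           # realpython.com
```

Note that this treats www.example.com and example.com as the same site, which is usually but not always true.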

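What is the difference between urlparse and urlsplit?

It depends on your use case and on the information you want to extract. Since you don't need the URL's params, I would suggest using urlsplit.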

[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
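
A small sketch of that difference, using a made-up URL on the placeholder domain example.com that carries a ";type=a" parameter on its last path segment. Both functions return the same hostname; only the handling of params differs:

```python
from urllib.parse import urlparse, urlsplit

url = "https://example.com/path;type=a?q=1"  # placeholder URL, for illustration only

parsed = urlparse(url)
split = urlsplit(url)

print(parsed.path, parsed.params)  # /path type=a
print(split.path)                  # /path;type=a
print(parsed.hostname == split.hostname)  # True
```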

Parsing URLs is not trivial. Take a look at the urllib module in the standard library.

Here is how to remove the path after the domain:

import urllib.parse

def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

df1['url'] = df1['url'].apply(remove_path)
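
Be aware that remove_path returns an empty string for a scheme-less value such as www.cemexusa.com, because urlparse treats the whole string as the path. A variant that also covers that case might look like the sketch below; the name base_url and the https default are assumptions, not part of the original answer:

```python
from urllib.parse import urlsplit, urlunsplit

def base_url(url, default_scheme="https"):  # hypothetical helper
    # Assume https when the scheme is missing, so the host is parsed as netloc
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    # Keep only scheme and host; drop path, query and fragment
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))

print(base_url("https://www.mcdonalds.com/us/en-us.html"))  # https://www.mcdonalds.com
print(base_url("www.cemexusa.com"))                         # https://www.cemexusa.com
```

Applied to the URL columns of both DataFrames, this gives matching values for the sample rows in the question.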
