How to remove "https://" and everything after .com from a URL in Python

df1 = {'url': ['https://www.mcdonalds.com/us/en-us.html', 'https://www.cemexusa.com/find-your-location']}
df2 = {'url': ['https://www.mcdonalds.com', 'www.cemexusa.com']}
df1['url'] == df2['url']
Out[7]: False

3 Answers

User

Answer 1 · edited 2024-09-30 06:14:50

Use urllib.parse.urlparse and isolate the hostname:

from urllib.parse import urlparse

urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'
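
One caveat, sketched here on the assumption that some of your URLs have no scheme (as www.cemexusa.com does in the question): urlparse only recognises a host when the string contains "//", so hostname is None for that kind of input. The list name urls is just for illustration.

```python
from urllib.parse import urlparse

urls = [
    "https://www.mcdonalds.com/us/en-us.html",
    "www.cemexusa.com",  # no scheme, so the whole string is treated as a path
]

hostnames = [urlparse(u).hostname for u in urls]
print(hostnames)
# ['www.mcdonalds.com', None]
```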

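User

Answer 2 · edited 2024-09-30 06:14:50

You can use urlparse as others have suggested, or you can use urlsplit. Neither of them handles www.cemexusa.com on its own, though, because that URL has no scheme. So if you don't need the scheme in your key, you can use something like this: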

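from urllib.parse import urlsplit

def to_key(url):
    # Prepend a scheme when it is missing so that urlsplit can find the hostname
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)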

Here is a complete working example:

import pandas as pd
import io

from urllib.parse import urlsplit

df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")

df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")

df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)

def to_key(url):
    # Prepend a scheme when it is missing so that urlsplit can find the hostname
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)

joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))

# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)

The output of print(joined) will be:

  Description                Key  Last Update
0   Junk Food  www.mcdonalds.com         2021
1       Cemex   www.cemexusa.com         2020

There may be other special cases that this answer does not handle. Depending on your data, you may also need to deal with an omitted www:

urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com

urlsplit("https://www.realpython.com").hostname  # also a valid URL
# www.realpython.com
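
If both forms should map to the same key, one option is to strip a leading "www." after extracting the hostname. This is only a sketch: to_key_no_www is a hypothetical variant of to_key above, and it assumes that www.example.com and example.com really do refer to the same site in your data, which is not guaranteed in general.

```python
from urllib.parse import urlsplit

def to_key_no_www(url):
    """Hypothetical variant of to_key that also drops a leading 'www.'."""
    if "://" not in url:
        url = f"https://{url}"  # assumption: default to https when no scheme is given
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host

print(to_key_no_www("https://realpython.com/pandas-merge-join-and-concat"))
# realpython.com
print(to_key_no_www("https://www.realpython.com"))
# realpython.com
print(to_key_no_www("www.cemexusa.com"))
# cemexusa.com
```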

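What is the difference between urlparse and urlsplit?

It depends on your use case and on the information you want to extract. Since you don't need the URL's params, I suggest using urlsplit.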

[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
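
A small illustration of that difference, using a made-up example.com URL that carries a ";type=a" parameter on its last path segment. The hostname is the same either way, which is why the choice does not matter for building the key:

```python
from urllib.parse import urlparse, urlsplit

url = "https://example.com/path;type=a?q=1"  # hypothetical URL with params

parsed = urlparse(url)
split = urlsplit(url)

print(parsed.path, parsed.params)
# /path type=a
print(split.path)
# /path;type=a
print(parsed.hostname == split.hostname)
# True
```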

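User

Answer 3 · edited 2024-09-30 06:14:50

URLs are not simple to parse. Take a look at the urllib module in the standard library.

Here is how to remove the path after the domain: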

import urllib.parse

def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

df1['url'] = df1['url'].apply(remove_path)
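
Note that remove_path as written keeps any query string or fragment, and for a scheme-less value such as www.cemexusa.com the whole string is parsed as the path, so the result is an empty string. A possible variant that covers both cases is sketched below. strip_to_origin is a hypothetical name, and defaulting to https is an assumption about your data.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_to_origin(url):
    """Hypothetical helper: keep only the scheme and host of a URL."""
    if "://" not in url:
        url = f"https://{url}"  # assumption: default to https when no scheme is given
    parts = urlsplit(url)
    # Rebuild the URL with empty path, query and fragment
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))

print(strip_to_origin("https://www.mcdonalds.com/us/en-us.html?x=1#top"))
# https://www.mcdonalds.com
print(strip_to_origin("www.cemexusa.com/find-your-location"))
# https://www.cemexusa.com
```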
