<p>You can use <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse" rel="nofollow noreferrer"><code>urlparse</code></a>, as others have suggested, or <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit" rel="nofollow noreferrer"><code>urlsplit</code></a>. However, neither of them handles a URL without a scheme, such as <code>www.cemexusa.com</code>. So, if you do not need the scheme in your key, you could use something like this:</p>
<pre class="lang-py prettyprint-override"><code>from urllib.parse import urlsplit

def to_key(url):
    # Prepend a scheme when it is missing so that urlsplit can find the host
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)
</code></pre>
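<p>To see why the scheme check is needed, here is a small sketch of what <code>urlsplit</code> returns with and without a scheme. The variable names are only for illustration. Without <code>://</code>, the whole string is treated as a path, so <code>hostname</code> is <code>None</code>:</p>

```python
from urllib.parse import urlsplit

# A URL with a scheme: the host is parsed as expected
with_scheme = urlsplit("https://www.cemexusa.com/find-your-location")

# The same host without a scheme: everything ends up in `path`
without_scheme = urlsplit("www.cemexusa.com")

print(with_scheme.hostname)     # www.cemexusa.com
print(without_scheme.hostname)  # None
print(without_scheme.path)      # www.cemexusa.com
```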
<hr/>
<p>Here is a complete working example:</p>
<pre class="lang-py prettyprint-override"><code>import pandas as pd
import io
from urllib.parse import urlsplit
df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")
df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")
df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)
def to_key(url):
    # Prepend a scheme when it is missing so that urlsplit can find the host
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)
joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))
# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)
</code></pre>
<p>The output of <code>print(joined)</code> will be:</p>
<pre class="lang-py prettyprint-override"><code>   Description                Key  Last Update
0    Junk Food  www.mcdonalds.com         2021
1        Cemex   www.cemexusa.com         2020
</code></pre>
<hr/>
<p>There may be other special cases that this answer does not handle. Depending on your data, you may also need to handle an omitted <code>www</code>:</p>
<pre class="lang-py prettyprint-override"><code>urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com
urlsplit("https://www.realpython.com").hostname # also a valid URL
# www.realpython.com
</code></pre>
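<p>If both forms should map to the same key, one option is to strip a leading <code>www.</code> from the hostname. This is only a sketch: <code>to_key_no_www</code> is a hypothetical helper, not part of the code above, and it assumes that <code>www.example.com</code> and <code>example.com</code> always refer to the same site in your data, which is not guaranteed in general:</p>

```python
from urllib.parse import urlsplit

def to_key_no_www(url):
    """Hypothetical variant of to_key that also drops a leading 'www.'."""
    # Prepend a scheme when it is missing so that urlsplit can find the host
    if "://" not in url:
        url = f"https://{url}"
    host = urlsplit(url).hostname or ""
    # Assumption: "www.example.com" and "example.com" are the same site
    if host.startswith("www."):
        host = host[len("www."):]
    return host

print(to_key_no_www("https://realpython.com/pandas-merge-join-and-concat"))
print(to_key_no_www("https://www.realpython.com"))
```

<p>Both calls should give <code>realpython.com</code>, so the two rows would end up with the same merge key.</p>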
<hr/>
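<p>What is the difference between <code>urlparse</code> and <code>urlsplit</code>?</p>
<p>It depends on your use case and on the information you want to extract. Since you do not need the URL's <code>params</code>, I would suggest using <code>urlsplit</code>.</p>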
<blockquote>
<p>[<code>urlsplit()</code>] is similar to <code>urlparse()</code>, but does not split the <code>params</code> from the URL. <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit" rel="nofollow noreferrer">https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit</a></p>
</blockquote>
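<p>As a small illustration of that difference, the sketch below parses one made-up URL (<code>example.com</code> and the <code>;type=a</code> parameter are placeholders) with both functions. Only the handling of <code>params</code> differs, and the hostname is the same either way:</p>

```python
from urllib.parse import urlparse, urlsplit

# Made-up URL with a ";params" part on the last path segment
url = "https://example.com/path;type=a?x=1"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse separates the params from the path
print(parsed.path)    # /path
print(parsed.params)  # type=a

# urlsplit leaves them inside the path and has no `params` field
print(split.path)     # /path;type=a

# The hostname, which is all the key needs, is identical
print(parsed.hostname == split.hostname)  # True
```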