Filter links in a DataFrame by specific domains only

url_id | link
------------------------------------------------------------------------
1      | http://www.example.com/somepath
2      | http://www.somelink.net/example
3      | http://other.someotherurls.ac.uk/thisissomelink.net&part/sample
4      | http://part.example.com/directory/files

from tld import get_tld
import pandas as pd

def urlparsing(row):
    url = row['link']
    res = get_tld(url, as_object=True)
    return res.fld

link = {"url_id": [1, 2, 3, 4],
        "link": ["http://www.example.com/somepath",
                 "http://www.somelink.net/example",
                 "http://other.someotherurls.ac.uk/thisissomelink.net&part/sample",
                 "http://part.example.com/directory/files"]}

domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']

df_link = pd.DataFrame(link)

ref_dom = []
for dom in domains:
    ddd = df_link[(df_link.apply(lambda row: urlparsing(row), axis=1)).str.contains(dom, regex=False)]
    ref_dom.append([dom, len(ddd)])

pd.DataFrame(ref_dom, columns=['domain', 'no_of_links'])

1 answer

User

#1 · Posted on 2024-09-28 22:19:18

You can do this with a regex and the findall method of the .str accessor:

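import re

domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
# One capture group per domain; re.escape keeps the dots literal
pat = "|".join(rf"https?://(?:\w*\.)?({re.escape(domain)})" for domain in domains)
# df_link is the DataFrame from the question.
# findall yields a list of tuples per row, so explode twice to get single strings
match = df_link["link"].str.findall(pat).explode().explode()
match = match[match.str.len() > 0]  # drop empty groups and rows with no match
match.groupby(match).count()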

Result

link
example.com     2
somelink.net    1
Name: link, dtype: int64

For pandas versions before 0.25 (which do not have explode)

import re

domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
pat = "|".join(rf"https?://(?:\w*\.)?({re.escape(domain)})" for domain in domains)

# Flatten each row's list of tuples by hand and keep the non-empty group
match = df_link["link"].str.findall(pat).apply(
    lambda found: "".join(group for groups in found for group in groups).strip()
)

match = match[match.str.len() > 0]  # drop rows with no match
match.groupby(match).count()

To also get the domains with 0 links, you can join the result with a DataFrame (or index) that contains all of the domains.
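One possible way to do that last step is sketched here. It assumes the df_link frame from the question and uses reindex against the full domain list rather than an explicit join; the output column names domain and no_of_links are borrowed from the question's code, and all_counts is just an illustrative name.

```python
import re
import pandas as pd

# Sample data copied from the question
df_link = pd.DataFrame({
    "url_id": [1, 2, 3, 4],
    "link": [
        "http://www.example.com/somepath",
        "http://www.somelink.net/example",
        "http://other.someotherurls.ac.uk/thisissomelink.net&part/sample",
        "http://part.example.com/directory/files",
    ],
})
domains = ["example.com", "other.com", "somelink.net", "sample.com"]

# Same pattern as in the answer: one capture group per domain
pat = "|".join(rf"https?://(?:\w*\.)?({re.escape(d)})" for d in domains)
match = df_link["link"].str.findall(pat).explode().explode()
match = match[match.str.len() > 0]
counts = match.groupby(match).count()

# Reindex against every domain so that unmatched ones show up with 0
all_counts = (
    counts.reindex(domains, fill_value=0)
          .rename_axis("domain")
          .reset_index(name="no_of_links")
)
print(all_counts)
```

The reindex call keeps the domains in the order of the original list, which makes the result line up with the table the question was trying to build.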
