Filter a DataFrame by dictionary values while assigning the dictionary keys to the matching rows?

import pandas as pd

urls_list = [
    'http://www.ajc.com/news/world/atlan...',
    'http://www.seattletimes.com/sports/...',
    'https://www.cjr.org/q_and_a/washing...',
    'https://www.washingtonpost.com/grap...',
    'https://www.nytimes.com/2017/09/01/...',
    'http://www.oregonlive.com/silicon-f...',
]
df = pd.DataFrame(urls_list, columns=['Links'])

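# urls_dict (publication name -> URL fragment) is not shown in the question.
pub_list = []
for row in df['Links']:
    # publication is reassigned on every pass of the inner loop, so only the
    # last entry of urls_dict decides the value that gets appended.
    for k, v in urls_dict.items():
        if row.find(v) > -1:
            publication = k
        else:
            publication = None
    pub_list.append(publication)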

2 answers

User
Answer 1 · edited 2024-09-28 03:16:40

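I was able to do this with a nested dictionary comprehension (or, alternatively, a nested list comprehension) plus a few extra DataFrame operations to clean up the column and drop the empty rows.
Using a nested dictionary comprehension (more precisely, a dictionary comprehension nested inside a list comprehension):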
df['Publication'] = [{k: k for k, v in urls_dict.items() if v in row} for row in df['Links']]

# Format the 'Publication' column to get rid of duplicate 'key' values.
# str() of a dict keeps the quotes around the key, so strip those as well.
df['Publication'] = (
    df['Publication'].astype(str)
    .str.strip('{}')
    .str.split(':', expand=True)[0]
    .str.strip('\'"')
)

# Remove blank rows from 'Publication' column
df = df[df['Publication'] != '']
Similarly, using a nested list comprehension:
[code block missing in the source]
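The code for the list-comprehension variant is not preserved here, so the following is only a guess at its general shape, not the answerer's original. It assumes `urls_dict` maps a publication name to a domain fragment; the entries below are borrowed from the second answer because the question's real dictionary is not shown. The name `matches` is likewise made up for illustration.

```python
import pandas as pd

# Assumed mapping of publication name -> domain fragment (hypothetical:
# the question's actual urls_dict is not shown in the thread).
urls_dict = {
    'Atlanta Journal-Constitution': 'ajc.com',
    'The Washington Post': 'washingtonpost.com',
    'The New York Times': 'nytimes.com',
}

# Truncated URLs copied as-is from the question.
urls_list = [
    'http://www.ajc.com/news/world/atlan...',
    'http://www.seattletimes.com/sports/...',
    'https://www.cjr.org/q_and_a/washing...',
    'https://www.washingtonpost.com/grap...',
    'https://www.nytimes.com/2017/09/01/...',
    'http://www.oregonlive.com/silicon-f...',
]
df = pd.DataFrame(urls_list, columns=['Links'])

# Inner comprehension: every key whose value occurs in the link.
# Outer comprehension: one such list per row.
matches = [[k for k, v in urls_dict.items() if v in row] for row in df['Links']]

# Keep the first match per row (None when nothing matched),
# then drop the rows that matched nothing.
df['Publication'] = [m[0] if m else None for m in matches]
df = df.dropna(subset=['Publication']).reset_index(drop=True)
```

Taking the first element of each inner list avoids the string round-trip used in the dictionary version, at the cost of silently ignoring any second match for the same link.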

User
Answer 2 · edited 2024-09-28 03:16:40

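What I would do is:
Use DataFrame.apply to add a new column to the DataFrame that contains only the domain.
Use DataFrame.merge (with the how='inner' option) to merge the two DataFrames on the domain field.
Using a loop to operate on a DataFrame is a bit dirty if the loop only iterates over columns or rows, and there is usually a DataFrame method that does the same thing more cleanly.
I can expand on this with an example if you like.
Edit: here it is. Note that I used a fairly poor regex for the domain capture.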
import re

import pandas as pd

def domain_extract(row):
    s = row['Links']
    # Optional scheme and 'www.' prefix, then capture the host up to the first '/'.
    # (A-Za-z replaces the original A-z, which also matched '[', ']', '^', '_' etc.)
    p = r'(?:(?:\w+)?(?::\/\/)(?:www\.)?)?([A-Za-z0-9.]+)\/.*'
    m = re.match(p, s)
    if m is not None:
        return m.group(1)
    else:
        return None

df['Domain'] = df.apply(domain_extract, axis=1)

dfo = pd.DataFrame({
    'Name': ['Atlanta Journal-Constitution', 'The Washington Post', 'The New York Times'],
    'Domain': ['ajc.com', 'washingtonpost.com', 'nytimes.com'],
})

df.merge(dfo, on='Domain', how='inner')[['Links', 'Domain', 'Name']]
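Since the answer itself describes its regex as poor, one possible alternative is to let the standard library parse the URL. The sketch below is not part of the original answer; `domain_from_url`, `links`, and `domains` are hypothetical names, and the truncated URLs are used exactly as they appear in the question.

```python
from urllib.parse import urlsplit

import pandas as pd


def domain_from_url(url):
    """Return the host name without a leading 'www.', or None if there is no host."""
    host = urlsplit(url).hostname  # None when the URL has no '//' network location
    if host is None:
        return None
    return host[4:] if host.startswith('www.') else host


# Truncated URLs copied as-is from the question.
links = pd.Series([
    'http://www.ajc.com/news/world/atlan...',
    'http://www.seattletimes.com/sports/...',
    'https://www.cjr.org/q_and_a/washing...',
    'https://www.washingtonpost.com/grap...',
    'https://www.nytimes.com/2017/09/01/...',
    'http://www.oregonlive.com/silicon-f...',
])
domains = links.map(domain_from_url)
```

The resulting values could be assigned to the `Domain` column in place of the regex-based function. Note that the merge still needs an exact match, so a link on a subdomain other than `www.` would not line up with the bare domain in the lookup table.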
