Answers to: Python web scraping, how do I scrape this kind of website?



Category: Python Q&A


1 answer

Anonymous, 1 day ago

Skilled in: python, mysql, java

Here we use <code>requests</code>, <code>BeautifulSoup</code>, and <code>pandas</code>:

<pre><code>import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')

name = []
desc = []
cat = []
sub = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    # Each column of the listing table is a td cell identified by its class string.
    for item1 in soup.find_all('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        name.append(item1.text)
    for item2 in soup.find_all('td', attrs={'class': 'views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8'}):
        desc.append(item2.text)
    for item3 in soup.find_all('td', attrs={'class': 'views-field views-field-field-article-primary-category'}):
        cat.append(item3.text)
    for item4 in soup.find_all('td', attrs={'class': 'views-field views-field-created'}):
        sub.append(item4.text)

# Combine the four column lists into rows.
result = list(zip(name, desc, cat, sub))
df = pd.DataFrame(
    result, columns=['API Name', 'Description', 'Category', 'Submitted'])
df.to_csv('output.csv')
print('Task Completed, Result saved to output.csv file.')
</code></pre>

The result can be viewed online: <a href="https://sheet.zoho.com/sheet/editor.do?doc=886bbe3d1c94a844b456f19ed051845db227f822eb0dd237fe8fa6a7529ed5707f42235af0d7c9e303b85a8def35465bd922dd454afc384e26085db3d391d38c" rel="nofollow noreferrer">Check Here</a>

Sample output:

<a href="https://i.ibb.co/CHz0dmD/Capture.png" rel="nofollow noreferrer"><img src="https://i.ibb.co/CHz0dmD/Capture.png" alt="enter image description here"/></a>

Now for parsing the <code>href</code> links:

<pre><code>import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.programmableweb.com/category/all/apis?deadpool=0&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')

# Collect the detail-page link of every API on the listing pages.
links = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        for href in link.find_all('a'):
            result = 'https://www.programmableweb.com' + href.get('href')
            links.append(result)

# Visit each detail page and keep the text of its field spans.
spans = []
for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    span = [span.text for span in soup.select('div.field span')]
    spans.append(span)

df = pd.DataFrame(spans)
df.to_csv('data.csv')
print('Task Completed, Result saved to data.csv file.')
</code></pre>

View the result online: <a href="https://sheet.zoho.com/sheet/editor.do?doc=8059035d83f7549efc8d1e42de7adb3b50c56978993e7379e34d6137213248a05ad614759e69f95eabd37cf3be45dcacd5eec2e45c58ce6fa96e73acf48e76c6" rel="nofollow noreferrer">Here</a>

Sample view:

<a href="https://i.stack.imgur.com/OThba.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/OThba.png" alt="enter image description here"/></a>

If you want to combine the two <code>csv</code> files, here is the code:

<pre><code>import pandas as pd

a = pd.read_csv("output.csv")
b = pd.read_csv("data.csv")
# Both files were saved with their row index, which read_csv loads as the
# shared "Unnamed: 0" column, so merge() joins the rows on it. The rows only
# correspond if both scripts crawled the same listing (same deadpool value
# and the same number of pages).
merged = a.merge(b)
merged.to_csv("final.csv", index=False)
</code></pre>

Result online: <a href="https://sheet.zoho.com/sheet/editor.do?doc=041e1cfa23ee5a62418b155925380d179f1934cbc78c3c9cfe8a089da65dec3175b8cad03a09cc8d5b330dcea65895657872c513b23346264b15bafff42004ba" rel="nofollow noreferrer">Here</a>
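As a side note on the URL handling in the answer: the page URLs and the detail links are built by string concatenation. A minimal sketch of the same idea with the standard library's `urllib.parse` is shown below. The helper names `listing_url` and `absolute_link` are made up for illustration, and the path `/api/example` is a hypothetical href, not one taken from the site.

```python
from urllib.parse import urlencode, urljoin

# Site root and listing path as they appear in the answer's scripts.
BASE = "https://www.programmableweb.com"
LISTING_PATH = "/category/all/apis"


def listing_url(page, deadpool=0):
    """Build the URL of one listing page (hypothetical helper)."""
    query = urlencode({"deadpool": deadpool, "page": page})
    return f"{urljoin(BASE, LISTING_PATH)}?{query}"


def absolute_link(href):
    """Resolve an href from the listing table against the site root.

    urljoin leaves already-absolute URLs unchanged, which plain string
    concatenation would not.
    """
    return urljoin(BASE, href)


if __name__ == "__main__":
    print(listing_url(0, deadpool=1))
    print(absolute_link("/api/example"))  # hypothetical relative href
```

Passing the same `deadpool` value to `listing_url` in both crawls is one way to make sure the two CSV files describe the same set of rows before merging them.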
