Scraping information from the web and writing it to a CSV file

Posted 2024-05-18 20:54:30


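I need to parse information out of an HTML page with Python (BeautifulSoup or Scrapy) and then write it to a CSV file. The information I am after is the file name and the number of views for each item shown on my archive.org account page.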

The relevant HTML for the view count:

<div class="hidden-tiles views C C1">
      <nobr class="hidden-xs">num </nobr>
      <nobr class="hidden-sm hidden-md hidden-lg">num</nobr>
</div>

The relevant HTML for the file name:

<div class="ttl">
       {filename}
</div>

This is what I have managed so far:

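import requests
from bs4 import BeautifulSoup  # this import was missing in the original attempt

# The URL originally contained stray spaces after "details"
page = requests.get("https://archive.org/details/%40kareem76?&sort=-publicdate&page=2")
nbr = BeautifulSoup(page.content, 'html.parser')
# Returns the matching <div> tags, but does not yet extract their text
nbr.find_all('div', class_='hidden-tiles views C C1')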
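
The attempt above stops at locating the tags. A minimal sketch of the remaining step, pulling the title and the first view count out of each item, might look like the following. It runs against an inline sample modelled on the markup quoted in the question, so the title and numbers are placeholders rather than real data. The `div.item-ia` wrapper is an assumption borrowed from the selector used in the first answer, and `extract_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the markup quoted in the question.
# The title and numbers are placeholder values, not data from the real page.
SAMPLE_HTML = """
<div class="item-ia">
  <div class="ttl">
    Example title
  </div>
  <div class="hidden-tiles views C C1">
    <nobr class="hidden-xs">1,056 </nobr>
    <nobr class="hidden-sm hidden-md hidden-lg">1.1K</nobr>
  </div>
</div>
"""


def extract_rows(html):
    """Return a list of (title, views) tuples, one per item block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumption: every item sits in its own div.item-ia wrapper.
    for item in soup.select("div.item-ia"):
        title = item.select_one("div.ttl")
        # A CSS selector matches on a single class, so it does not depend
        # on the exact order of the classes in the attribute.
        views = item.select_one("div.views nobr")
        if title is None or views is None:
            continue  # skip blocks that lack either field
        rows.append((title.get_text(strip=True), views.get_text(strip=True)))
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Pairing the title and the view count inside the same item block keeps the two values aligned even if some block on the page lacks one of them.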

2 answers

Here is another possible solution:

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
url = 'https://archive.org/details/@kareem76?&sort=-publicdate&page=2'
html = req.get(url)
doc = SimplifiedDoc(html)
blocks = doc.selects('div.results>div.item-ia').notContains(['mobile-header','hidden-tiles','collection-ia'],attr='class')
for block in blocks:
  nums = block.selects('div.hidden-tiles views C C1>nobr>text()')
  title = block.select('div.ttl>text()')
  print (title, nums[0],nums[1])

Result:

ننتصر او ننتصر من اجل الربيع العربي المنصف المرزوقي 1,056 1.1K
الرحلة مذكرات آدمي المنصف المرزوقي ط.مزيدة و منقحة 874 874
الثورة التونسية المجيدة، بنية ثورة وصيرورتها من خلال يومياتها عزمي بشارة الطبعة الثانية 469 469
The Case For Impeachment Allan J. Lichtman 65 65
CONTRAT ASSURANCE CREDIT MACRON ALLIANZ 137 137
...
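
The output above contains the view count in two forms: an exact one such as 1,056 and an abbreviated one such as 1.1K. If the CSV is meant to hold plain numbers, the strings could be normalized first. The following is only a sketch: `parse_view_count` is a hypothetical helper, and the handling of an "M" suffix is an assumption, since only "K" appears in the sample output.

```python
def parse_view_count(text):
    """Convert strings such as '1,056', '874' or '1.1K' to an int."""
    cleaned = text.strip().replace(",", "")
    # Assumption: abbreviated counts use K for thousands and M for millions.
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = cleaned[-1:].upper()
    if suffix in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    return int(cleaned)


for raw in ["1,056", "1.1K", "874"]:
    print(raw, parse_view_count(raw))
```

The abbreviated form is rounded by the site, so the exact form is the better source whenever both are available.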

This code should do the job:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download the account page
html = requests.get("https://archive.org/details/@kareem76").text

soup = BeautifulSoup(html, 'html.parser')
# One title per <div class="ttl">
titles = [i.text.strip() for i in soup.find_all('div', class_='ttl')]
# First <nobr> inside each views <div> holds the view count
views = [i.find('nobr').text.strip() for i in soup.find_all('div', class_='hidden-tiles views C C1')]

df = pd.DataFrame({'titles': titles,
                   'views': views})

# Write the table to CSV without the DataFrame index column
df.to_csv("titles-views.csv",
          mode='w',
          index=False,
          header=True)

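You will get a CSV file with a titles column and a views column. (The original answer showed an excerpt of the result as an image, which is not reproduced here.)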
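
If pandas is not otherwise needed, the same file can be written with the standard library `csv` module. The sketch below is an alternative to the answer's approach, not a part of it. The two rows are taken from the sample output shown in the first answer, `write_rows` is a hypothetical helper name, and an in-memory buffer stands in for the file so the example runs without touching disk.

```python
import csv
import io

# Two rows taken from the sample output of the first answer.
rows = [
    ("The Case For Impeachment Allan J. Lichtman", "65"),
    ("CONTRAT ASSURANCE CREDIT MACRON ALLIANZ", "137"),
]


def write_rows(rows, fileobj):
    """Write a header line followed by one CSV line per (title, views) pair."""
    writer = csv.writer(fileobj)
    writer.writerow(["titles", "views"])
    writer.writerows(rows)


# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write an actual file, pass `open("titles-views.csv", "w", newline="", encoding="utf-8")` instead of the buffer. The explicit UTF-8 encoding matters for this account, since many of the titles are in Arabic.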
