Simple web scraper formatting: how can I fix this?

Posted 2024-05-19 23:02:49


I have this code:

import requests
from bs4 import BeautifulSoup



def posts_spider():
    # Print the title and link of every post on the listing page.
    url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(url)
    plain_text = source_code.text
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.find_all('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")

def get_single_item_data():
    # Fetch the same listing page again and print every score on it.
    item_url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for rating in soup.find_all('div', {'class': 'score unvoted'}):
        print(rating.string)

posts_spider()
get_single_item_data()

The output is:

My light.. I'm seeing and feeling things.. what's happening?
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/


Why being the first to move in a new Subdivision is not the most brilliant idea...
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/


I Am Falling.
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/


Heidi
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/


I remember everything
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/


To Lieutenant Griffin Stone
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/


The woman in my room
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/


Dr. Margin's Guide to New Monsters: The Guest, or, An Update
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/


The Evil Woman (part 5)
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/


Blood for the blood god, The first of many.
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/


An introduction to the beginning of my journey
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/


A hunter..of sorts.
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/


Void Trigger
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/


What really happened to Amelia Earhart
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/


I Used To Be Fine Being Alone
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/


The Green One
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/


Elevator
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/


Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/


Cranial Nerve Zero
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/


Mom's Story About a Ghost Uncle
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/


It snowed.
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/


The pocket watch I found at a store
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/


You’re Going To Die When You Are 23
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/


The Customer: Part Two
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/


Dimenhydrinate
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/


•
•
•
•
•
12
12
76
4
2
4
6
4
18
2
6
13
5
16
2
2
14
48
1
13

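What I want is to put each post's matching score right next to it, so I can see at a glance how many points a post has, instead of printing the titles and links in one "block" and the scores in another. Thanks in advance for your help!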


1 answer
User
#1 · Posted 2024-05-19 23:02:49

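You can do it in a single pass by iterating over the div elements with class="thing" (think of it as iterating over the posts). For each div, get the link and the rating: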

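from urllib.parse import urljoin  # Python 3; in Python 2 this was `from urlparse import urljoin`

from bs4 import BeautifulSoup
import requests

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # Each div.thing wraps one post, so the link and score found inside it belong together.
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))

        print(link.string, href, rating.string)

posts_spider()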
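
A side note on urljoin, which the code above uses instead of the string concatenation from the question: the two give the same result for a relative href, but they differ when the href is already an absolute URL. The small sketch below shows the difference; the example.com link is a made-up value for illustration, not something taken from the page.

```python
from urllib.parse import urljoin

BASE = "http://www.reddit.com"

relative_href = "/r/nosleep/comments/2kvrnf/heidi/"
absolute_href = "http://example.com/story"  # hypothetical external link

# urljoin resolves a relative href against the base URL...
joined_relative = urljoin(BASE, relative_href)
# ...and leaves an href that is already absolute untouched.
joined_absolute = urljoin(BASE, absolute_href)
# Plain concatenation glues the two URLs together instead.
concatenated = BASE + absolute_href

print(joined_relative)
print(joined_absolute)
print(concatenated)
```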

FYI, div.thing is a CSS selector that matches every div with class="thing".
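
To try the per-post approach without hitting the network, here is a minimal sketch that runs the same selector against a small inline HTML string. The markup is hypothetical, only loosely modeled on the class names used above, and extract_posts is a helper name made up for this example. It also assumes a post may have no score element at all and returns None for it rather than failing on rating.string.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: three posts, the last one without a score div.
SAMPLE_HTML = """
<div class="thing">
  <div class="score unvoted">12</div>
  <a class="title" href="/r/nosleep/comments/2kvrnf/heidi/">Heidi</a>
</div>
<div class="thing">
  <div class="score unvoted">76</div>
  <a class="title" href="/r/nosleep/comments/2kvxvt/i_am_falling/">I Am Falling.</a>
</div>
<div class="thing">
  <a class="title" href="/r/nosleep/comments/2kuwxu/elevator/">Elevator</a>
</div>
"""

def extract_posts(html, base="http://www.reddit.com"):
    """Return one (title, url, score) tuple per div.thing in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for thing in soup.select("div.thing"):
        link = thing.find("a", {"class": "title"})
        if link is None:
            continue  # not a post container we know how to read
        # Searching for class "score" also matches class="score unvoted".
        rating = thing.find("div", {"class": "score"})
        score = rating.get_text(strip=True) if rating is not None else None
        posts.append((link.get_text(strip=True),
                      urljoin(base, link.get("href")),
                      score))
    return posts

posts = extract_posts(SAMPLE_HTML)
for title, href, score in posts:
    print(title, href, score)
```

Because the link and the score are both looked up inside the same div.thing, a post with a missing score cannot shift the scores of the posts after it, which is what happens when the two lists are scraped separately and lined up by position.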
