Extracting a specific part from a page's source

Posted 2024-09-30 16:28:44

I'm trying to use a regex to extract a specific part of a page, but it isn't working.

This is the part I want to extract from the page:

{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}

Here is what I have tried so far:

import requests
import re


r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text

mystrx = re.search(r'^{"clickTrackingParams".*"voteStatus":"LIKE"}}]}}', html_source)

But it doesn't find a match.


1 answer

Answer 1 · posted 2024-09-30 16:28:44

Try this:

import requests
import re

r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text

fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'

# Find the first occurrence of the closing marker
end = html_source.find(snd)

# Find the closest opening marker before it; re.escape makes the pattern literal
start = max(idx.start() for idx in re.finditer(re.escape(fst), html_source) if idx.start() < end)

print(html_source[start:end+len(snd)])

Output:

{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}

If you want the second occurrence instead, you can try the following:

import requests
import re

r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text

fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'

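def find_nth(string, to_find, n):
    """
    Return the start index of the nth (0-based) match of to_find in string
    """

    # find all occurrences; re.escape treats to_find as literal text
    matches = [idx.start() for idx in re.finditer(re.escape(to_find), string)]

    # return the nth match (raises IndexError if there are fewer matches)
    return matches[n]

# find the second match (index 1)
end = find_nth(html_source, snd, 1)

# get the closest opening marker before end
start = max(idx.start() for idx in re.finditer(re.escape(fst), html_source) if idx.start() < end)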

print(html_source[start:end+len(snd)])

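Note: in the second example you can run into an IndexError if you ask for an occurrence beyond the number of matches found. You need to handle that case yourself.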
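
One possible way to handle that case is to return None instead of raising. This is a minimal sketch, not the original answer's code: `extract_nth` is a hypothetical helper, the `sample` string is made up, and `str.rfind` is swapped in for the `max(...)` expression to locate the closest opening marker that lies entirely before the closing one.

```python
import re


def find_nth(string, to_find, n):
    """Return the start index of the nth (0-based) literal match, or None."""
    matches = [m.start() for m in re.finditer(re.escape(to_find), string)]
    # Guard against asking for an occurrence that does not exist
    if n < 0 or n >= len(matches):
        return None
    return matches[n]


def extract_nth(text, fst, snd, n):
    """Return the nth block ending with snd, starting at the closest fst before it."""
    end = find_nth(text, snd, n)
    if end is None:
        return None
    # Last occurrence of fst that lies entirely before the closing marker
    start = text.rfind(fst, 0, end)
    if start == -1:
        return None
    return text[start:end + len(snd)]


sample = "<a>one</a> <a>two</a>"  # made-up stand-in for html_source
print(extract_nth(sample, "<a>", "</a>", 1))  # second occurrence
print(extract_nth(sample, "<a>", "</a>", 5))  # out of range, so None
```

Callers can then test the result against None rather than wrapping the call in a try/except block.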