How do I extract data in Python and uls?

Posted 2024-10-02 22:35:06


I want to extract post_id from this CDATA block:

<script type='text/javascript' data-cfasync='false'>
  //<![CDATA[
    _SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
  //]]>
</script>

I can get the whole CDATA block, but I don't know what to do next.


2 answers

If you only need post_id, try a regex.

For example:

import re
s = r"""<script type='text/javascript' data-cfasync='false'>
  //<![CDATA[
    _SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
  //]]>
</script>"""
m = re.search(r'(?<="post_id":\")(?P<post_id>.*?)(?=\",\")', s)
if m:
    print(m.group('post_id'))

Output:

21132

This may not be the most elegant solution, but it works:

from bs4 import BeautifulSoup

html = r"""
<script type='text/javascript' data-cfasync='false'>
//<![CDATA[
    _SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
//]]>
</script>
"""

soup = BeautifulSoup(html, 'lxml')

dct = {}

for scr in soup.find_all('script'):
    for x in scr.text.split(','):
        if 'post_id' in x:
            k, v = x.replace('"', '').split(':')
            dct[k] = v

print(dct['post_id'])

Output:

21132
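
Both answers treat the script body as plain text. Since the value assigned to _SHR_SETTINGS in the question looks like a JSON object, another option is to parse it with the standard json module and read post_id by key. The following is only a sketch under that assumption; the function name extract_settings and its var_name parameter are made up for illustration and do not come from either answer.

```python
import json
import re

# Same script block as in the question (raw string so the "\/" escapes survive unchanged).
html = r"""<script type='text/javascript' data-cfasync='false'>
  //<![CDATA[
    _SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
  //]]>
</script>"""


def extract_settings(page_source, var_name="_SHR_SETTINGS"):
    """Return the object assigned to var_name as a dict, or None if not found.

    Hypothetical helper: assumes the assigned value is valid JSON.
    """
    # Locate "<var_name> = " and stop right before the opening brace.
    match = re.search(re.escape(var_name) + r"\s*=\s*(?=\{)", page_source)
    if match is None:
        return None
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it (here the trailing ";" and "//]]>").
    settings, _end = json.JSONDecoder().raw_decode(page_source[match.end():])
    return settings


settings = extract_settings(html)
print(settings["url_components"]["post_id"])
```

With the sample from the question this prints 21132, and the other fields (year, postname, author) can be read from the same dict. If the page ever assigns something that is not strict JSON, json will raise a ValueError, so this is less forgiving than the regex approach.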
