Python script to extract a JavaScript variable from a page's HTML

Posted 2024-09-30 18:30:02


The head of each page on my site contains the following JavaScript:

<script type='text/javascript'>
var gaProperty = 'UA-00000000-1';
var disableStr = 'ga-disable-' + gaProperty;
if ( document.cookie.indexOf( disableStr + '=true' ) > -1 ) {
window[disableStr] = true;
}
function gaOptout() {
document.cookie = disableStr + '=true; expires=Thu, 31 Dec 2099 23:59:59 UTC; path=/';
window[disableStr] = true;
}
</script>

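I am trying to use Python to extract the value of var gaProperty (i.e. UA-00000000-1) from every page in a list of URLs stored in a CSV file. I am new to Python and pieced the following script together from a few scripts I have seen, but it does not work: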

from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re

list = []
with open('list.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        list.append(url) # Add each url to list contents
    

for url in list: 
    page = urlopen(url[0]).read()
    path = " ".join(url)
    soup = BeautifulSoup(page, "lxml")
    data = soup.find_all('script', type='text/javascript')
    gaid = re.search(r'UA-[0-9]+-[0-9]+', data[0].text)
    print(path, gaid)

The incorrect result I get is:

https:www.example.com/contact-us/ None

The output I need for each URL is:

https:www.example.com/contact-us/ UA-00000000-1

Any idea how to do this in Python?


1 answer
Answer 1 · posted 2024-09-30 18:30:02

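To be more specific, I would include var gaProperty in the pattern itself, and then let a capture group lazily capture everything between the single quotes that follow, i.e. the quotes that wrap the gaid value. Note that re.search returns a match object (or None), so the value has to be read with .group(1):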

import re

html ='''
<script type='text/javascript'>
var gaProperty = 'UA-00000000-1';
var disableStr = 'ga-disable-' + gaProperty;
if ( document.cookie.indexOf( disableStr + '=true' ) > -1 ) {
window[disableStr] = true;
}
function gaOptout() {
document.cookie = disableStr + '=true; expires=Thu, 31 Dec 2099 23:59:59 UTC; path=/';
window[disableStr] = true;
}
</script>'''

gaid = re.search(r"var gaProperty = '(.*?)'", html).group(1)
print(f'https:www.example.com/contact-us/ {gaid}')  # space before the ID, matching the desired output
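
To tie this back to the CSV of URLs in the question, here is a minimal sketch of how the same pattern might be applied to each page. It is an illustration, not a tested drop-in: the function names extract_ga_property, fetch, report and main are hypothetical, the file name list.csv comes from the question, and the pages are assumed to decode as UTF-8. Because the regex runs on the raw HTML, this sketch skips BeautifulSoup and uses urlopen from the standard library directly.

```python
import csv
import re
from urllib.request import urlopen

# Matches e.g.  var gaProperty = 'UA-00000000-1';
# Tolerates flexible spacing and either quote style (an assumption beyond the question).
GA_PATTERN = re.compile(r"""var\s+gaProperty\s*=\s*['"](.*?)['"]""")


def extract_ga_property(html):
    """Return the gaProperty value found in html, or None if it is absent."""
    match = GA_PATTERN.search(html)
    return match.group(1) if match else None


def fetch(url):
    """Download a page and decode it as UTF-8, replacing undecodable bytes."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def report(csv_file, fetch_page=fetch):
    """Yield (url, ga_id) using the first column of each non-empty CSV row."""
    for row in csv.reader(csv_file):
        if row:
            url = row[0].strip()
            yield url, extract_ga_property(fetch_page(url))


def main(path="list.csv"):
    """Print one 'url ga_id' line per URL listed in the CSV file."""
    with open(path, newline="") as csvf:
        for url, ga_id in report(csvf):
            print(url, ga_id)
```

Calling main() would print one line per URL, with None where no gaProperty assignment is found. The fetch_page parameter exists only so the extraction can be tried without network access, for example by passing a function that returns a fixed HTML string.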
