How do I correctly scrape data from a web page?

Posted 2024-09-30 22:16:24


I am trying to use Python to get the "Last changed" date and time from this page:

https://www.apg.at/transparency/Visualization.aspx?PRESENTATIONDESCRIPTION=DAFTG&LANGUAGE=en#mode|,|table|,|from|,|20190216|,|resolution|,|15M

Page: https://imgur.com/a/hsVl7e1

Code: https://imgur.com/a/jHWcFDh

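I have tried the libraries bs4 (BeautifulSoup) and urllib in several different ways.

I do get some data back, but part of it is missing, including the part I need.

When I print the result, I expect to find "Last changed dd/mm/yyyy" somewhere in the output.

Is there a better approach, or am I missing something?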


1 answer
User
Answer 1 · posted 2024-09-30 22:16:24
import requests
import lxml.html as lh
import pandas as pd

url = "YOUR URL"  # placeholder: replace with the address of the page
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each cell of the first row (the header), store its text and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name,[]))

#Since our first row is the header, data is stored from the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]

    #i is the index of our column
    i=0

    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Leave the first column as text; for the other columns,
        #convert any numerical value to an integer
        if i>0:
            try:
                data=int(data)
            except ValueError:
                #Not an integer, keep the original text
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
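
The answer imports pandas but stops before using it. Presumably the collected columns are meant to end up in a DataFrame; that is an assumption, since the answer does not show this step. A minimal sketch of it follows, with made-up sample data standing in for the `col` list built above (the headers and values here are hypothetical, not taken from the page):

```python
import pandas as pd

# Sample data in the same shape as `col` above: a list of (header, values)
# tuples. These headers and values are invented for illustration only.
col = [
    ("Name", ["a", "b", "c"]),
    ("Value", [1, 2, 3]),
]

# Build a {header: values} mapping, then a DataFrame with one column per header.
columns = {title: values for title, values in col}
df = pd.DataFrame(columns)
print(df)
```

Note that `pd.DataFrame` expects every column list to have the same length, so rows with missing cells would need to be padded or skipped first.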
