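<p>You can do the whole thing with <code>requests</code> and <code>bs4</code> by mimicking the requests the page makes. You just need to loop over the regions in the right order and add the current region number to the <code>'CGEO'</code> parameter of each request.</p>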
<hr/>
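<p>This:</p>
<pre><code>soup = bs(s.get(url).content, 'lxml')
regions = {i.text.strip(): i['value'].strip() for i in soup.select('#REGIONSLIST option')}
</code></pre>
<p>collects the initial mapping of region names to their <code>option</code> values from the landing URL.</p>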
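<p>To make the shape of <code>regions</code> concrete, here is a minimal offline sketch that runs the same selector against a small hand-written snippet. The markup is an assumption (the real page's <code>select</code> element may look different), the second region is invented, and <code>html.parser</code> is used only so the sketch does not depend on <code>lxml</code>.</p>

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup: only the id and the first option's name/value come from
# the answer; the second option is made up for illustration.
html = """
<select id="REGIONSLIST">
  <option value="01"> Tanger-Tetouan-Al Hoceima </option>
  <option value="02"> Example Region </option>
</select>
"""

soup = bs(html, 'html.parser')

# Same comprehension as in the answer: region name -> option value
regions = {o.text.strip(): o['value'].strip() for o in soup.select('#REGIONSLIST option')}

print(regions)
```

<p>Since Python dictionaries preserve insertion order, looping over <code>regions.items()</code> visits the regions in the order they appear in the dropdown.</p>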
<hr/>
<p>This:</p>
<pre><code>for k, v in regions.items():
    params = (('type', 'Region'), ('CGEO', v), ('them', '5'))
</code></pre>
<p>sets the <code>CGEO</code> parameter from the <code>value</code> attribute of the region's <code>option</code> tag, e.g.
<code>Tanger-Tetouan-Al Hoceima</code> is <code>'01'</code>.</p>
<p>The <code>Region</code> option is set in the <code>type</code> parameter.</p>
<p>The <code>Langues locales utilisées</code> option is set in the <code>them</code> parameter, i.e. <code>'5'</code>.</p>
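<p>To see what the resulting query string roughly looks like, the three parameters can be encoded with the standard library. This is only a sketch: <code>build_query</code> is a hypothetical helper, and it assumes <code>requests</code> encodes these simple ASCII values the same way <code>urlencode</code> does.</p>

```python
from urllib.parse import urlencode


def build_query(cgeo):
    """Return the query string for one region code (hypothetical helper)."""
    params = (('type', 'Region'), ('CGEO', cgeo), ('them', '5'))
    return urlencode(params)


query = build_query('01')
print(query)  # type=Region&CGEO=01&them=5
```

<p>Only the <code>CGEO</code> value changes between requests; <code>type</code> and <code>them</code> stay fixed for every region.</p>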
<hr/>
<p>This:</p>
<pre><code>for y in range(3):
    row.extend([data[i-y+2]['DATA2014']])
</code></pre>
<p>simply reverses the order of the items, so that the <code>Ens, Fem, Masc</code> entries in the dictionaries within <code>data</code> are added to <code>row</code> in the required output order <code>Masc, Fem, Ens</code>.</p>
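<p>The index arithmetic is easier to follow on mock data. In the sketch below, the <code>INDICATEUR</code> strings and the numbers are invented, and <code>build_row</code> is a hypothetical helper; only the key names and the <code>Ens, Fem, Masc</code> ordering are taken from the answer.</p>

```python
# Mock records for one language, in the order the answer describes: Ens, Fem, Masc.
# The INDICATEUR strings and DATA2014 values are invented for illustration.
data = [
    {'INDICATEUR': 'ens_LangA', 'DATA2014': 30.0},
    {'INDICATEUR': 'fem_LangA', 'DATA2014': 20.0},
    {'INDICATEUR': 'masc_LangA', 'DATA2014': 10.0},
]


def build_row(region, data, i):
    """Build one output row from the three records starting at index i."""
    row = [region, data[i]['INDICATEUR'].split('_')[-1]]
    for y in range(3):
        # y = 0, 1, 2 maps to indices i+2, i+1, i: Masc, Fem, Ens
        row.append(data[i - y + 2]['DATA2014'])
    return row


row = build_row('Region A', data, 0)
print(row)  # ['Region A', 'LangA', 10.0, 20.0, 30.0]
```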
<hr/>
<p><strong>Py:</strong></p>
<pre><code>import requests
import pandas as pd
from bs4 import BeautifulSoup as bs


def add_rows(region, data):
    # Only the first third of data (the first table) is used; each language takes 3 records
    for i in range(0, len(data)//3, 3):
        row = [region, data[i]['INDICATEUR'].split('_')[-1]]
        # Records arrive as Ens, Fem, Masc; add them as Masc, Fem, Ens
        for y in range(3):
            row.extend([data[i-y+2]['DATA2014']])
        final.append(row)


url = 'http://rgphentableaux.hcp.ma/Default1'
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': url}
final = []

with requests.Session() as s:
    s.headers.update(headers)
    soup = bs(s.get(url).content, 'lxml')
    # Map each region name to the value attribute of its option tag
    regions = {i.text.strip(): i['value'].strip() for i in soup.select('#REGIONSLIST option')}

    for k, v in regions.items():
        params = (('type', 'Region'), ('CGEO', v), ('them', '5'))
        r = s.get(f'{url}/getDATA/', params=params)
        data = r.json()
        add_rows(k, data)

df = pd.DataFrame(final, columns=['Region', 'Lang', 'Masc', 'Fem', 'Ens'])
print(df)
</code></pre>
<hr/>
<p><strong>Edit:</strong></p>
<p>To get all 3 tables (ensemble, urbain, rural), adjust the custom function as shown below and add an additional loop, <code>for n in range(0, len(data), block)</code>:</p>
<pre><code>import requests
import pandas as pd
from bs4 import BeautifulSoup as bs


def add_rows(table, region, data_block):
    # data_block holds the records of one table; each language takes 3 records
    for i in range(0, len(data_block), 3):
        row = [table, region, data_block[i]['INDICATEUR'].split('_')[-1]]
        # Records arrive as Ens, Fem, Masc; add them as Masc, Fem, Ens
        for y in range(3):
            row.extend([data_block[i-y+2]['DATA2014']])
        final.append(row)


url = 'http://rgphentableaux.hcp.ma/Default1'
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': url}
tables = ['ens', 'urb', 'rur']
final = []

with requests.Session() as s:
    s.headers.update(headers)
    soup = bs(s.get(url).content, 'lxml')
    # Map each region name to the value attribute of its option tag
    regions = {i.text.strip(): i['value'].strip() for i in soup.select('#REGIONSLIST option')}

    for k, v in regions.items():
        params = (('type', 'Region'), ('CGEO', v), ('them', '5'))
        r = s.get(f'{url}/getDATA/', params=params)
        data = r.json()
        # The response holds the 3 tables back to back, so split it into equal blocks
        block = len(data)//3
        for n in range(0, len(data), block):
            table = tables[n//block]
            add_rows(table, k, data[n:n+block])

df = pd.DataFrame(final, columns=['Table', 'Region', 'Language', 'Masc', 'Fem', 'Ens'])
print(df)
</code></pre>
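<p>The block splitting used in the edited version can be checked in isolation. The sketch below uses a plain list of integers in place of the JSON records; <code>split_tables</code> is a hypothetical helper, and it assumes the response length divides evenly by the number of tables.</p>

```python
def split_tables(data, tables=('ens', 'urb', 'rur')):
    """Yield (table_name, block) pairs.

    Assumes len(data) is a multiple of len(tables), as the answer's code does.
    """
    block = len(data) // len(tables)
    if block == 0:
        return  # nothing to split; also avoids a zero step in range()
    for n in range(0, len(data), block):
        yield tables[n // block], data[n:n + block]


# Stand-in for the JSON response: 18 records, i.e. 6 per table
data = list(range(18))
chunks = list(split_tables(data))

for name, block in chunks:
    print(name, block)
```

<p>Each block can then be passed to <code>add_rows</code> together with its table name, which is what the inner loop in the edited code does.</p>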