<h2 id="conclusion-07-25-2021-crbt">Conclusion - 07-25-2021</h2>
<hr/>
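<p><em><strong>After looking into this in more detail, I believe it is not technically feasible to scrape the website and table in your question with Python <code>Requests</code>.</strong></em></p>
<p><em><strong>This means your problem cannot be solved the way you would like.</strong></em></p>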
<hr/>
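<p><strong>Why?</strong></p>
<p>The website employs anti-scraping mechanisms.</p>
<p>The <code>GBK</code> value is only one part of these mechanisms. The table you want to scrape has 1504 pages. Each time you navigate from one page to the next, starting from page 1, a new unique <code>GBK</code> value is created. That gives 1503 unique <code>GBK</code> values.</p>
<p>The site also uses a unique session-management cookie for each page.</p>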
<pre><code># page 1 cookie
JSESSIONID=0AC56294FE6857A236F0E68A9106E1AE.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPEbke3aQqqqm_6QLwIaUKu0tMygss.En464jhvNz1mMzbOatzmLLtv9x_xiCP6JaO_JzcbvHqtsnQYydBa6B_YjSg6sFm7cVBBOhB35_.TZuwDsbOnDinJkNwMs3AaMPtM83dP9YnogFKHpNJo5.RHMTKT6_XNPr0mxebR6stRrQ7LFfACcWqHHhbc.j6gZfZzxsgwnPE3RGP6aT9nYuMJbvK2EGrdAv0O12G03KTk_BMk.xLeEwrQq5VjyH1tB7t4wQ.jQ1geshvbDPCs8_VHCkd2.6uIag5Md.lngzeDshhSjMrmBjyy0HTqAXQ3; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583
# page 2 cookie
JSESSIONID=651AD12FD349FFB1842E08CA578EA37D.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPeKke3e7qqqm_6Q_YEqK9dBPNnJQF00YvHDMLHlJeb.4rrpTsgfwZxU0S5OXIAB2aduoOTmj7RuKIL.LUXRaRqfh5ZByuTFX3LxK1Ia3sr3V45c.PPx6Eas5EF5EkQztquzrX78QIbjrJUcQoKoOKcqgX5UuRIN0gCyGDyI6FFj.JbPhwYf65Hcx9BzDQnrlGAPHM3WGvmKf7OJnLY1SGIuxtdyVUE359Ll2lr0QJxUq1Dacqz_WsFa_ZantBbP7MklHX6J21wmDnyo6s4xCeeTYwsGq.kGUbE74Dx.QjQBCM_SiLKccTog8_EdBDg; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583
# page 3 cookie
JSESSIONID=2121D74E0EFCEC3BE104DAA2791481B6.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPeKke3e7qqqm_6wBfEGBZsTF9_uGtgepzPXNOzFh0RNtGcE1Cf4hEQNppVywcI5mk3SlLkzvNll6ovr4XmfL2Ujy3AFZR5leVY2H3_584At3GmIwmnsEjOx5v5e_lMon3AbX9t2W8UiLoK.9SBX0vgNRfkqdpyPjWKk3Zs8gQG0k3_6UwxGTvEwWkaWL8vquJgCGlvLEFTjNvd07eHiR482UfpLPFP6yAkx8Wi9pM79cL.26KE3U2L79hgBKLHyOdNyj3VKOkDsaXefNdPXd.YqT4kevShGxzMM2PuzqnuuQnW.GQ5mr9Rx8VxUjEa; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583
</code></pre>
<p>So you would need to obtain a unique <code>GBK</code> value and a unique session-management cookie for every page from 2 to 1504.</p>
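<p>To see which parts of the cookie actually change between pages, the headers can be compared name by name. The sketch below is only an illustration: the helper names are mine, and the cookie values are shortened stand-ins for the page 1 and page 2 cookies shown above.</p>

```python
def parse_cookie_header(header):
    """Split a 'name=value; name=value' Cookie header into a dict."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def changed_cookies(before, after):
    """Return the sorted cookie names whose values differ between two headers."""
    old, new = parse_cookie_header(before), parse_cookie_header(after)
    return sorted(name for name in old.keys() | new.keys()
                  if old.get(name) != new.get(name))


# Shortened stand-ins for the page 1 and page 2 cookies (values truncated).
page_1 = "JSESSIONID=0AC5.7; neCYtZEjo8GmS=5El5; neCYtZEjo8GmT=53HTPEbk; acw_tc=276a"
page_2 = "JSESSIONID=651A.7; neCYtZEjo8GmS=5El5; neCYtZEjo8GmT=53HTPeKk; acw_tc=276a"

print(changed_cookies(page_1, page_2))  # ['JSESSIONID', 'neCYtZEjo8GmT']
```

<p>In the three cookies captured above, <code>JSESSIONID</code> and <code>neCYtZEjo8GmT</code> change on every page, while <code>neCYtZEjo8GmS</code> and <code>acw_tc</code> stay the same.</p>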
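<p>I also noticed that the site applies some kind of delay. The first page can take a while to load completely. If you try to navigate to another page before this page has finished loading, you get the message "请勿频繁操作!" ("Do not operate so frequently!").</p>
<p>Some pages take up to 2 minutes to load, which is not worth the wait. When they fail to load, the message above is displayed.</p>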
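<p>Whatever tool ends up fetching the pages, it has to recognise that notice and slow down. A minimal sketch of that idea, assuming a hypothetical <code>fetch_page</code> callable that returns the page HTML as a string (the function names, retry count, and delays are my own choices, not something the site documents):</p>

```python
import time

# Text the site returns when requests arrive too quickly.
THROTTLE_MESSAGE = "请勿频繁操作"


def is_throttled(html):
    """Return True if the HTML contains the site's rate-limit notice."""
    return THROTTLE_MESSAGE in html


def fetch_with_backoff(fetch_page, page, max_attempts=4, base_delay=5.0,
                       sleep=time.sleep):
    """Call fetch_page(page) until the response is not the throttle notice.

    fetch_page is a hypothetical callable that returns the page HTML.
    The delay doubles after every throttled attempt.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        html = fetch_page(page)
        if not is_throttled(html):
            return html
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2
    raise RuntimeError(f"page {page} still throttled after {max_attempts} attempts")
```

<p>The <code>sleep</code> parameter is injectable so the retry logic can be exercised without actually waiting.</p>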
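<p><em><strong>As mentioned before, you should consider using <code>selenium</code> to scrape this site, which might bypass the anti-scraping mechanisms.</strong></em></p>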
<h2 id="update-post-selenium-07-27-2021-2fc2">Updated post (Selenium) - 07-27-2021</h2>
<hr/>
<p>I tried to scrape your target website with <code>selenium</code>. <code>chromedriver</code> consistently failed to connect to the site, even with the following switches:</p>
<pre><code>from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
</code></pre>
<p>I also tried <a href="https://pypi.org/project/undetected-chromedriver/" rel="nofollow noreferrer">undetected-chromedriver</a>, but it could not connect properly either.</p>
<p>Even when I set a high timeout with <code>driver.set_page_load_timeout()</code> or <code>driver.implicitly_wait()</code>, the session still failed with <code>chromedriver</code>.</p>
<p>When I used <code>selenium</code> with <code>geckodriver</code>, I was able to access the website. There were still timeout problems, but adding a <code>WebDriverWait</code> combined with an <code>expected_conditions</code> check seemed to overcome some of them.</p>
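<p>For reference, the <code>WebDriverWait</code> plus <code>expected_conditions</code> combination looks roughly like the sketch below. This is not the exact code I ran: the function name is illustrative, the locator must be replaced with a real element on the page, and the 120 second default simply reflects the load times of up to 2 minutes mentioned above.</p>

```python
def load_page_with_wait(url, locator, timeout=120):
    """Open url in Firefox and wait until the element at locator is present.

    Returns the page source, or None if the wait timed out.
    """
    # Imported here so the sketch can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # requires geckodriver to be available
    try:
        driver.get(url)
        # Block until the element exists in the DOM or the timeout expires.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(locator)
        )
        return driver.page_source
    except TimeoutException:
        return None
    finally:
        driver.quit()
```

<p>A call would look like <code>load_page_with_wait(url, (By.ID, "content"))</code>, with <code>By</code> imported from <code>selenium.webdriver.common.by</code>; the <code>"content"</code> id is a placeholder, not something I verified on the site.</p>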
<p>Because of the anti-scraping mechanisms, scraping this website is going to be a long and arduous process.</p>
<h2 id="original-post-07-24-2021-h9v1">Original post - 07-24-2021</h2>
<hr/>
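<p><em>Let me start by saying that your question does not contain enough detail to give solid advice on how to solve your problem.</em></p>
<p>I looked into your problem. I found that your target website uses some JavaScript to dynamically create the <code>GBK</code> value used by each POST request.</p>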
<pre><code>6SQk6G2z:GBK-5lkb7acLMDDxywZsCHoJagJlT50f1gw4.jaVgaBpBcGZDs1T_pcR_OPFgvOm_6oM8PfyL3L6xDPxFqgIqgwbVAEw8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ
</code></pre>
<p><a href="https://i.stack.imgur.com/BBnhT.jpg" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/BBnhT.jpg" alt="enter image description here"/></a></p>
<p>This <code>GBK</code> value is used when navigating between page 1 and page 1504. I noticed that the value changes for every page.</p>
<p><a href="https://i.stack.imgur.com/Sb2Vc.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/Sb2Vc.png" alt="enter image description here"/></a></p>
<pre><code>import difflib
# page 2
a = "6SQk6G2z:GBK-5lkb7acLMDDxywZsCHoJagJlT50f1gw4.jaVgaBpBcGZDs1T_pcR_OPFgvOm_6oM8PfyL3L6xDPxFqgIqgwbVAEw8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ"
# page 1504
b = "6SQk6G2z:GBK-59tY9cXfYPiYfpgB1rj16jFZNwQuke.NUV5ZljqD6daOH4pxgaFcRE7bERjrvfoY4OTl5PAWUo70VNRIqnYOi_TQCSWzvrcCgfTtEFl_ZdMHRVLhosJLSFwHiPdVn4cXZ7VnF5xahstqJHD6EBfd71iZT8HQBmx1dssd7RWA2Gdv8lGhJbS0ZeaxIVkfK5qaO.lxHVvG_9cq4weBdHeUQlGlIWhxKFYePkTr9Jp0eN2yDTZljeX0XWWOxIjEkdj89FOqaNDB2slUE.54oC96baGe7lttoz_2AoTbjHSTjfDh.eSyT6vA6.5dP5X.4XsFVYSnYKIznIdkjTURmm3kjvGM_iQoYT3V5gAKs1c6r6cE"
s = difflib.SequenceMatcher(None, a, b, autojunk=False)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    if tag != 'equal':
        print('{:7} a[{}:{}] > b[{}:{}] {!r:>8} > {!r}'.format(
            tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
# output
insert a[14:14] > b[14:49] '' > '9tY9cXfYPiYfpgB1rj16jFZNwQuke.NUV5Z'
replace a[15:18] > b[50:55] 'kb7' > 'jqD6d'
replace a[19:24] > b[56:60] 'cLMDD' > 'OH4p'
delete a[25:49] > b[61:61] 'ywZsCHoJagJlT50f1gw4.jaV' > ''
insert a[51:51] > b[63:158] '' > 'FcRE7bERjrvfoY4OTl5PAWUo70VNRIqnYOi_TQCSWzvrcCgfTtEFl_ZdMHRVLhosJLSFwHiPdVn4cXZ7VnF5xahstqJHD6E'
insert a[52:52] > b[159:243] '' > 'fd71iZT8HQBmx1dssd7RWA2Gdv8lGhJbS0ZeaxIVkfK5qaO.lxHVvG_9cq4weBdHeUQlGlIWhxKFYePkTr9J'
insert a[53:53] > b[244:276] '' > '0eN2yDTZljeX0XWWOxIjEkdj89FOqaND'
replace a[54:55] > b[277:291] 'c' > '2slUE.54oC96ba'
replace a[56:57] > b[292:311] 'Z' > 'e7lttoz_2AoTbjHSTjf'
insert a[58:58] > b[312:369] '' > 'h.eSyT6vA6.5dP5X.4XsFVYSnYKIznIdkjTURmm3kjvGM_iQoYT3V5gAK'
delete a[60:63] > b[371:371] 'T_p' > ''
delete a[64:74] > b[372:372] 'R_OPFgvOm_' > ''
replace a[75:84] > b[373:374] 'oM8PfyL3L' > 'r'
replace a[85:99] > b[375:376] 'xDPxFqgIqgwbVA' > 'c'
delete a[100:377] > b[377:377] 'w8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ' > ''
</code></pre>
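<p>If you only want a single number for how much two values differ, <code>SequenceMatcher.ratio()</code> gives a similarity score between 0 and 1. A small sketch, using shortened fragments of the page 2 and page 1504 values rather than the full strings (the helper name is mine):</p>

```python
import difflib


def token_similarity(a, b):
    """Return a similarity score in [0, 1]; 1.0 means the tokens are identical."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Shortened fragments of the page 2 and page 1504 GBK values shown above.
token_page_2 = "GBK-5lkb7acLMDDxywZs"
token_page_1504 = "GBK-59tY9cXfYPiYfpgB"

print(f"same token:       {token_similarity(token_page_2, token_page_2):.2f}")
print(f"different tokens: {token_similarity(token_page_2, token_page_1504):.2f}")
```

<p>A low score for two different pages is consistent with the value being regenerated for each page rather than incremented in a predictable way.</p>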
<p>The <code>GBK</code> value is created through this call in the page's HTML:</p>
<pre><code>javascript:commitForECMA(callbackC,"content.jsp?tableId=27&tableName=TABLE27&tableView=杩涘彛鍖荤枟鍣ㄦ浜у搧锛堟敞鍐?&Id=60456",null)
</code></pre>
<p>This is the JavaScript that gets called:</p>
<pre><code>function commitForECMA($_17, $_12, $_19) {
    request = createXMLHttp();
    request.onreadystatechange = $_17;
    if ($_19 == null) {
        _$b6(request, _$JI('ZM6r2MG'), _$JI("Op0YV"), $_12);
        request.setRequestHeader(_$JI("RACeXwDYXwcTV8Ur2"), _$JI("9wDYgwceLwDT7iCYX3Ce9FKyvHKwPFa"));
    } else {
        var $_16 = "";
        var $_11 = $_19.elements;
        var $_14 = $_11.length;
        for (var $_4 = 0; $_4 < $_14; $_4++) {
            var $_6 = _$kH($_11, $_4);
            if ($_6.type != _$JI("aQ6YPMK20") && _$kH($_6, _$JI('Cwbm7wKV')) != "") {
                if ($_16.length > 0) {
                    $_16 += "&" + $_6.name + "=" + _$kH($_6, _$JI('swbm7wKV'));
                } else {
                    $_16 += $_6.name + "=" + _$kH($_6, _$JI('8wbm7wKV'));
                }
                $_16 += _$JI("xx2J03Up2Hsl");
            }
        }
        _$b6(request, _$JI('iM6r2MG'), _$JI("IVlesYq"), $_12);
        $_16 = encodeURI($_16);
        $_16 = encodeURI($_16);
        request.setRequestHeader(_$JI("53CmOFDVz3CeXwoxBMq"), _$JI("wMbZz3CmOFDV"));
        request.setRequestHeader(_$JI("ZACeXwDYXwcTV8Ur2"), _$JI("F3UraMD2O3UpNMCgB8cT6w6QzRbenM1TTQbS2MbJBRDY9"));
    }
    request.send($_16);
    if ($_19 != null) {
        $_19.reset();
    }
}
truncated....
function createXMLHttp() {
    if (window.XMLHttpRequest) {
        return new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        var $_17 = [_$JI("5sYJ3sVanh2fJslf0woqXJ1ga"), _$JI("osYJ3sVanh2fJslf0woqXJcga"), _$JI("ZsYJ3sVanh2fJslf0woqXWnga"), _$JI("fsYJ3sVanh2fJslf0woq"), _$JI("3sK2OQbeuMCR0h2fJslf0woq")];
        for (var $_16 = 0; $_16 < $_17.length; $_16++) {
            try {
                return new ActiveXObject(_$kH($_17, $_16));
            } catch ($_19) {}
        }
        throw new Error("您的浏览器不支持访问此网页");
    }
}
truncated....
function callback() {
    if (request.readyState == 1) {
        _$_J(document.getElementById(_$JI("x3CeXwDYXwq")), '=', _$JI('3FKyXRUxEYlTW'), _$JI("EHDxnHOaB3vE5HDxnHOSNMKQGQ6xOHK2z3Kw2Qne7MCm9FKyvhbwNROg"));
    }
    if (request.readyState == 4) {
        if (request.status == 200) {
            oldContent.length = 0;
            oldContent[0] = request.responseText;
            _$_J(document.getElementById(_$JI("H3CeXwDYXwq")), '=', _$JI('OFKyXRUxEYlTW'), request.responseText);
            request = null;
        } else {
            _$_J(document.getElementById(_$JI("w3CeXwDYXwq")), '=', _$JI('eFKyXRUxEYlTW'), "<br><br><br><span style=font-size:x-large;color:#215add>请勿频繁操作!</span>");
        }
    }
}
</code></pre>
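<p>I wanted to see whether it was possible to obtain the <code>GBK</code> value and somehow pass it back through Python <code>Requests</code>.</p>
<p>For example, this code gives me a status code of 202:</p>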
<pre><code>import requests
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
}
conn_timeout = 30
read_timeout = 60
timeouts = (conn_timeout, read_timeout)
url = 'http://app1.nmpa.gov.cn/data_nmpa/face3/base.jsp?tableId=27&tableName=TABLE27&title=%E8%BF%9B%E5%8F%A3%E5%8C%BB%E7%96%97%E5%99%A8%E6%A2%B0%E4%BA%A7%E5%93%81%EF%BC%88%E6%B3%A8%E5%86%8C&bcId=152904442584853439006654836900'
response = requests.get(url, headers=headers, timeout=timeouts)
print(response.status_code)
# output
202
print(response.cookies)
# output
<RequestsCookieJar[<Cookie acw_tc=3ccdc15616274084596245338e08543386ba17e301ed12362cb2860b0af57f for app1.nmpa.gov.cn/>, <Cookie neCYtZEjo8GmS=5oWYI0i1mRB70b.XyuRJwTdiW_WgqfsIoOM8LNI8nfdTGyX4kfKTl0TDpV5HSMj2KIpgl8ircG4c9uAz_u50UkG for app1.nmpa.gov.cn/>]>
for key, value in response.headers.items():
    print(f'Key: {key} Value: {value}')
# output
Key: Date Value: Tue, 24 Jul 2021 17:54:19 GMT
Key: Content-Type Value: text/html; charset=utf-8
Key: Transfer-Encoding Value: chunked
Key: Connection Value: keep-alive
Key: Set-Cookie Value: acw_tc=3ccdc15616274084596245338e08543386ba17e301ed12362cb2860b0af57f;path=/;HttpOnly;Max-Age=1800, neCYtZEjo8GmS=5oWYI0i1mRB70b.XyuRJwTdiW_WgqfsIoOM8LNI8nfdTGyX4kfKTl0TDpV5HSMj2KIpgl8ircG4c9uAz_u50UkG; Path=/; expires=Fri, 25 Jul 2031 17:50:29 GMT; HttpOnly
Key: Server Value: ******
Key: Pragma Value: no-cache
Key: Cache-Control Value: no-store
Key: Expires Value: Tue, 24 Jul 2021 17:50:29 GMT
</code></pre>
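<p>Receiving status code 202 when accessing the site with Python <code>Requests</code> is a problem, because it means the request was accepted but had not been fully processed by the time the connection closed.</p>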
<blockquote>
<p>202 Accepted</p>
<p>The request has been accepted for processing, but the processing has not been completed. The request might or might not eventually be acted upon, as it might be disallowed when processing actually takes place. There is no facility for re-sending a status code from an asynchronous operation such as this.</p>
<p>The 202 response is intentionally non-committal. Its purpose is to allow a server to accept a request for some other process (perhaps a batch-oriented process that is only run once per day) without requiring that the user agent's connection to the server persist until the process is completed. The entity returned with this response SHOULD include an indication of the request's current status and either a pointer to a status monitor or some estimate of when the user can expect the request to be fulfilled.</p>
</blockquote>
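<p>In practice this means a scraper should not treat every 2xx response as success. A minimal sketch of that distinction using the standard library's <code>http.HTTPStatus</code> (the <code>classify_status</code> helper and its labels are my own naming):</p>

```python
from http import HTTPStatus


def classify_status(status_code):
    """Map an HTTP status code to 'ok', 'pending', or 'error'."""
    if status_code == HTTPStatus.OK:
        return "ok"
    if status_code == HTTPStatus.ACCEPTED:
        # 202: accepted for processing, but the body is not the final content.
        return "pending"
    return "error"


for code in (200, 202, 403):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

<p>With the response above, <code>classify_status(response.status_code)</code> would return <code>"pending"</code>, so the body should not be parsed as if it were the table.</p>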
<p>In the browser I noticed that I get these items:</p>
<p><a href="https://i.stack.imgur.com/Barrh.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/Barrh.png" alt="enter image description here"/></a></p>
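<p>I am still exploring how to pass these items in a Python <code>Requests</code> POST so that I get useful information back.</p>
<p><em><strong>I had not considered using <code>selenium</code>, but for your problem that package looks like the best option for extracting the dynamically created content from this website.</strong></em></p>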