Selenium iteration over a paginator: optimization

while True:
    propiedades = driver.find_elements_by_xpath("//*[@class='hlisting']")
    info_propiedades = [propiedad.find_element_by_xpath(".//*[@class='propertyInfo item']") for propiedad in propiedades]
    for propiedad in info_propiedades:
        try:
            link = [l.get_attribute("href") for l in propiedad.find_elements_by_xpath(".//a")]
            thelink = link[0]
            id_ = thelink.split("id-")[-1]
            with open(os.path.join(linkspath, id_), "w") as f:
                f.write(link[0])
            numlinks += 1
        except:
            print("link not found")
    siguiente = driver.find_element_by_id("paginador_pagina_{0}".format(paginador))
    siguiente.click()  # goes to the next page
    while new_active_page == old_active_page:  # checks if page has loaded completely
        try:
            new_active_page = driver.find_element_by_class_name("pagina_activa").text
        except:
            new_active_page = old_active_page
        time.sleep(0.3)
    old_active_page = new_active_page
    paginador += 1

1 answer

User

#1 · Posted on 2024-09-30 01:32:15

A few suggestions...

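There are a lot of nested .find_elements_* calls at the start. You should be able to craft a single find that gets the elements you are looking for. From the site and your code, it looks like what you are after is a code that looks like "MC1595226". If you grab one of these MC codes and search the HTML for it, you will find it all over that particular listing: it is in the URL, it is part of the id of a bunch of elements, and so on. A faster way to find this code is to use the CSS selector "a[id^='btnContactResultados_']". It matches A tags whose id starts with "btnContactResultados_". The rest of that id is the MC number, e.g.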
```
<a id="btnContactResultados_MC1595226" ...>
```
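So with the CSS selector we find the desired elements, then grab the id, split it on "_", and take the last part. Note: this is more about making the code efficient. I don't think it will make your script super fast, but it should speed up the searching part somewhat.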
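The id-splitting step can be tried without a browser. The following is a minimal sketch; the helper name `extract_code` and every sample id other than "btnContactResultados_MC1595226" are made up for illustration.

```python
def extract_code(element_id, prefix="btnContactResultados_"):
    """Return the code that follows the prefix, or None if the id does not match."""
    if not element_id.startswith(prefix):
        return None
    # Same operation as in the answer: split on "_" and keep the last part.
    return element_id.split("_")[-1]


# Sample ids; only the first one comes from the page discussed above.
ids = [
    "btnContactResultados_MC1595226",
    "paginador_pagina_2",
    "btnContactResultados_MC1600001",  # hypothetical code
]
codes = [code for code in map(extract_code, ids) if code is not None]
print(codes)
```

In the real script the CSS selector already filters by prefix, so the prefix check here only guards against ids coming from somewhere else.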
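I would suggest writing one log per page, and writing it only once per page. So basically you process the codes on the page and append the results to a list. Once all the codes on the page have been processed, write that list to the log. Writing to disk is slow... you should do it as little as possible. At the end, you can write a small script that opens all these files and appends them, so you get the final product in a single file. You could also do something in between, where you still write only once per page but put 100 pages into one file before closing it and moving on to another. You will have to play with these settings to see where you get the best performance.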
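A rough sketch of the "one write per page, merge at the end" idea, using only the standard library. The function names, the `page_NNNNN.txt` naming scheme, and all codes except "MC1595226" are assumptions for illustration, not something taken from the question.

```python
import os
import tempfile


def write_page_log(log_dir, page_number, codes):
    """Write all codes collected from one page with a single write call."""
    # Zero-padded names keep the files in page order when sorted by name.
    path = os.path.join(log_dir, "page_{0:05d}.txt".format(page_number))
    with open(path, "w") as f:
        f.write("\n".join(codes) + "\n")
    return path


def merge_logs(log_dir, merged_path):
    """Concatenate the per-page files, in page order, into one file."""
    names = sorted(n for n in os.listdir(log_dir) if n.startswith("page_"))
    with open(merged_path, "w") as out:
        for name in names:
            with open(os.path.join(log_dir, name)) as f:
                out.write(f.read())
    return len(names)


# Demo with temporary directories and made-up codes.
log_dir = tempfile.mkdtemp()
write_page_log(log_dir, 1, ["MC1595226", "MC0000002"])
write_page_log(log_dir, 2, ["MC0000003"])

merged = os.path.join(tempfile.mkdtemp(), "all_codes.txt")
page_count = merge_logs(log_dir, merged)
with open(merged) as f:
    merged_codes = f.read().split()
print(page_count, merged_codes)
```

The 100-pages-per-file variant would follow the same pattern, with the file name derived from `page_number // 100` and the file opened in append mode.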

Combining the logic of these two suggestions, we get something like this...

import os
import time

while True:
    # One query per page: anchors whose id starts with "btnContactResultados_"
    links = driver.find_elements_by_css_selector("a[id^='btnContactResultados_']")

    # The part of the id after the last "_" is the MC code
    codes = [link.get_attribute("id").split("_")[-1] for link in links]

    # Write once per page; the file name must be a string and write() needs text
    with open(os.path.join(linkspath, str(paginador)), "w") as f:
        f.write("\n".join(codes))
    driver.find_element_by_link_text("Siguiente »").click()  # this should work

    while new_active_page == old_active_page:  # checks if page has loaded completely
        try:
            new_active_page = driver.find_element_by_class_name("pagina_activa").text
        except Exception:
            new_active_page = old_active_page
        time.sleep(0.3)
    old_active_page = new_active_page
    paginador += 1
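
The inner wait loop never gives up if the active page indicator stops changing, for example on the last page. One possible way to bound it is a polling helper with a timeout. This is only a sketch: `wait_for_change` and its parameters are hypothetical names, and the lambdas stand in for reading `driver.find_element_by_class_name("pagina_activa").text`.

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, interval=0.3):
    """Poll read_value() until it differs from old_value.

    Returns the new value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            value = read_value()
        except Exception:
            value = old_value  # element not available yet; treat as unchanged
        if value != old_value:
            return value
        time.sleep(interval)
    return None


# Stand-in for the page indicator: it reports "1" twice, then "2".
readings = iter(["1", "1", "2"])
result = wait_for_change(lambda: next(readings), "1", timeout=2.0, interval=0.01)

# An indicator that never changes ends in a timeout instead of hanging.
timed_out = wait_for_change(lambda: "1", "1", timeout=0.05, interval=0.01)
print(result, timed_out)
```

In the scraping loop, a `None` result could be used as the signal to `break` out of `while True`. Selenium's own `WebDriverWait` addresses the same need and may be a better fit than a hand-rolled helper.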

Note: Python is not my native language... I'm more of a Java/C person, so you may find errors, inefficiencies, or non-Pythonic code here. You have been warned... :)
