Web scraping with Selenium and XPath

Posted 2024-10-04 01:35:18

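I'm new to XPath. I'm trying to get the name and values of each element from a stock website. In my Python Selenium script, I have reproduced the main part of the web page locally in html_content, as shown below: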

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
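# Install directory; every backslash is escaped ("\m" is not a valid escape sequence)
dirinstall = "C:\\Program Files (x86)\\www\\mm\\"
chrome_driver = dirinstall + "Webdriver\\chromedriver.exe"
options = Options()
# Selenium 3 style: the chromedriver path is passed as the first argument
driver = webdriver.Chrome(chrome_driver, options=options)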
html_content = """
<html class="ng-scope">
<head data-meta-tags="">
    <title> Stock NYSE </title>
    <ui-layout class="ng-isolate-scope">
        <div data-ng-include="" src="layoutCtrl.template" class="ng-scope">
            <app-root class="ng-scope" _nghost-rqp-c0="" ng-version="8.2.14"></app-root>
            <div ng-class="{'demo-mode': $root.session.user.portfolio.account.type === 'Demo' }" class="ng-scope">
                <div ng-view="" ng-class="layoutCtrl.isBannerShown ? 'banner-shown' : ''" class="main-app-view ng-scope" role="main">
                    <et-discovery-markets-results class="ng-scope" _nghost-rqp-c42="" ng-version="8.2.14">
                        <div _ngcontent-rqp-c42="" class="discover main-content no-footer" ui-fun-scroll="{'class': 'minimize', 'classEl': '.user-head-wrapper, .table-discover', 'scrollContainer': '.table-discover', 'setClassAtScroll': 200 }">
                            <div _ngcontent-rqp-c42="" automation-id="discover-market-results-wrapp" class="table-discover markets-table">
                                <et-discovery-markets-results-list _ngcontent-rqp-c42="" automation-id="discover-market-results-sub-view-list" _nghost-rqp-c44="" class="ng-star-inserted">
                                    <div _ngcontent-rqp-c44="" class="market-list list-view" data-etoro-locale-ns="discoverMarketResultsList">
                                        <et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
                                            <et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
                                                <div _ngcontent-rqp-c47="" class="row-wrap">
                                                    <div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
                                                        <div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
                                                        <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">A</div>
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name positive"> 0.68 (0.90%) </div>
                                                        </div>
                                                    </div>
                                                    <et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label positive-change" automation-id="buy-sell-button-container-sell">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">75.<span class="after-decimal">85</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                        <div _ngcontent-rqp-c24="" class="space-gap"></div>
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">76.<span class="after-decimal">03</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                    </et-buy-sell-buttons>
                                                </div>
                                                <et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
                                                </et-trade-item-card-action>
                                            </et-instrument-trading-mobile-row>
                                        </et-instrument-mobile-row>
                                        <et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
                                            <et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
                                                <div _ngcontent-rqp-c47="" class="row-wrap">
                                                    <div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
                                                        <div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
                                                        <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">AA</div>
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name negative"> -0.11 (-1.46%) </div>
                                                        </div>
                                                    </div>
                                                    <et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-sell">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">44</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                        <div _ngcontent-rqp-c24="" class="space-gap"></div>
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">47</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                    </et-buy-sell-buttons>
                                                </div>
                                                <et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
                                                </et-trade-item-card-action>
                                            </et-instrument-trading-mobile-row>
                                        </et-instrument-mobile-row>
                                    </div>
                                </et-discovery-markets-results-list>
                            </div>
                        </div>
                    </et-discovery-markets-results>
                </div>
            </div>
        </div>
    </ui-layout>
    </body>

</html>
"""

driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
#results = driver.find_elements_by_xpath("//*[@class='ng-star-inserted']")
results = driver.find_elements_by_xpath("//*[et-instrument-mobile-row and @class='ng-star-inserted']")
print('Number of results', len(results))

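I don't understand why I get only 1 element instead of 2 when I search for "et-instrument-mobile-row", and 0 elements when I search for both "et-instrument-mobile-row" and "ng-star-inserted". Looking at this example, my goal is to get the symbol and the current sell/buy values (the price together with the digits after the decimal point).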

For example:

[A,75.85,76.03]

[AA,7.44,7.47]

Can anyone help me? Thanks.


1 Answer
User
#1 · Posted 2024-10-04 01:35:18

It looks like you may have some malformed HTML that Selenium isn't sure how to parse. I noticed this line:

 <div _ngcontent-rqp-c47="" class="avatar-img-wrap"><img _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-avatar" class="avatar-img" src="https://etoro-cdn.etorostatic.com/market-avatars/a/150x150.png" alt="Agilent Technologies Inc">

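The <img> tag is not closed. You can see that the syntax highlighting here gets confused as well.

Otherwise, the XPath you are searching with is generally well-formed.

Edit: On closer inspection, the element (tag) name should go where the * is, not inside the square brackets. This is the XPath you want: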

"//et-instrument-mobile-row[@class='ng-star-inserted']"

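To illustrate why the original query returned 1 and then 0 elements: in XPath, a bare name inside square brackets tests for a child element with that name, so //*[et-instrument-mobile-row] selects the parent that contains the rows rather than the rows themselves. The sketch below is only an illustration under assumptions: it uses the limited XPath subset of Python's xml.etree.ElementTree instead of Selenium so that it runs without a browser, and the markup is a simplified, well-formed stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page structure (hypothetical markup).
SAMPLE = """
<root>
  <div class="market-list">
    <et-instrument-mobile-row class="ng-star-inserted">
      <div class="symbol">A</div>
    </et-instrument-mobile-row>
    <et-instrument-mobile-row class="ng-star-inserted">
      <div class="symbol">AA</div>
    </et-instrument-mobile-row>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)

# A bare name inside [...] tests for a CHILD element, so this matches the
# single parent <div> that contains the rows, not the rows themselves.
parents = root.findall(".//*[et-instrument-mobile-row]")

# Adding the class test narrows the same parent match: the parent's class is
# "market-list", so nothing is left.
both = root.findall(".//*[et-instrument-mobile-row][@class='ng-star-inserted']")

# Putting the tag name in place of * selects the rows directly.
rows = root.findall(".//et-instrument-mobile-row[@class='ng-star-inserted']")

print([p.tag for p in parents], len(both), len(rows))
```

The same reasoning applies to the full XPath used by Selenium, where the two conditions are joined with "and" inside a single pair of brackets.
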
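Edit 2: The asker had a follow-up question about how to search within the elements found by the XPath above.

To find further elements inside those elements, see the documentation: every Selenium WebElement provides its own find_element methods. You can use them to search further within the elements we just found (make sure to start the XPath with .// because you only want to traverse the contents of that particular element; the other find_element methods do not have this caveat).

Once you have identified the elements that contain the symbol and the prices, you can simply read the text attribute of those elements. Let's look at a simpler example: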

<div class="a">
  <div class="b" id="1">B</div>
  <div class="c" id="2">2</div>
  <div class="d" id="3">22</div>
</div>

Suppose we have already found the root div here and stored it in a variable named element. Then:

symbol = element.find_element_by_xpath(".//*[@class='b']").text
integral = element.find_element_by_xpath(".//*[@class='c']").text
fractional = element.find_element_by_xpath(".//*[@class='d']").text

In general, though, if you can search by some means other than XPath, it will be easier for everyone involved. Here is a more typical way to do this using class names:

symbol = element.find_element_by_class_name("b").text
integral = element.find_element_by_class_name("c").text
fractional = element.find_element_by_class_name("d").text
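
As a browser-free sketch of the scoped search described above, the following uses Python's xml.etree.ElementTree rather than Selenium, which is an assumption made only so the snippet runs on its own. The <body> wrapper, the second container, and the child_text helper are hypothetical additions for illustration. Each lookup starts from one container, so it only returns that container's children.

```python
import xml.etree.ElementTree as ET

# Two copies of the simple container from the answer (ids omitted), wrapped
# in a hypothetical <body> so that the page has more than one match.
SAMPLE = """
<body>
  <div class="a">
    <div class="b">B</div>
    <div class="c">2</div>
    <div class="d">22</div>
  </div>
  <div class="a">
    <div class="b">C</div>
    <div class="c">3</div>
    <div class="d">33</div>
  </div>
</body>
"""


def child_text(element, class_name):
    """Return the text of the first descendant of `element` with this class."""
    return element.find(".//*[@class='{}']".format(class_name)).text


page = ET.fromstring(SAMPLE)
records = []
for element in page.findall(".//div[@class='a']"):
    # The search starts at `element`, so it never sees the other container.
    records.append(tuple(child_text(element, name) for name in ("b", "c", "d")))

print(records)
```
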

Edit 3: Comment from the asker

With the valuable help of @firstbass, I drilled down to the symbol and the sell and buy prices as follows:

for element in results:
    symbol = element.find_element_by_xpath(".//*[@class='symbol']").text
    print(str(symbol))
    # The text of the price div already includes the nested after-decimal
    # span, so it is the full price (for example "75.85")
    sell = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-sell']")
    sell_price = sell.find_element_by_xpath(".//*[@class='price']").text
    print(str(sell_price))
    buy = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-buy']")
    buy_price = buy.find_element_by_xpath(".//*[@class='price']").text
    print(str(buy_price))
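
The price markup nests the after-decimal span inside the price div, so the text of the price div together with its descendants is the whole number. The sketch below shows that on a trimmed, well-formed copy of one row. It is an illustration under assumptions: it uses Python's xml.etree.ElementTree instead of a live browser, the markup is reduced to the parts that matter, and the helper names full_text and price are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for one row of the page (hypothetical markup).
ROW = """
<et-instrument-mobile-row class="ng-star-inserted">
  <div class="symbol">A</div>
  <div automation-id="buy-sell-button-container-sell">
    <div class="price">75.<span class="after-decimal">85</span></div>
  </div>
  <div automation-id="buy-sell-button-container-buy">
    <div class="price">76.<span class="after-decimal">03</span></div>
  </div>
</et-instrument-mobile-row>
"""


def full_text(element):
    """Join an element's own text with the text of all nested elements."""
    return "".join(element.itertext()).strip()


def price(row, side):
    """Return the full price string for side 'sell' or 'buy'."""
    xpath = ".//div[@automation-id='buy-sell-button-container-{}']".format(side)
    container = row.find(xpath)
    return full_text(container.find(".//div[@class='price']"))


row = ET.fromstring(ROW)
symbol = row.find(".//div[@class='symbol']").text
record = [symbol, price(row, "sell"), price(row, "buy")]
print(record)
```

This yields the symbol followed by the sell and buy prices, the shape asked for in the question.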
