Python Scrapy dynamic items with XPath grouping

Published 2024-09-30 00:25:01


My page looks like this:

<div style="width:100%;" id="innerTSpec">
    <table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth">
        <tr><td ></td><td class="techspecheading"> Header1</td></tr>
        <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
        <tr><td ></td><td class="techspecheading"> </td></tr>
        <tr><td ></td><td class="techspecdata"> My Attribute1: </td><td width="10px"></td><td class="techspecdata"> Value1 </td></tr>
        <tr><td ></td><td class="techspecheading"> </td></tr>
        <tr><td ></td><td class="techspecdata"> My Attribute2: </td><td width="10px"></td><td class="techspecdata"> Value2 </td></tr>
        <tr><td ></td><td class="techspecheading"> </td></tr>
   ---> <tr><td ></td><td class="techspecheading"> <hr></td></tr>
        <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
        <tr><td ></td><td class="techspecheading"> Header2</td></tr>
        <tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
        <tr><td ></td><td class="techspecheading"> </td></tr>
        <tr><td ></td><td class="techspecdata"> My Attribute3: </td><td width="10px"></td><td class="techspecdata"> More Value1 </td></tr>
        <tr><td ></td><td class="techspecheading"> </td></tr>
        <tr><td ></td><td class="techspecdata"> My Attribute4: </td><td width="10px"></td><td class="techspecdata"> More Value2 </td></tr>
        <tr><td ></td><td class="techspecheading"> </td></tr>
        <tr><td ></td><td class="techspecdata"> My Attribute5: </td><td width="10px"></td><td class="techspecdata"> More Value3 </td></tr>
   ---> <tr><td ></td><td class="techspecheading"> <hr></td></tr>
    </table>
</div>

The headers and attributes are not in fixed positions; they change from page to page. I am trying to produce output like this:

Header1              | Header2                   | ...
----------------------------------------------------------
My Attribute1:Value1 | My Attribute3:More Value1 | ...
My Attribute2:Value2 | My Attribute4:More Value2 | ...
                     | My Attribute5:More Value3 | ...

Note: I am using a dynamic item, so fields are added as shown below.

My Item is as below
--------------------------------------
class Website(Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value
--------------------------------------
and in spider adding as below
--------------------------------------
item[Heading]=Body.xpath('..........').extract()

1 answer
User
#1 · Posted 2024-09-30 00:25:01

I don't have scrapy installed, but I think you can easily adapt this to use scrapy's Items.

from lxml.html import fromstring


html = """
<div style="width:100%;" id="innerTSpec">
        <table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth">
            <tr><td ></td><td  class="techspecheading">    Header1</td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute1: </td><td width="10px"></td><td class="techspecdata">    Value1    </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute2: </td><td width="10px"></td><td class="techspecdata">    Value2     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
 ->        <tr><td ></td><td  class="techspecheading">    <hr></td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">   Header2</td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute3: </td><td width="10px"></td><td class="techspecdata">    More Value1     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute4: </td><td width="10px"></td><td class="techspecdata">    More Value2     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">   My Attribute5: </td><td width="10px"></td><td class="techspecdata">    More Value3     </td></tr>
 ->        <tr><td ></td><td  class="techspecheading">    <hr></td></tr>
        </table>
    </div>
"""
body = fromstring(html)

heading = None
item = {}
for tr in body.xpath(r'//div[@id="innerTSpec"]//tr'):
    # Extract row data. Skip rows without data.
    data = tr.xpath(r'.//td[@class]/text()')
    data = list(filter(None, [txt.strip() for txt in data]))
    if not data:
        continue

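    # Populate item. A row with a single non-empty cell is a section heading;
    # any other row is an attribute/value pair filed under the latest heading.
    if len(data) == 1:
        heading = data[0]
    else:
        item.setdefault(heading, []).append(''.join(data))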
print(item)

Resulting item:

{
    'Header1': ['My Attribute1:Value1', 'My Attribute2:Value2'],
    'Header2': ['My Attribute3:More Value1', 'My Attribute4:More Value2', 'My Attribute5:More Value3']
}
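
To get from this grouped dict to the side-by-side layout shown in the question, one option is to pivot the lists into rows. The following is only a sketch using the standard library: the helper names to_columns and format_table are made up for illustration and are not part of scrapy or lxml, and it assumes Python 3.7+ so that the dict keeps the headings in page order.

```python
from itertools import zip_longest


def to_columns(item):
    """Pivot {heading: [values]} into a list of headings plus padded rows."""
    headings = list(item)
    # zip_longest pads the shorter groups with '' so that every row
    # has exactly one cell per heading.
    rows = [list(row) for row in zip_longest(*item.values(), fillvalue='')]
    return headings, rows


def format_table(headings, rows):
    """Render the pivoted data as pipe-separated, width-aligned text lines."""
    # Width of each column = longest cell in that column, heading included.
    widths = [
        max(len(cell) for cell in column)
        for column in zip(headings, *rows)
    ]
    lines = []
    for row in [headings] + rows:
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(' | '.join(cells).rstrip())
    return lines


# Same data as the item produced by the loop above.
item = {
    'Header1': ['My Attribute1:Value1', 'My Attribute2:Value2'],
    'Header2': ['My Attribute3:More Value1', 'My Attribute4:More Value2',
                'My Attribute5:More Value3'],
}
headings, rows = to_columns(item)
table_lines = format_table(headings, rows)
print('\n'.join(table_lines))
```

Header1 has only two attributes, so its cell in the third row is left empty, which matches the blank first column in the desired output.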
