<p>It would be great if you could help me improve my code and solve the following two problems:</p>
<ul>
<li>one of the ids in <code>['DE101096','AT231']</code> is ignored when the process starts</li>
<li>when crawling with <code>scrapy crawl euetsbotdet -o results.csv</code>, the resulting csv is formatted as follows:</li>
</ul>
<p>transactionID transactionDate tra_id acq_id</p>
<p>DE101096 2011-02-21 11:05:23.312<br/>
DE101096 2011-02-21 11:05:23.312 Anlagenkonto Oxyfuelanlage<br/>
DE101096 2011-02-21 11:05:23.312 Anlagenkonto Oxyfuelanlage Nationalkonto–Ausgabe</p>
<p>Obviously, I would like to end up with just one row holding transactionID, transactionDate, acq_id and tra_id. I know the problem clearly lies in my code, which passes transactionID and transactionDate on to items that get duplicated by the two subsequent requests. However, I have not found any solution that comes close to the expected output.</p>
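<p>The closest I have come is the idea of chaining the two account requests, so the item is only yielded once, after both ids have been filled in. A rough sketch of that idea (assuming, which may be wrong, that each transaction page has exactly one transferring and one acquiring account link; the XPaths are copied from my code below):</p>
<pre><code># sketch: these three methods would replace parseDetail / parseAccounttr /
# parseAccountac in the spider below
def parseDetail(self, response):
    sel = Selector(response)
    item = TransactionItem()
    item['transactionID'] = sel.xpath('//table/tr/td/input[@name="transactionID"]/@value').extract()
    item['transactionDate'] = sel.xpath('//table/tr/td/input[@name="transactionDate"]/@value').extract()
    # collect both account links up front instead of firing two
    # independent requests that each yield a half-filled copy
    tra = LinkExtractor(unique=True, restrict_xpaths=(
        '//*[@id="tblTransactionBlocksInformation"]/tr/td[6]/a[@class="resultlink"]')).extract_links(response)
    acq = LinkExtractor(unique=True, restrict_xpaths=(
        '//*[@id="tblTransactionBlocksInformation"]/tr/td[7]/a[@class="resultlink"]')).extract_links(response)
    if tra and acq:
        yield Request(tra[0].url,
                      meta={'item': item, 'acq_url': acq[0].url},
                      callback=self.parseAccounttr)

def parseAccounttr(self, response):
    item = response.meta['item']
    item['tra_id'] = Selector(response).xpath(
        '//*[@id="tblAccountInfoReadonly"]/tr/td/input[@name="identifierInReg"]/@value').extract()
    # pass the same item on to the acquiring-account page
    yield Request(response.meta['acq_url'], meta={'item': item},
                  callback=self.parseAccountac)

def parseAccountac(self, response):
    item = response.meta['item']
    item['acq_id'] = Selector(response).xpath(
        '//*[@id="tblAccountInfoReadonly"]/tr/td/input[@name="identifierInReg"]/@value').extract()
    yield item  # only one, complete item per transaction
</code></pre>
<p>I am not sure this is idiomatic Scrapy, though.</p>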
<p>How can I solve the problems above, and how can I make my spider more efficient? I also tried a rule-based approach, but without success.</p>
<p>I am happy about any input!</p>
<pre><code>import csv
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import FormRequest, Request
from etsbot.items import TransactionItem
from etsbot.middlewares import RandomProxy

class EuetsbotdetSpider(CrawlSpider):
    name = 'euetsbotdet'
    allowed_domains = ['ec.europa.eu']
    start_urls = [
        'http://ec.europa.eu/environment/ets/transaction.do'
    ]

    def parse(self, response):
        #self.data = csv.DictReader(open('/home/...t/items.csv','r'))
        #self.tids = []
        #for self.row in self.data:
        #    self.tids.append(self.row['transactionID'])
        self.tids = ['DE101096','AT231']
        for self.id in self.tids:
            return FormRequest.from_response(
                response,
                formname='transactions_maxlength',
                formdata={'transactionID': self.id},
                clickdata={'name': 'search'}, callback=self.parseLinks
            )

    def parseLinks(self, response):
        lex = LinkExtractor(allow=('http://ec.europa.eu/environment/ets/singleTransaction.do',), unique=True)
        for l in lex.extract_links(response):
            yield Request(l.url, method='GET', callback=self.parseDetail)

    def parseDetail(self, response):
        sel = Selector(response)
        item = TransactionItem()
        item['transactionID'] = sel.xpath('//table/tr/td/input[@name="transactionID"]/@value').extract()
        item['transactionDate'] = sel.xpath('//table/tr/td/input[@name="transactionDate"]/@value').extract()
        lext = LinkExtractor(unique=True, restrict_xpaths=('//*[@id="tblTransactionBlocksInformation"]/tr/td[6]/a[@class="resultlink"]'),)
        for l in lext.extract_links(response):
            yield Request(l.url, method='GET', meta={'item': item}, callback=self.parseAccounttr)
        lexa = LinkExtractor(unique=True, restrict_xpaths=('//*[@id="tblTransactionBlocksInformation"]/tr/td[7]/a[@class="resultlink"]'),)
        for l in lexa.extract_links(response):
            yield Request(l.url, method='GET', meta={'item': item}, callback=self.parseAccountac)
        yield item

    def parseAccounttr(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item['tra_id'] = sel.xpath('//*[@id="tblAccountInfoReadonly"]/tr/td/input[@name="identifierInReg"]/@value').extract()
        yield item

    def parseAccountac(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item['acq_id'] = sel.xpath('//*[@id="tblAccountInfoReadonly"]/tr/td/input[@name="identifierInReg"]/@value').extract()
        yield item
</code></pre>
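<p>Regarding the first bullet point: I suspect the <code>return</code> inside the loop in <code>parse</code> is the culprit, since it exits the method after handing back the request for the first id only. A minimal sketch of the variant I have in mind, with <code>yield</code> instead:</p>
<pre><code>def parse(self, response):
    # yielding one FormRequest per id keeps the loop running,
    # whereas return ends parse after the first id
    for tid in ['DE101096', 'AT231']:
        yield FormRequest.from_response(
            response,
            formname='transactions_maxlength',
            formdata={'transactionID': tid},
            clickdata={'name': 'search'},
            callback=self.parseLinks,
        )
</code></pre>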
<p><strong>EDIT:</strong></p>
<p>With the help of paultrmbth's great comments I rewrote my code. Instead of splitting the downloads into two sets as in the code above, I now do all downloads in one flow. This means that when I crawl the spider with <code>-o</code>, I get two rows for each transactionID/transactionDate: the first one is the "seller" and the second one the "buyer". Obviously this information should go into a single row. My current idea is to correct this automatically in a post-processing step, i.e. to merge every odd item with the subsequent even item via transactionID/transactionDate (I hope that is clear). But how would I do that?</p>
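<p>For the post-processing I was thinking of something along these lines, assuming the feed is exported to <code>items.csv</code> (a hypothetical file name) and that tra_id and acq_id end up in separate, partly empty columns:</p>
<pre><code>import pandas as pd

# collapse the seller/buyer row pairs: within each
# transactionID/transactionDate group, first() picks the first
# non-null value per column, so the two half-filled rows merge into one
df = pd.read_csv('items.csv')
merged = (df.groupby(['transactionID', 'transactionDate'], sort=False)
            .first()
            .reset_index())
merged.to_csv('items_merged.csv', index=False)
</code></pre>
<p>Would that be a sensible way to do the merge, or is there a more Scrapy-native solution?</p>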