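We are Python beginners. We have a list of links/websites containing Donald Trump's statements. Each link represents one complete interview, speech, etc. We now want to visit these sites, scrape them, and create one text file per link. At the moment our code does this for 2 or 3 links and then just shows the following error: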
Traceback (most recent call last):
File "C:\Users\Lotte\AppData\Local\Programs\Python\Python37\Code\Corpus_create\Scrapen und alle inhalte laden und speichern - zusammengefügt.py", line 79, in <module>
Trump=(tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
IndexError: list index out of range
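We tried using the index element, tried [0], and even tried leaving the index out. Nothing worked. We then ran the code with only one link and without the first loop, and that worked fine.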
import lxml
from lxml import html
from lxml.html import fromstring
import requests
import re
Linklist=['https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019', 'https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019', 'https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018', 'https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018', 'https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018', 'https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018', 'https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018', 'https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018', 'https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018']
for item in Linklist:
    headers= {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
    page = requests.get(item, headers=headers)
    tree = html.fromstring(page.content)
    #loads everything trump said
    Text=[]
    for item2 in range(len(tree.xpath('//div[@class="media topic-media-row mediahover "]'))):
        Trump=(tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
        Text.append(Trump)
    print(Text, '\n')
We just want the statements from each link.
Here is a modified version of your script.
code.py:
Notes:
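The problem is that the 3rd URL differs slightly from the others: if you look at it, it has no YouTube video, so the XPath does not match. Combined with the missing empty-list test, that produced the exception above. Now two patterns are tried:
When one pattern yields some results, the remaining ones (if any) are simply ignored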
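As a rough sketch of the two-pattern idea from the notes above (not necessarily the exact code of the modified script), the lookup could be structured as below. Everything named here is an assumption for illustration: the helper `extract_statements`, the `PATTERNS` list, the second path (a guess at the layout of pages without a video) and the sample markup. To keep the sketch runnable offline it uses `findall()` and `itertext()` on the standard library's `xml.etree.ElementTree`; lxml elements offer the same two methods, so a tree from `html.fromstring(page.content)` should be accepted as well.

```python
import xml.etree.ElementTree as ET

ROW = './/div[@class="media topic-media-row mediahover "]'

# Only the first path comes from the question. The second one is a
# hypothetical layout for pages that have no embedded video.
PATTERNS = [
    ROW + "/div[3]/div/div[2]/a",
    ROW + "/div[2]/div/div[2]/a",
]


def extract_statements(tree, patterns=PATTERNS):
    """Return the texts found by the first pattern that matches, else []."""
    for pattern in patterns:
        nodes = tree.findall(pattern)
        if nodes:  # empty-list test: a pattern that matches nothing is skipped
            # Iterate over the matches directly instead of indexing into them.
            return ["".join(node.itertext()).strip() for node in nodes]
    return []


# Made-up markup that mimics the two layouts (placeholder text only).
WITH_VIDEO = """<html><body>
<div class="media topic-media-row mediahover ">
  <div>speaker</div><div>video</div>
  <div><div><div>time</div><div><a>first statement</a></div></div></div>
</div>
<div class="media topic-media-row mediahover ">
  <div>speaker</div><div>video</div>
  <div><div><div>time</div><div><a>second statement</a></div></div></div>
</div>
</body></html>"""

WITHOUT_VIDEO = """<html><body>
<div class="media topic-media-row mediahover ">
  <div>speaker</div>
  <div><div><div>time</div><div><a>only statement</a></div></div></div>
</div>
</body></html>"""

for name, markup in [("with video", WITH_VIDEO), ("without video", WITHOUT_VIDEO)]:
    statements = extract_statements(ET.fromstring(markup))
    print(name, len(statements), statements)
```

Because the function loops over whatever a pattern returns, it never indexes past the end of a shorter list, which is what raised the `IndexError` in the original loop. A page that matches neither pattern simply yields an empty list.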
Output (showing the article count for each URL):