Screen scraping in LXML with Python (extract-specific d)

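User

Reply 1 · edited 2024-06-26 10:34:44

If you don't need to do this with XPath, you can use the BeautifulSoup library like this (assuming the myXml variable holds the page's HTML source):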

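from bs4 import BeautifulSoup

soup = BeautifulSoup(myXml, 'html.parser')
# the tag name must be the string 'a'; a bare a is an undefined name
for a in soup.find_all('a', {'class': 'sqq'}):
    # this is your quote
    print(a.contents)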

Either way, read the BS documentation; it can be very useful for scraping tasks that don't need the power of XPath.
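
Note that a.contents is a list of child nodes rather than a plain string. A minimal sketch of the difference, assuming the bs4 package is installed; the HTML fragment below is made up for illustration and is not taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the downloaded page source
my_xml = '<p><a class="sqq" href="#">To be, or <i>not</i> to be</a></p>'

soup = BeautifulSoup(my_xml, 'html.parser')
link = soup.find_all('a', {'class': 'sqq'})[0]

# .contents keeps the child nodes separate (strings and tags)
child_count = len(link.contents)
# .get_text() joins the text of the tag and all of its descendants
quote = link.get_text()

print(child_count)
print(quote)
```

If only the quote text is needed, get_text() is usually the more convenient of the two.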

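User

Reply 2 · edited 2024-06-26 10:34:44

You can open the HTML source to find the exact class you are looking for. For example, to get the first StackOverflow username that appears on the page, you can do the following: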

#!/usr/bin/env python
from lxml import html

url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print(tree.findtext(path))
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print(a.text_content())
# -> Parseltongue
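
The two calls give the same result here only because the link contains no child elements. A small offline sketch of where they differ, using a made-up HTML string (the markup and the /users/1 link are assumptions for illustration, not the real page):

```python
import lxml.html

# Hypothetical page: the link text is split by a child <b> element
page = ('<html><body><div class="user-details">'
        '<a href="/users/1">Parsel<b>tongue</b></a>'
        '</div></body></html>')

root = lxml.html.fromstring(page)
path = './/div[@class="user-details"]/a[@href]'

# findtext returns only the text directly inside <a>, before its first child
direct = root.findtext(path)
# text_content also includes the text of all descendant elements
full = root.find(path).text_content()

print(direct)
print(full)
```

So text_content() is likely the safer choice when the target element may contain nested markup.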

User

Reply 3 · edited 2024-06-26 10:34:44

import lxml.html
from urllib.parse import urlencode

site = 'http://thinkexist.com/search/searchquotation.asp'

userInput = input('Search for: ').strip()
url = site + '?' + urlencode({'search': userInput})

root = lxml.html.parse(url).getroot()
# select every <a> whose class attribute is exactly "sqq"
quotes = root.xpath('//a[@class="sqq"]')

print(quotes[0].text_content())

...and if you enter "Shakespeare", it returns

^{pr2}$
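
The URL-building step can be checked on its own without any network access. A minimal sketch using only the standard library; build_search_url is a hypothetical helper name introduced here, not part of the original answer:

```python
from urllib.parse import urlencode

SITE = 'http://thinkexist.com/search/searchquotation.asp'

def build_search_url(term):
    """Return the search URL for a term (hypothetical helper for illustration)."""
    # urlencode turns spaces into '+' and percent-encodes other unsafe characters
    return SITE + '?' + urlencode({'search': term.strip()})

url = build_search_url('  william shakespeare ')
print(url)
```

Going through urlencode rather than plain string concatenation keeps input that contains spaces or characters such as & from breaking the query string.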
