Using BeautifulSoup in Python to iterate over non-href links in XML and retrieve specific information

from bs4 import BeautifulSoup
import requests
import re

resultsdict = {}
companyname = []

url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)

# find the links to companies
company_menu = bs.find_all('loc')
for company in company_menu:
    print company.contents

2 Answers

User

#1 · Edited 2024-09-30 12:18:47

There is no need to use BeautifulSoup for this. The site returns perfectly valid XML, which can be parsed with the tools that ship with Python:

import requests
import xml.etree.ElementTree as et

req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for i in root:
    print i[0].text  # the <loc> text
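
The snippet above is written for Python 2 and relies on <loc> being the first child of every entry. A minimal Python 3 sketch that looks the element up by name instead might look like the following. It assumes the document declares the standard sitemap namespace (http://www.sitemaps.org/schemas/sitemap/0.9), which has not been verified for this particular site. The names extract_locs and SAMPLE are made up for illustration, and the sample data stands in for the real response so the example runs offline.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data standing in for the real HTTP response body.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc><lastmod>2014-01-01</lastmod></url>
  <url><lastmod>2014-01-02</lastmod><loc>http://example.com/b</loc></url>
</urlset>"""


def extract_locs(xml_bytes):
    """Return the text of every <loc> under a <url> entry, in document order."""
    root = ET.fromstring(xml_bytes)
    # Look up <loc> by qualified name rather than by position (i[0]).
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in extract_locs(SAMPLE):
        print(url)
```

With the real site, the bytes passed to extract_locs would be req.content from the requests call shown above.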

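User

#2 · Edited 2024-09-30 12:18:47

From your request, you want to get the URLs from the XML, but you are looking for the CSS tags that format the XML... that is the wrong approach.

Try this: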

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2 
from BeautifulSoup import BeautifulSoup

url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'

f = urllib2.urlopen(url1)

bs = BeautifulSoup(f)

for url in bs.findAll("loc"):
    print url.string

Note that I use the findAll() method and look for the "loc" tags, which contain the data you want to retrieve.
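
The code in this answer targets Python 2 and BeautifulSoup 3 (urllib2, the BeautifulSoup module, the print statement). A rough Python 3 equivalent with the bs4 package could look like the sketch below. It is only an illustration: loc_urls and SAMPLE are hypothetical names, the sample markup replaces the network call, and "html.parser" is chosen only because it ships with Python (the "xml" parser would need lxml installed, and newer bs4 releases may warn when XML is fed to an HTML parser).

```python
from bs4 import BeautifulSoup

# Hypothetical sample data standing in for the downloaded sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""


def loc_urls(markup):
    """Return the text of every <loc> tag found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    # tag.string is the single text child; tag.contents would be a list.
    return [str(tag.string) for tag in soup.find_all("loc") if tag.string]


if __name__ == "__main__":
    for url in loc_urls(SAMPLE):
        print(url)
```

This also shows why the code in the question prints lists: .contents returns all children of a tag as a list, while .string returns the text itself.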
