Parsing XML in Python without newlines

from lxml import etree

def lxml():
    tree = etree.parse('feed.xml')
    NSMAP = {"nn": "http://www.w3.org/2005/Atom"}
    test = tree.xpath('//nn:category[@term="html"]/..', namespaces=NSMAP)
    for elem in tree.iter():
        print(elem.tag, '\t', elem.attrib)
    print('-------------------------------')
    test1 = tree.xpath('//nn:category', namespaces=NSMAP)
    print('++++++++++++++++++++++++++++++++')
    for node in test1:
        test2 = node.xpath('./../nn:summary', namespaces=NSMAP)  # return a list
        print(test2.xpath('normalize-space(.)'))
    print('*****************************************')
    test3 = tree.xpath('//text()[normalize-space(.)]')  # [normalize-space()] only remove the heading and tailing
    print(test3)

++++++++++++++++++++++++++++++++
['Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.']
['Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.']
['Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.']
['The accessibility orthodoxy does not permit people to\n question the value of features that are rarely useful and rarely used.']
['These notes will eventually become part of a\n tech talk on video encoding.']
['These notes will eventually become part of a\n tech talk on video encoding.']
['These notes will eventually become part of a\n tech talk on video encoding.']
['These notes will eventually become part of a\n tech talk on video encoding.']
['These notes will eventually become part of a\n tech talk on video encoding.']
['These notes will eventually become part of a\n tech talk on video encoding.']
['These notes will eventually become part of a\n tech talk on video encoding.']
['These notes will eventually become part of a\n tech talk on video encoding.']
*****************************************
['\n ', 'dive into mark', '\n ', 'currently between addictions', '\n ', 'tag:diveintomark.org,2001-07-29:/', '\n ', '2009-03-27T21:56:07Z', '\n ', '\n ', '\n ', '\n ', '\n ', 'Mark', '\n ', 'http://diveintomark.org/', '\n ', '\n ', 'Dive into history, 2009 edition', '\n ', '\n ', 'tag:diveintomark.org,2009-03-27:/archives/20090327172042', '\n ', '2009-03-27T21:56:07Z', '\n ', '2009-03-27T17:20:42Z', '\n ', '\n ', '\n ', '\n ', 'Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.', '\n ', '\n ', '\n ', '\n ', 'Mark', '\n ', 'http://diveintomark.org/', '\n ', '\n ', 'Accessibility is a harsh mistress', '\n ', '\n ', 'tag:diveintomark.org,2009-03-21:/archives/20090321200928', '\n ', '2009-03-22T01:05:37Z', '\n ', '2009-03-21T20:09:28Z', '\n ', '\n ', 'The accessibility orthodoxy does not permit people to\n question the value of features that are rarely useful and rarely used.', '\n ', '\n ', '\n ', '\n ', 'Mark', '\n ', '\n ', 'A gentle introduction to video encoding, part 1: container formats', '\n ', '\n ', 'tag:diveintomark.org,2008-12-18:/archives/20081218155422', '\n ', '2009-01-11T19:39:22Z', '\n ', '2008-12-18T15:54:22Z', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', 'These notes will eventually become part of a\n tech talk on video encoding.', '\n ', '\n']..

2 Answers

User

Answer 1 · edited 2024-10-06 12:08:20

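\n is an escape sequence: it stands for a newline character inside the string.

If you inspect the source of the feed, you will see that the word bloated sits at the start of a new line.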

To remove them, you can use ^{} or ^{}.
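As a minimal sketch, assuming plain Python string methods (not necessarily the two the answer had in mind): str.replace drops the newline characters themselves, while str.split followed by str.join collapses every run of whitespace into a single space. The sample string below is hypothetical; it imitates a summary as it might look in the raw file, with indentation after the line break.

```python
# Hypothetical sample: a line break followed by indentation, as in a pretty-printed feed.
summary = 'Putting an entire chapter on one page sounds\n    bloated, but consider this'

# Option 1: remove only the newline characters; the indentation spaces remain.
no_newlines = summary.replace('\n', '')

# Option 2: split on any whitespace and re-join with single spaces.
collapsed = ' '.join(summary.split())

print(repr(no_newlines))
print(repr(collapsed))
```

The second option is usually closer to what is wanted for display, because it also removes the indentation that follows each line break.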

User

Answer 2 · edited 2024-10-06 12:08:20

"My question is why there are so many '\n'. how to delete them?"

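Every bit of whitespace in the XML is selected by XPath as a text node. Nicely formatted (pretty-printed) XML usually contains a lot of newlines and spaces. For example, in the XML below, //text() selects two whitespace-only text nodes: one between <root> and <foo>, and another between </foo> and </root>: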

<root>
    <foo>bar</foo>
</root>

You can use //text()[normalize-space()] to avoid selecting the whitespace-only text nodes in the first place.
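A minimal sketch of the difference, assuming lxml is installed and reusing the three-line sample document from this answer (the variable names are made up for illustration):

```python
from lxml import etree

# The sample document from the answer, with its line breaks and indentation.
xml = "<root>\n    <foo>bar</foo>\n</root>"
root = etree.fromstring(xml)

# Every text node, including the two whitespace-only ones.
all_text = root.xpath('//text()')

# Only text nodes that are non-empty after whitespace normalization.
non_blank = root.xpath('//text()[normalize-space()]')

print(all_text)
print(non_blank)
```

Note that the predicate only filters out whitespace-only nodes. Newlines inside a node that is kept, such as the multi-line summaries in the question, stay as they are and still need to be collapsed separately.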

"additional question is how to directly query the tag of a text, such as make to get the node of "Mark" ( the child of entry's text."

your_text_node.getparent().tag

The above should get the parent element of the text node referenced by the variable your_text_node and then return that element's tag name.
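As a rough illustration, assuming lxml and a cut-down Atom fragment invented for this example (the real feed.xml is not shown in the thread): text results returned by lxml's xpath() are string objects that also carry a getparent() method, so the parent tag can be read directly. For namespaced documents the tag includes the namespace URI in braces; etree.QName can be used to get the local name only.

```python
from lxml import etree

# Hypothetical fragment shaped like an Atom entry; not the asker's actual file.
xml = ('<feed xmlns="http://www.w3.org/2005/Atom">'
       '<entry><author><name>Mark</name></author></entry>'
       '</feed>')
root = etree.fromstring(xml)

# Select the text node whose value is exactly "Mark".
text_node = root.xpath('//text()[. = "Mark"]')[0]

# getparent() returns the element that contains this text.
parent = text_node.getparent()
full_tag = parent.tag                          # namespace-qualified tag
local_name = etree.QName(parent).localname     # tag without the namespace

print(full_tag)
print(local_name)
```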
