I'm trying to use a regular expression on text that contains special characters such as à, è and ù:
import re
filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)
The text below is the result of evaluating the expression:
[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]],[[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. 2. edition. Vol. 1: Briefe der Jahre 1764–1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]`
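Now, when I try to use another regular expression on the text above to extract the words inside the square brackets, the result is wrong. Every word that contains a special character, such as à, ù or è, is cut off at that character, so the result is not what I expected.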
Here is my result:
[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('G.W.F. Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ivan Turgenev', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]
These are the results I want to get:
MATCH 1 1. [2-28]Pedro Calderón de la Barca
MATCH 2 1. [43-72]Christian Fürchtegott Gellert
MATCH 3 1. [86-102]Oliver Goldsmith
MATCH 4 1. [118-123]Hafez
MATCH 5 1. [129-152]Johann Gottfried Herder
MATCH 6 1. [165-170]Homer
MATCH 7 1. [176-184]Kālidāsa
MATCH 8 1. [190-194]Kant
MATCH 9 1. [200-228]Friedrich Gottlieb Klopstock
MATCH 10 1. [244-268]Gotthold Ephraim Lessing
MATCH 11 1. [282-295]Carl Linnaeus
MATCH 12 1. [310-326]James Macpherson
MATCH 13 1. [343-364]Jean-Jacques Rousseau
MATCH 14 1. [379-397]Friedrich Schiller
MATCH 15 1. [412-431]William Shakespeare
MATCH 16 1. [449-456]Spinoza
MATCH 17 1. [462-480]Emanuel Swedenborg
MATCH 18 1. [501-522]Karl Robert Mandelkow
MATCH 19 1. [659-685]Johann Joachim Winckelmann
All of the regular expressions were tested online, and they work fine there. Is there a way to actually include these special characters?
In Python 3, the regex does not compile (the ur'' string prefix is a syntax error there). Changing it to a unicode (non-raw) string seemed to work for me.
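A minimal Python 3 sketch of this point, assuming the goal is to capture the page name that follows `[[` in each link. The pattern `link_pattern` and the shortened sample string are illustrative only and are not the second regex from the question. In Python 3, `str` patterns are Unicode-aware by default, so a plain raw string is used here without `re.U`.

```python
import re

# Shortened sample in the same format as the output shown in the question.
text = (
    "[[Pedro Calderón de la Barca|Calderón]], [[Hafez]], "
    "[[Kālidāsa]], [[Christian Fürchtegott Gellert|Gellert]]"
)

# Hypothetical pattern: capture everything after "[[" up to the first "|" or "]".
link_pattern = re.compile(r"\[\[([^\]|]+)")

names = link_pattern.findall(text)
print(names)
```

Because the character class excludes only `]` and `|`, accented letters such as ó, ü and ā are kept without having to be listed explicitly.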
In Python 2, I think the problem is the conversion of the list to a string. Replacing
str(filter_list)
with a different conversion seemed to work for me.
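To illustrate why converting the list to a string can truncate the names: in Python 2, calling `str()` on a list of unicode strings produces the `repr` of each element, where non-ASCII characters appear as escape sequences such as `\xf3`. A pattern that only accepts word characters then stops at the backslash. The sketch below runs on Python 3 and uses the built-in `ascii()` to imitate that Python 2 behaviour. The pattern `name_pattern` and the join-based alternative are assumptions made for illustration, not necessarily the exact regex or fix used here.

```python
import re

filter_list = ["[[Pedro Calderón de la Barca|Calderón]]", "[[Kālidāsa]]"]

# Hypothetical pattern: word characters and spaces that follow "[[".
name_pattern = re.compile(r"\[\[([\w ]+)")

# ascii() escapes non-ASCII characters, similar to what Python 2's str()
# does to a list of unicode strings.
escaped = ascii(filter_list)
truncated = name_pattern.findall(escaped)
print(truncated)  # names are cut at the first escaped character

# Joining the elements keeps the real characters in the text.
joined = " ".join(filter_list)
full = name_pattern.findall(joined)
print(full)
```

The truncated output has the same shape as the question's result, with names ending right before the first accented letter.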