Regular expression does not match Unicode characters

Posted 2024-09-27 17:51:53


I am trying to use a regular expression on text that contains special characters such as à, è and ù:

filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)

The text below is the result of evaluating that expression.

[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]],[[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. 2. edition. Vol. 1: Briefe der Jahre 1764–1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]

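Now, when I run a second regular expression over the text above to extract the names inside the square brackets, the result is wrong. Every name that contains a special character such as à, ù or è is cut off at that character, so the output is not what I expect.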

^{pr2}$

Here is my result:

[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('G.W.F. Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ivan Turgenev', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]

This is the result I want to get:

MATCH 1   1. [2-28] Pedro Calderón de la Barca
MATCH 2   1. [43-72] Christian Fürchtegott Gellert
MATCH 3   1. [86-102] Oliver Goldsmith
MATCH 4   1. [118-123] Hafez
MATCH 5   1. [129-152] Johann Gottfried Herder
MATCH 6   1. [165-170] Homer
MATCH 7   1. [176-184] Kālidāsa
MATCH 8   1. [190-194] Kant
MATCH 9   1. [200-228] Friedrich Gottlieb Klopstock
MATCH 10  1. [244-268] Gotthold Ephraim Lessing
MATCH 11  1. [282-295] Carl Linnaeus
MATCH 12  1. [310-326] James Macpherson
MATCH 13  1. [343-364] Jean-Jacques Rousseau
MATCH 14  1. [379-397] Friedrich Schiller
MATCH 15  1. [412-431] William Shakespeare
MATCH 16  1. [449-456] Spinoza
MATCH 17  1. [462-480] Emanuel Swedenborg
MATCH 18  1. [501-522] Karl Robert Mandelkow
MATCH 19  1. [659-685] Johann Joachim Winckelmann

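All of the regular expressions were tested online, and they work fine there. Is there a way to make the match actually include these special characters?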


1 answer
User
#1 · Posted 2024-09-27 17:51:53

In Python 3 the regex does not compile (the ur'...' prefix is a syntax error there). It seemed to work for me once I changed this:

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'

to a unicode (non-raw) string:

^{pr2}$
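
For reference, here is a minimal Python 3 sketch of the same kind of pattern. It assumes Python 3, where every str literal is already Unicode, so a plain r'' raw string can stand in for the ur'' prefix and re.U is the default for str patterns. The sample text is shortened from the question, and the names link_target and sample are only illustrative. Escaping the pipe inside the lookahead is my own tidy-up, not part of the original pattern.

```python
import re

# Plain raw string: Python 3 rejects the ur'' prefix, and str patterns
# are Unicode-aware by default, so re.U is not needed.
# Assumption: the lookahead is meant to stop at "]]" or at a literal "|".
link_target = re.compile(r'(?<=\[\[)[\w\s.-]+(?=\]\]|\|)')

# Shortened sample taken from the text in the question.
sample = ('[[Christian Fürchtegott Gellert|Gellert]], [[Homer]], '
          '[[Jean-Jacques Rousseau|Rousseau]]')

# No capturing groups, so findall returns plain strings instead of tuples.
names = link_target.findall(sample)
print(names)
```

With no capturing groups in the pattern, the result is a flat list of names rather than the (name, '') tuples shown in the question.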

In Python 2, I think the problem is the conversion of the list to a string. Changing str(filter_list) to {} seems to work for me.
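
The truncation can be reproduced with a short sketch. It assumes the second regex was applied to str(filter_list), as the answer suggests. In Python 2 that call renders each element with repr(), which turns non-ASCII characters into backslash escapes. Python 3's ascii() produces the same kind of escapes, so it stands in for the Python 2 behaviour here. The sample text is shortened from the question, and the variable names are only illustrative.

```python
import re

# The pattern from the answer, written as a plain raw string for Python 3.
pattern = re.compile(r'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))')

# Shortened sample taken from the text in the question.
text = '[[Pedro Calderón de la Barca|Calderón]], [[Kālidāsa]], [[Kant]]'

# Matching the Unicode text directly keeps the accented letters.
direct = [m[0] for m in pattern.findall(text)]

# ascii() escapes non-ASCII characters (ó becomes \xf3, ā becomes \u0101),
# which mimics what str() of a list of unicode strings does in Python 2.
escaped_text = ascii([text])
from_escaped = [m[0] for m in pattern.findall(escaped_text)]

print(direct)
print(from_escaped)
```

The escaped copy yields 'Pedro Calder' and 'K', the same truncation as in the question: the backslash is not in [\w\s.-], and the empty alternative in (?=|) lets the match end at any position. Running the regex on the Unicode text itself avoids the problem.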
