Porting Java slug creation to Python

Code

Java code

    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.regex.Pattern;

    public class Example {

        // Matches every character that is neither a word character nor a hyphen
        private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");

        public static String makeSlug(String input) {
            String normalized = Normalizer.normalize(input, Form.NFD);
            String noNonlatinNormalized = NONLATIN.matcher(normalized).replaceAll("");
            return noNonlatinNormalized;
        }

        public static void main(String[] args) {
            String testString = "thiß-täst";
            String slug = makeSlug(testString);
            // Same regex, but applied without normalizing first
            String noNormalize = NONLATIN.matcher(testString).replaceAll("");

            System.out.println(String.format("Start string \t'%s'", testString));
            System.out.println(String.format("Slug creation \t'%s'", slug));
            System.out.println(String.format("Without normalize \t'%s'", noNormalize));
        }
    }

Java output

    # Start string          'thiß-täst'
    # Slug creation         'thi-tast'
    # Without normalize     'thi-tst'

Python code

    import regex
    import unidecode

    NONLATIN = regex.compile("[^[:ascii:]-]")  # works better (i.e. closer to Java) than [^\w-]


    def make_slug(string: str) -> str:
        unidecoded = unidecode.unidecode(string)
        no_nonlatin_unidecoded = NONLATIN.sub("", unidecoded)
        return no_nonlatin_unidecoded


    if __name__ == "__main__":
        test_string = "thiß-täst"
        slug = make_slug(test_string)
        # Same regex, but applied without unidecode first
        no_unidecode = NONLATIN.sub("", test_string)

        print("Start string \t'%s'" % test_string)
        print("Slug creation \t'%s'" % slug)
        print("Without unidecode \t'%s'" % no_unidecode)

Python output

    # Start string          'thiß-täst'     # Same start string
    # Slug creation         'thiss-tast'    # PROBLEM -> unidecode turns "ß" to "ss"
    # Without unidecode     'thi-tst'       # Regex Java-to-Python translation is OK

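Notes

Also, the behavior of Java's Normalizer.normalize is peculiar:

One can check that Normalizer.normalize("thiß-täst", Form.NFD) returns a string that prints as "thiß-täst"

NONLATIN.matcher(normalized).replaceAll("") returns "thi-tast" (this is what makeSlug returns)

NONLATIN.matcher("thiß-täst").replaceAll("") returns "thi-tst" (as shown in the Java output)

This shows that Normalizer.normalize clearly has an effect, even though it does not appear to touch the string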
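The effect described above can be made visible from the Python side. The following is a minimal sketch, assuming that NFD in Python's standard unicodedata module follows the same Unicode decomposition rules as Java's Form.NFD; the variable names are illustrative only. It shows that the normalized string prints the same but is one code point longer, because "ä" becomes "a" plus a combining diaeresis, while "ß" has no decomposition and stays as it is.

```python
import unicodedata

# Escapes are used so the input is unambiguously the precomposed form:
# U+00DF is "ß", U+00E4 is "ä".
text = "thi\u00df-t\u00e4st"
decomposed = unicodedata.normalize("NFD", text)

# The two strings render identically but are not equal.
print(text == decomposed)          # False
print(len(text), len(decomposed))  # 9 10

# List the code points: "ä" is now "a" (U+0061) followed by U+0308,
# and "ß" (U+00DF) is unchanged.
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Under that assumption, the regex step then removes the combining mark and the "ß", which is consistent with the "thi-tast" that makeSlug returns.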

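On the other hand, Python's unidecode.unidecode turns "thiß-täst" into "thiss-tast". Converting "ä" to "a" is not a problem, since Java ends up doing the same. Converting "ß" to "ss", however, is what causes the mismatch.

PS: I would rather avoid a quick fix of the form string.replace("ß", ""); my goal is to stay as close to the Java behavior as possible.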

1 Answer

User
#1 · Posted 2024-07-05 07:45:21

The unicodedata module might be of interest here:

    import regex
    import unicodedata

    NONLATIN = regex.compile("[^[:ascii:]-]")


    def make_slug(string: str) -> str:
        # Decompose accented characters into base letter + combining mark
        normalized = unicodedata.normalize("NFD", string)
        slug = NONLATIN.sub("", normalized)
        return slug


    if __name__ == "__main__":
        test_string = "thiß-täst"
        slug = make_slug(test_string)
        print("Start string \t'%s'" % test_string, "Slug creation \t'%s'" % slug, sep="\n")

    # Start string          'thiß-täst'
    # Slug creation         'thi-tast'

I think it is fair to assume that Python's unicodedata.normalize("NFD", string) behaves much like Java's Normalizer.normalize(string, Form.NFD), or at least much more like it than unidecode.unidecode(string) does.
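As a possible variation on the answer above, the same idea can be sketched with the standard library only. This assumes that Java's \w is ASCII-only by default (no UNICODE_CHARACTER_CLASS flag), so Python's re.ASCII flag is used to get a comparable character class. The names NON_WORD and make_slug_stdlib are hypothetical and not part of the original posts.

```python
import re
import unicodedata

# Assumption: mirrors Java's "[^\\w-]", where \w is ASCII-only by default.
# re.ASCII restricts \w to [a-zA-Z0-9_] in Python as well.
NON_WORD = re.compile(r"[^\w-]", re.ASCII)


def make_slug_stdlib(string: str) -> str:
    """Decompose with NFD, then drop everything except ASCII word chars and '-'."""
    normalized = unicodedata.normalize("NFD", string)
    return NON_WORD.sub("", normalized)


if __name__ == "__main__":
    # U+00DF is "ß", U+00E4 is "ä"
    print(make_slug_stdlib("thi\u00df-t\u00e4st"))  # thi-tast
```

Unlike "[^[:ascii:]-]", this pattern also removes ASCII spaces and punctuation, which is what the Java pattern does; whether that is wanted depends on the inputs being slugged.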
