Java regex for Latin words with symbols


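I need to split a text and keep only words, numbers, and hyphenated words. I also need to keep Latin words, so I used \p{L}, which gives me é, ú, ü, ã, and so on. For example: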

    // The inner double quote and backslash are escaped so the literal compiles.
    String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \" ' : ; > < / \\ | , here some is wrong… * + () e -";
    Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
    String[] words = pattern.split(myText);

What is wrong with this regex? Why does it match symbols such as "(", "+", "-", "*", and "|"?

Some of the results are:

    dresse      // OK
    sud-est     // OK
    occident)   // WRONG
    987         // OK
    ()          // WRONG
    (a          // WRONG
    *           // WRONG
    -           // WRONG
    +           // WRONG
    (           // WRONG
    |           // WRONG

This is how I intended the regex to be read:

    [^\p{L}+(\-\p{L}+)*\d]+

- The word separator will be:
- `[^ ... ]` no sequence in:
- `\p{L}+` any Latin letter
- `(\-\p{L}+)*` optionally hyphenated
- `\d` or numbers
- `[ ... ]+` once or more.


2 answers

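# Answer 1
If the opening bracket of a character class is followed by ^, the class is negated: it matches any character that is not listed in it. So your regex matches one or more characters that are anything except Unicode letters, +, (, -, ), *, and digits.

Note that characters such as +, (, ), and * have no special meaning inside a character class.

What Pattern.split does is split the string wherever the regex matches. Your regex matches whitespace (it is not listed in the class), so a split happens at every run of one or more whitespace characters, while the symbols listed in the class stay in the tokens. That is why you get the result you see.
For example, consider this: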
```
Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda  a  f  g")) {
    System.out.println("==>" + s);
}
```
The output will be as follows (the second line is the arrow followed by two spaces):

    ==>sd
    ==>  
    ==>  f  g
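The same reasoning can be checked against the pattern from the question. The sketch below is only an illustration: the class name `NegatedClassDemo`, the helper `split`, and the short sample string are made up here, not part of the original post. It shows that the characters listed inside the negated class are kept, while everything else acts as a separator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // The pattern from the question: a negated character class, repeated one or more times.
    static final Pattern SEPARATOR = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");

    // Hypothetical helper: splits the text on the separator pattern.
    static List<String> split(String text) {
        return Arrays.asList(SEPARATOR.split(text));
    }

    public static void main(String[] args) {
        // '(', ')', '+', '-' and '*' are members of the class, so they are NOT separators.
        // Spaces, ',' and '|' are not members, so runs of them split the text.
        for (String token : split("sud-est (a, b) + * | 987")) {
            System.out.println("==>" + token);
        }
    }
}
```

With this sample, the tokens include `(a`, `b)`, `+` and `*`, which mirrors the unwanted results listed in the question.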
# Answer 2
If I understand your requirements correctly, this regex will do what you want:
```
"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"
```
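It will match:
- A contiguous sequence of Unicode Latin-script characters. I restricted it to Latin because \p{L} matches letters in any script. If your Java version doesn't support this syntax, change \\p{IsLatin} to \\pL.
- Or several such sequences joined by hyphens.
- Or a contiguous sequence of decimal digits (0-9).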
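The difference between `\p{L}` and `\p{IsLatin}` can be seen with a small check. This is a minimal sketch: the class `ScriptVsLetterDemo` and its two helper methods are hypothetical names chosen for illustration, and the non-ASCII sample words are written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

public class ScriptVsLetterDemo {
    // True if every character is a letter in any script.
    static boolean isLetters(String s) {
        return Pattern.matches("\\p{L}+", s);
    }

    // True if every character belongs to the Latin script.
    static boolean isLatin(String s) {
        return Pattern.matches("\\p{IsLatin}+", s);
    }

    public static void main(String[] args) {
        String french = "cath\u00e9drale";                    // "cathédrale", accented Latin letters
        String russian = "\u0441\u043b\u043e\u0432\u043e";    // a Cyrillic word

        System.out.println(isLetters(french) + " " + isLatin(french));
        System.out.println(isLetters(russian) + " " + isLatin(russian));
    }
}
```

Both words pass the `\p{L}` check, but only the French one passes the `\p{IsLatin}` check, which is why the script-based class is the narrower choice here.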
To use the regex, pass it to Pattern.compile, call matcher(String input) to get a Matcher object, and loop to find the matches:

    Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
    Matcher matcher = pattern.matcher(inputString);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }

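If you want to allow words containing an apostrophe ('):

    "\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"

I also escaped - in the character class ['\\-] in case you want to add more characters to it. Strictly speaking, - does not need escaping when it is the first or last character in a character class, but I escape it anyway to be safe.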
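Putting the pieces together, a self-contained version of the matcher loop might look like the sketch below. The class name `LatinWordExtractor`, the `extract` helper, and the shortened sample sentence are assumptions made for illustration; only the regex itself comes from the answer. The accented letter is written as a Unicode escape so the source compiles under any file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatinWordExtractor {
    // Latin-script words, optionally joined by '-' or an apostrophe, or runs of digits.
    static final Pattern TOKEN =
            Pattern.compile("\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+");

    // Hypothetical helper: collects every match instead of splitting on separators.
    static List<String> extract(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u00ee" is the letter i with circumflex, so the fourth word reads "l'île".
        String sample = "987 (sud-est de l'\u00eele, Notre-Dame) + * | e -";
        for (String token : extract(sample)) {
            System.out.println(token);
        }
    }
}
```

Matching the wanted tokens directly, rather than describing the separators, avoids stray symbols: the parentheses and the isolated `+`, `*`, `|` and `-` in the sample are simply never matched.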
