Java regular expression for a string


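I want to parse a string to extract fields from it. The strings (from a dataset) have the following format (`->` denotes a tab and `*` denotes a space):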

Date(yyyymmdd)->Date(yyyymmdd)->*City,*State*-->Description

I am only interested in the first date and the state. I tried a regex like this:

    String txt="19951010 19951011 Red City, WI Description";

    String re1="(\\d+)";    // Integer Number 1
    String re2=".*?";   // Non-greedy match on filler
    String re3="(?:[a-z][a-z]+)";   // Uninteresting: word
    String re4=".*?";   // Non-greedy match on filler
    String re5="(?:[a-z][a-z]+)";   // Uninteresting: word
    String re6=".*?";   // Non-greedy match on filler
    String re7="((?:[a-z][a-z]+))"; // Word 1

    Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6+re7,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find())
    {
        String int1=m.group(1);
        String word1=m.group(2);
        System.out.print("("+int1.toString()+")"+"("+word1.toString()+")"+"\n");
    }

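If the city has two words (Red City), the state is extracted correctly, but if the city has only one word, it does not work. I can't figure it out. I don't have to use a regex and am open to any other suggestions. Thanks.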

1 answer

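# Answer 1
The problem:

Your problem is that each component of your current regex essentially matches either a number or a [a-z] word, separated by anything that is not [a-z], including commas. So for a two-word city, your parts are: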
```
Input: 
  19951010 19951011 Red City, WI Description

Your components:
  String re1="(\\d+)";    // Integer Number 1
  String re2=".*?";   // Non-greedy match on filler
  String re3="(?:[a-z][a-z]+)";   // Uninteresting: word
  String re4=".*?";   // Non-greedy match on filler
  String re5="(?:[a-z][a-z]+)";   // Uninteresting: word
  String re6=".*?";   // Non-greedy match on filler
  String re7="((?:[a-z][a-z]+))"; // Word 1

What they match:
  re1: "19951010"
  re2: " 19951011 "
  re3: "Red" (stops at non-letter, e.g. whitespace)
  re4: " "
  re5: "City" (stops at non-letter, e.g. the comma)
  re6: ", " (stops at word character)
  re7: "WI"
```
But with a one-word city:
```
Input: 
  19951010 19951011 Pittsburgh, PA Description

What they match:
  re1: "19951010"
  re2: " 19951011 "
  re3: "Pittsburgh" (stops at non-letter, e.g. the comma)
  re4: ", " (stops at the next letter)
  re5: "PA" (stops at non-letter, e.g. whitespace)
  re6: " " (stops at word character)
  re7: "Description" (but you want this to be the state)
```
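The solution:

You should do two things. First, simplify your regex; you are going a bit overboard specifying greedy vs. reluctant, and so on. Just use greedy patterns. Second, think about the simplest way to express your rules.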

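Your rules are:
- A date
- A bunch of characters that are not a comma (including the second date and the city name)
- A comma
- The state (one word)

So, build a regex that follows that. You can, as you are doing now, take a shortcut by skipping the second number, but note that you do lose support for cities that start with a digit (which probably won't happen). You also don't care about what follows the state. For example: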
```
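String re1 = "(\\d+)";   // match first number
String re2 = "[^,]*";    // skip everything that's not a comma
String re3 = ",";        // skip the comma
String re4 = "\\s*";     // skip whitespace
String re5 = "([a-z]+)"; // match letters (state)

String regex = re1 + re2 + re3 + re4 + re5;
// Compile case-insensitively, as in the question, so [a-z] also matches "WI"
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);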
```
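
To see the rules applied end to end, here is a minimal sketch that runs a regex of that shape against both sample inputs. The class name `StateParser` and the `parse` helper are made up for illustration, and the pattern is compiled with `CASE_INSENSITIVE` on the assumption that state codes are upper case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StateParser {
    // First number, everything up to the first comma, the comma,
    // optional whitespace, then the state as a run of letters.
    private static final Pattern LINE =
            Pattern.compile("(\\d+)[^,]*,\\s*([a-z]+)", Pattern.CASE_INSENSITIVE);

    /** Returns {date, state}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "19951010 19951011 Red City, WI Description",
            "19951010 19951011 Pittsburgh, PA Description"
        };
        for (String s : samples) {
            String[] fields = parse(s);
            System.out.println("(" + fields[0] + ")(" + fields[1] + ")");
        }
    }
}
```

Because the filler is "anything that is not a comma", the number of words in the city name no longer matters.
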
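There are other options, but I personally find regex very straightforward for things like this. You could use various combinations of split(), as other posters have detailed. You could look for the comma and the whitespace directly with indexOf() and pull out substrings. You could even convince a Scanner, StringTokenizer, or StreamTokenizer to work for you. However, regexes exist to solve problems like this and are a good tool for the job.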
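
As a rough sketch of the indexOf()/split() route, something like the following could work. The class name `IndexOfParser` is hypothetical, and the code assumes every line has the Date, Date, City, State layout with exactly one comma before the state.

```java
class IndexOfParser {
    /** Returns {date, state}, or null if the line contains no comma. */
    static String[] parse(String line) {
        String trimmed = line.trim();
        int comma = trimmed.indexOf(',');
        if (comma < 0) {
            return null;
        }
        // The first date is the first whitespace-delimited token.
        String date = trimmed.split("\\s+", 2)[0];
        // The state is the first whitespace-delimited token after the comma.
        String state = trimmed.substring(comma + 1).trim().split("\\s+", 2)[0];
        return new String[] { date, state };
    }

    public static void main(String[] args) {
        String[] fields = parse("19951010 19951011 Pittsburgh, PA Description");
        System.out.println("(" + fields[0] + ")(" + fields[1] + ")");
    }
}
```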

Here is an example with StringTokenizer:
```
StringTokenizer t = new StringTokenizer(txt, " \t");
String date = t.nextToken();
t.nextToken(); // skip second date
t.nextToken(","); // change delimiter to comma and skip city
t.nextToken(" \t"); // back to whitespace and skip comma
String state = t.nextToken();
```
I feel the regex expresses the rules more clearly, though.

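By the way, for future debugging, it sometimes helps to just print out all of the capture groups, which gives you insight into what is matching what. A good technique is to temporarily put every component of your regex into its own capture group and then print them all out.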
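
One way to do that, sketched here under a few assumptions: the `GroupDebug` class and `matchParts` helper are invented for illustration, and each component is wrapped in a named group (`part1`, `part2`, ...) so that capture groups already inside a component do not shift the numbering.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDebug {
    /** Wraps every component in its own named group and returns what each one matched. */
    static String[] matchParts(String input, String... components) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < components.length; i++) {
            regex.append("(?<part").append(i + 1).append(">")
                 .append(components[i]).append(")");
        }
        Matcher m = Pattern.compile(regex.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(input);
        if (!m.find()) {
            return new String[0]; // nothing matched at all
        }
        String[] parts = new String[components.length];
        for (int i = 0; i < components.length; i++) {
            parts[i] = m.group("part" + (i + 1));
        }
        return parts;
    }

    public static void main(String[] args) {
        // The seven components from the question, unchanged.
        String[] parts = matchParts("19951010 19951011 Pittsburgh, PA Description",
                "(\\d+)", ".*?", "(?:[a-z][a-z]+)", ".*?",
                "(?:[a-z][a-z]+)", ".*?", "((?:[a-z][a-z]+))");
        for (int i = 0; i < parts.length; i++) {
            System.out.println("re" + (i + 1) + ": \"" + parts[i] + "\"");
        }
    }
}
```

Run against the one-word city, the last line of the output shows `re7` capturing "Description" rather than the state, which is the failure described above.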
