Stuck while parsing some statutes with regular expressions

2 answers

User

#1 · edited 2024-09-29 23:28:35

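If you need a complex regular expression, it is important to build it up step by step. That is the only way to avoid getting lost.

Two caveats before we start:

I am not familiar with legal terminology. My terms may be completely wrong.
I will use the verbose flag (re.VERBOSE). With this flag you can freely insert whitespace into the regular expression to improve readability.

Let's start with the statute number and define a regular expression that parses a single component (for example 34A or 83.1).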

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
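As a quick sanity check of the component pattern, here is a minimal sketch. The helper name `is_component` and the sample strings are my own, chosen only for illustration.

```python
import re

# Same component pattern as above; re.VERBOSE makes the spaces insignificant.
nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
nbr_re = re.compile(nbr, re.VERBOSE)


def is_component(text):
    """Return True if the whole string is a single statute component."""
    return nbr_re.fullmatch(text) is not None


for sample in ['34A', '83.1', '6', 'A34', '83.']:
    print(sample, is_component(sample))
```

Using `fullmatch` rather than `match` keeps a trailing character such as the dot in `83.` from slipping through.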

Three to five of these components, separated by hyphens, make up a complete statute number.

statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {
    'nbr': nbr
}
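To see what the `{2,4}` repetition accepts, the sketch below pulls statute numbers out of a sentence. `find_statutes` is a hypothetical helper, and the test sentence is made up from numbers that appear in the sample text.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_re = re.compile(statute, re.VERBOSE)


def find_statutes(text):
    """Return every substring that looks like a full statute number."""
    return [m.group(0) for m in statute_re.finditer(text)]


# "1-26" has only two components, so it is not treated as a statute number.
print(find_statutes('see 34A-6-87.1 and 46A-1-83.1, but not chapter 1-26'))
```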

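With that in place, we can define a regular expression that matches either a single statute or a range. We use two groups to capture the statutes. The second group will be empty when no range is given.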

statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
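A small sketch of how the two groups behave, with `parse_range` as an illustrative helper of my own. Note that in Python an unmatched optional group comes back as `None` rather than as an empty string.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
range_re = re.compile(statute_or_range, re.VERBOSE)


def parse_range(text):
    """Return (first, last) from the start of text; last is None for a single statute."""
    m = range_re.match(text)
    return m.groups() if m else None


print(parse_range('34A-6-88 to 23-34-1A Transferred.'))
print(parse_range('34A-1-28 Repealed.'))
```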

Now we can build a pattern that matches the entire first line. At this point it is easy to handle the comma that sometimes appears.

action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'

first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' %{
    'statute_or_range': statute_or_range,
    'action': action
}
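The first-line pattern can be tried on its own before adding the rest. In this sketch `parse_first_line` is a hypothetical helper, and the flags are the same ones used for the final regex except re.DOTALL, which is not needed here.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'
first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' % {
    'statute_or_range': statute_or_range,
    'action': action
}
first_line_re = re.compile(first_line, re.VERBOSE | re.IGNORECASE)


def parse_first_line(line):
    """Return the (first, last) statute groups, or None if the line is not a status line."""
    m = first_line_re.match(line)
    return m.groups() if m else None


print(parse_first_line('34A-6-88, Transferred.\n'))
print(parse_first_line('34A-6-88 to 23-34-1A Transferred.\n'))
print(parse_first_line('34A-6-87.1 Disposal of tire waste.\n'))
```

Because the pattern ends in `\s+`, the line has to be followed by at least one whitespace character (here the newline).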

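I am not sure how much text you want to match. My impression is that you want to capture everything up to the start of the next statute, which is a line beginning with a statute number. So: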

end = r'(?= \n %(statute)s )' % {
    'statute': statute
}

Putting these together gives the regular expression:

pattern = r'%(first_line)s (.*?) %(end)s' % {
    'first_line': first_line,
    'end': end
}

import re

regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)
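For reference, here are all the pieces assembled into one runnable sketch. The `parse` helper and the abridged sample text are my own assumptions. One thing worth noting: the lookahead requires a following line that starts with a statute number, so as written the very last entry of a text is not matched.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'
first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' % {
    'statute_or_range': statute_or_range,
    'action': action
}
end = r'(?= \n %(statute)s )' % {'statute': statute}
pattern = r'%(first_line)s (.*?) %(end)s' % {'first_line': first_line, 'end': end}
regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)


def parse(text):
    """Return (first statute, end of range or None, body) for each status entry."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in regex.finditer(text)]


# Abridged sample; the final line only serves as the terminator for the lookahead.
text = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295.\n"
    "\n"
    "34A-6-89 Scale device required.\n"
)

for entry in parse(text):
    print(entry)
```

If the last entry matters, one option is to extend the lookahead with an end-of-input alternative such as `\Z`.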

See it in action.

User

#2 · edited 2024-09-29 23:28:35

Sample text:

34A-6-87.1 Disposal of tire waste Collection or processing sites Penalties for violations.
     34A-6-87.1. Disposal of tire waste Collection or processing sites Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
Source:
  SL 1998, ch 202, § 1.          Source:

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88, Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88 to 23-34-1A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-6-89-34 Scale device required Records Report Contents Permit for longer capacity disposal.
     34A-6-89. Scale device required Records Report Contents Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
Source:
  SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:

I assume you want to split this text into blocks, each one starting at a cited statute.

If so, simplify the regex. You can do:

'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))'

^ asserts position at the start of a line
1st capturing group (\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))
\d+ matches one or more digits [0-9]
\w+ matches one or more word characters [a-zA-Z0-9_]
- matches the character - literally
\d+ matches one or more digits [0-9]
- matches the character - literally
\d+ matches one or more digits [0-9]
(?:[,.\-0-9A-Z]+)? optional non-capturing group: one or more of , . - 0-9 A-Z
[ \t]+ matches one or more spaces or tabs
.*? matches any character, as few times as possible (lazy)
(?=\n\n|\n+\Z|\Z) positive lookahead: asserts that one of the alternatives below can be matched
1st alternative: \n\n
2nd alternative: \n+\Z
3rd alternative: \Z
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
s modifier: single line. Dot matches newline characters

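Notes:

The ^ anchor is used together with re.S | re.M, so it matches at the start of every line while the dot also matches newlines.
The positive lookahead (?=\n\n|\n+\Z|\Z) is moved to the end of the pattern.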
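The same pattern can also be written with re.VERBOSE so that the explanation lives next to each piece. This is only a sketch: `block_re`, `split_blocks` and the abridged sample text are names and data I made up for illustration.

```python
import re

block_re = re.compile(r'''
    ^                            # start of a line (needs re.M)
    (
        \d+ \w+ - \d+ - \d+      # leading statute number, e.g. 34A-6-88
        (?: [,.\-0-9A-Z]+ )?     # optional tail such as ".1", "-2A" or ","
        [ \t]+                   # spaces or tabs before the caption
        .*?                      # as little as possible, newlines included (re.S)
        (?= \n\n | \n+\Z | \Z )  # stop before a blank line or the end of the text
    )
''', re.VERBOSE | re.S | re.M)


def split_blocks(txt):
    """Return the list of statute blocks found in txt."""
    return [m.group(1) for m in block_re.finditer(txt)]


txt = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295.\n"
)

for block in split_blocks(txt):
    print(repr(block))
```

Whitespace inside a character class is kept in verbose mode, which is why `[ \t]+` still matches a literal space.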

Example in regex101

Once you have the individual blocks, you can parse them further to find what you need. A simple example:

import re

# txt holds the sample text shown above
statutes = {}
pat = re.compile(r'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))', re.S | re.M)
for block in pat.finditer(txt):
    # Without re.S the dot stops at the newline, so only the first line is searched
    m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block.group(1))
    if m:
        statutes.setdefault(m.group(1), []).append(block.group(1))
    else:
        statutes.setdefault('Enacted', []).append(block.group(1))

for status in sorted(statutes):
    print('{} ============\n{}\n'.format(status, '\n\n'.join(statutes[status])))

This groups the sample text by the status of each statute (enacted, repealed, transferred, and so on).

Like this:

Enacted ============
34A-6-87.1 Disposal of tire waste Collection or processing sites Penalties for violations.
     34A-6-87.1. Disposal of tire waste Collection or processing sites Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
Source:
  SL 1998, ch 202, § 1.          Source:

34A-6-89-34 Scale device required Records Report Contents Permit for longer capacity disposal.
     34A-6-89. Scale device required Records Report Contents Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
Source:
  SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:

Repealed ============
34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

Transferred ============
34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88, Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88 to 23-34-1A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

To show how simple the regex can be: at least for the sample text, you can get the same result with Python's split method on \n\n:

statutes = {}
for block in txt.split('\n\n'):
    m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block)
    if m:
        statutes.setdefault(m.group(1), []).append(block)
    else:
        statutes.setdefault('Enacted', []).append(block)
# etc
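The fragment above could be filled out into a small function along these lines. `group_by_status` and `STATUS_RE` are illustrative names of my own, the sample text is abridged, and skipping empty blocks is an assumption about what you would want for stray blank lines.

```python
import re
from collections import defaultdict

# Without re.S the dot stops at a newline, so only the first line of a block is searched.
STATUS_RE = re.compile(
    r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'
)


def group_by_status(txt):
    """Split txt on blank lines and group the blocks by the status word on their first line."""
    statutes = defaultdict(list)
    for block in txt.strip().split('\n\n'):
        if not block.strip():
            continue  # ignore empty chunks left by extra blank lines
        m = STATUS_RE.search(block)
        status = m.group(1) if m else 'Enacted'
        statutes[status].append(block)
    return dict(statutes)


txt = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295.\n"
    "\n"
    "34A-6-89 Scale device required.\n"
    "     34A-6-89. Scale device required.\n"
)

grouped = group_by_status(txt)
for status in sorted(grouped):
    print(status, len(grouped[status]))
```

One caveat: the alternation is case-sensitive and unanchored, so a caption containing a word such as "Notice" would be filed under "Not".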
