Stuck parsing certain statutes with a regular expression

Posted 2024-09-29 23:28:35


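I am parsing a huge statute file, and I have a specific regular expression for non-standard statutes, because they do not match the usual pattern. This is the regex I am using: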

\n(\d*[A-Z]?-\d*[A-Z]?-\d*[\.\d]*[A-Z]?[-\d*[\.\d]*[A-Z]?]?)(?= (?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)\.\s*\n)(?:\s|\stt.*|\.)(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed).\s*\n(.*?)\n\d*[A-Z]?-\d*[A-Z]?-\d*[\.\d]*[A-Z]?[-\d*[\.\d]*[A-Z]?]?

This works very well except for a few problematic cases.

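  1. It fails when two special cases appear one right after the other; for example:

    34A-1-28 Repealed.
         34A-1-28. Repealed by SL 1986, ch 295, § 7.

    34A-1-28 Repealed.
         34A-1-28. Repealed by SL 1986, ch 295, § 7.

  2. It fails when the statute looks like this: 34A-6-88, Transferred. (a comma after the statute number)
  3. It fails when a range is listed: 34A-6-88 to 23-34-1A Repealed.

Any help solving these three problems would be greatly appreciated. For convenience, I have set up a regex101 containing a large chunk of the statutes I want to tag.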


2 Answers

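If you need a complex regular expression, it is important to build it up step by step. That is the only way to avoid getting lost.

Two caveats before starting:

  • I am not familiar with legal terminology. My terms may be all wrong.

  • I will use the verbose flag (re.VERBOSE). With this flag, you can freely insert whitespace into the regular expression to improve readability.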
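As a small illustration of what the verbose flag changes, the sketch below writes the same pattern twice, once compact and once with re.VERBOSE. The pattern itself (a session-law citation such as "SL 1986, ch 295") is only a made-up example and is not part of the answer's final regex.

```python
import re

text = "Repealed by SL 1986, ch 295, § 7."

# Compact form: every character of the pattern is significant.
compact = re.compile(r'SL\s+(\d{4}),\s+ch\s+(\d+)')

# Verbose form: unescaped whitespace and '#' comments are ignored,
# so real whitespace in the input must be matched with \s (or escaped).
verbose = re.compile(r'''
    SL \s+ (\d{4}) ,   # session-law year
    \s+ ch \s+ (\d+)   # chapter number
''', re.VERBOSE)

compact_groups = compact.search(text).groups()
verbose_groups = verbose.search(text).groups()
print(compact_groups, verbose_groups)  # ('1986', '295') ('1986', '295')
```

Both forms match the same text; only the readability differs.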

Let's start with the statute number and define a regular expression that parses a single component (such as 34A or 83.1).

import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'

Three to five of these components, separated by dashes, make up a complete statute number.

statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {
    'nbr': nbr
}

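With this, we can define a regular expression that matches either a single statute or a range. We use two groups to capture the statutes. The second group will be empty (None in Python) when no range is given.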

statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
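A quick check of the building blocks so far, using statute numbers taken from the sample text. This is only a sketch: fullmatch is used here for testing the pieces in isolation, which the answer itself does not do.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}

rx = re.compile(statute_or_range, re.VERBOSE)

# A single statute: the second group stays None.
single = rx.fullmatch('34A-6-87.1').groups()
# A range: both groups are filled.
ranged = rx.fullmatch('34A-6-88 to 23-34-1A').groups()
# Only two components (a chapter reference), so it is not a statute number.
too_short = rx.fullmatch('1-26')

print(single)     # ('34A-6-87.1', None)
print(ranged)     # ('34A-6-88', '23-34-1A')
print(too_short)  # None
```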

Now we can construct a pattern that matches the entire first line. At this point it is easy to handle the comma that sometimes appears.

action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'

first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' %{
    'statute_or_range': statute_or_range,
    'action': action
}

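I am not sure how much text you want to match. My impression is that you want to capture everything up to the beginning of the next statute, which is a line that starts with a statute number. So: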

end = r'(?= \n %(statute)s )' % {
    'statute': statute
}
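The lookahead matters for the first problem in the question. The original pattern appears to consume the next statute number at its end, which leaves nothing for the following match to start on, and that is probably why back-to-back entries were missed. A toy illustration, with made-up record headers A1 to A4 standing in for statute numbers:

```python
import re

text = "A1 x\nA2 y\nA3 z\nA4 end"

# Consuming the next header: it becomes part of the current match,
# so the record that starts there can no longer be matched.
consuming = re.findall(r'(A\d) (\w)\n(?:A\d)', text)

# Lookahead: the next header is checked but not consumed,
# so it is still available as the start of the next match.
lookahead = re.findall(r'(A\d) (\w)(?=\nA\d)', text)

print(consuming)  # [('A1', 'x'), ('A3', 'z')]
print(lookahead)  # [('A1', 'x'), ('A2', 'y'), ('A3', 'z')]
```

In both variants the last record is not reported because no header follows it; the same limitation applies to a statute at the very end of the file.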

Combining these gives the final regular expression:

pattern = r'%(first_line)s (.*?) %(end)s' % {
    'first_line': first_line,
    'end': end
}

regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)
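Put together as one runnable script, the pieces above can be tried on a shortened version of the sample text. The sample string and the matches list are additions for illustration; the pattern definitions are the ones from this answer.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
action = (r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved'
          r'|Rejected|Omitted|Not|Executed)')
first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' % {
    'statute_or_range': statute_or_range,
    'action': action
}
end = r'(?= \n %(statute)s )' % {'statute': statute}
pattern = r'%(first_line)s (.*?) %(end)s' % {
    'first_line': first_line,
    'end': end
}
regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)

# Shortened sample: comma case, range case, a plain case, then a normal statute.
sample = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-6-88 to 23-34-1A Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
    "\n"
    "34A-6-89-34 Scale device required.\n"
)

# group(1): statute (or start of range), group(2): end of range or None,
# group(3): the text up to the next statute line.
matches = [(m.group(1), m.group(2), m.group(3).strip())
           for m in regex.finditer(sample)]
for first, last, body in matches:
    print(first, last, body)
```

The three special entries are matched one after another, including the two that are adjacent, and the ordinary statute on the last line is left alone.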

See it in action.

Sample text:

34A-6-87.1 Disposal of tire waste Collection or processing sites Penalties for violations.
     34A-6-87.1. Disposal of tire waste Collection or processing sites Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
Source:
  SL 1998, ch 202, § 1.          Source:

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88, Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88 to 23-34-1A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-6-89-34 Scale device required Records Report Contents Permit for longer capacity disposal.
     34A-6-89. Scale device required Records Report Contents Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
Source:
  SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:

I assume you want to split this text into blocks, delimited by the statute being cited.

If so, simplify the regex. You can do:

'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))'

^ assert position at start of a line
1st Capturing group (\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))
\d+ matches one or more digits [0-9]
\w+ matches one or more word characters [a-zA-Z0-9_]
- matches the character - literally
\d+ matches one or more digits [0-9]
- matches the character - literally
\d+ matches one or more digits [0-9]
(?:[,.\-0-9A-Z]+)? optional non-capturing group: one or more of , . - 0-9 A-Z
[ \t]+ matches one or more spaces or tabs
.*? matches any character, as few times as possible (lazy)
(?=\n\n|\n+\Z|\Z) Positive Lookahead - Assert that the regex below can be matched
1st Alternative: \n\n
2nd Alternative: \n+\Z
3rd Alternative: \Z
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
s modifier: single line. Dot matches newline characters

Notes:

  1. The ^ anchor is used together with re.S | re.M.
  2. The positive lookahead (?=\n\n|\n+\Z|\Z) is moved to the end.
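To see how the anchor and the lookahead delimit a block, here is a minimal sketch on two entries from the sample text. The txt string and the blocks list are assumptions added for illustration.

```python
import re

txt = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
)

# re.M lets ^ match at the start of every line; re.S lets . cross newlines,
# so .*? runs until the lookahead sees a blank line or the end of the text.
pat = re.compile(
    r'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))',
    re.S | re.M)

blocks = [m.group(1) for m in pat.finditer(txt)]
for block in blocks:
    print(repr(block))
```

Each block starts at a statute number in the first column and stops just before the blank line (or the trailing newline) that follows it.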

Example in regex101

Once you have the individual blocks, you can parse them further to find what you need. A simple example:

import re

# txt holds the statute text to be split into blocks.
statutes = {}
pat = re.compile(r'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))', re.S | re.M)
for block in pat.finditer(txt):
    # Look for a status keyword on the first line of the block.
    m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block.group(1))
    if m:
        statutes.setdefault(m.group(1), []).append(block.group(1))
    else:
        statutes.setdefault('Enacted', []).append(block.group(1))

for status in sorted(statutes):
    print('{} ============\n{}\n'.format(status, '\n\n'.join(statutes[status])))

This groups the sample text by the status of each statute (enacted, repealed, transferred, and so on).

Like this:

Enacted ============
34A-6-87.1 Disposal of tire waste Collection or processing sites Penalties for violations.
     34A-6-87.1. Disposal of tire waste Collection or processing sites Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
Source:
  SL 1998, ch 202, § 1.          Source:

34A-6-89-34 Scale device required Records Report Contents Permit for longer capacity disposal.
     34A-6-89. Scale device required Records Report Contents Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
Source:
  SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:

Repealed ============
34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

Transferred ============
34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88, Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88 to 23-34-1A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

To show how simple the regex can be, at least for the sample text, you can get the same result with Python's split method on a blank line (\n\n):

statutes = {}
for block in txt.split('\n\n'):
    m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block)
    if m:
        statutes.setdefault(m.group(1), []).append(block)
    else:
        statutes.setdefault('Enacted', []).append(block)
# etc
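For completeness, the split-based variant can be wrapped in a function and run on a small sample. The function name group_by_status, the txt string, and the strip() call on the input are assumptions added here, not part of the original answer.

```python
import re

# Status keyword on the first line of a block ('.' does not cross newlines here).
STATUS = re.compile(
    r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved'
    r'|Rejected|Omitted|Not|Executed)')


def group_by_status(txt):
    """Group blank-line-separated blocks by the status found on their first line."""
    statutes = {}
    for block in txt.strip().split('\n\n'):
        m = STATUS.search(block)
        key = m.group(1) if m else 'Enacted'
        statutes.setdefault(key, []).append(block)
    return statutes


txt = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
    "\n"
    "34A-6-89-34 Scale device required Records Report Contents.\n"
    "     34A-6-89. Scale device required Records Report Contents.\n"
)

result = group_by_status(txt)
for status in sorted(result):
    print(status, len(result[status]))
```

Splitting on a blank line only works while every entry is separated by exactly one empty line, which holds for the sample text but may not hold for the full file.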
