Using a regular expression in Python to search course listings converted from a PDF

Posted 2024-10-01 13:26:42


I'm writing a regular expression in Python to search for strings in a txt document. The strings I'm looking for look like this:

  1. ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220
  2. ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285 (or ASTD 160).
  3. ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are taking any astronomy course numbered 250 or higher. For students not majoring or minoring in a science.) Prerequisite: MATH 012 or higher.
  4. ASTD 380 American Relations with China and Japan: 1740 to Present (3) (Fulfills the general education requirement in the social sciences.) A study of American political, economic, and cultural relations with China and Japan from the American colonial era to modern times…

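I want the expression to find strings that start with a course code (e.g. ACCT 221) and end with the sentence containing the prerequisites. In some cases there is no prerequisite sentence, as in example 4.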

Here is what I have so far:

[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,100} \(\d\).*?\.(?!\))

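This works for examples 1 and 2, but not for example 3. (In fact, I added (?!\)) to handle example 2, without realizing that there are cases with several sentences, and therefore several periods, inside the parentheses.)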
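To make the failure concrete, here is a minimal sketch that runs the pattern above against examples 2 and 3 as single-line strings. The variable names are only for illustration.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,100} \(\d\).*?\.(?!\))")

example_2 = (
    "ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone "
    "course to be taken in a student's last 15 credits.) Prerequisites: "
    "ASTD 284 (or ASTD 150) and 285 (or ASTD 160)."
)
example_3 = (
    "ASTR 100 Introduction to Astronomy (3) (Not open to students who have "
    "taken or are taking any astronomy course numbered 250 or higher. For "
    "students not majoring or minoring in a science.) Prerequisite: "
    "MATH 012 or higher."
)

match_2 = pattern.search(example_2).group(0)
match_3 = pattern.search(example_3).group(0)

print(match_2 == example_2)  # True: the whole entry is matched
print(match_3)               # cut off at "...numbered 250 or higher."
```

In example 3 the first period inside the parentheses is followed by a space rather than `)`, so the lazy `.*?\.` stops there.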

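I think what I'm looking for is a way to match a string that starts with the expression I've written up to \(\d\) and ends at a period that is not inside parentheses, wherever the parentheses are. I tried adding .* to the negative lookahead at the end, but that didn't work well. I also tried .*? to make it non-greedy, so that it wouldn't return the whole file starting from the first course code, but it made no difference.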

I feel like I'm missing something very simple. Thanks in advance for your help.

Let me know if I need to clarify anything.


3 answers

This is only possible if the parentheses are not nested:

[A-Z]{4} \d{3}(?:(?=([^.()]+))\1|\([^)]*\))+\.
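A minimal sketch of how this pattern might be used, assuming each entry is available as a single-line string. The pattern contains a capturing group, so `re.findall` would return the contents of group 1 rather than the whole match; the sketch uses `search` with `group(0)` instead. The helper name `first_match` is hypothetical.

```python
import re

# The pattern from this answer, unchanged. The (?=(...))\1 construct emulates
# an atomic group: it consumes a run of non-period, non-parenthesis characters.
pattern = re.compile(r"[A-Z]{4} \d{3}(?:(?=([^.()]+))\1|\([^)]*\))+\.")

example_2 = (
    "ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone "
    "course to be taken in a student's last 15 credits.) Prerequisites: "
    "ASTD 284 (or ASTD 150) and 285 (or ASTD 160)."
)
example_3 = (
    "ASTR 100 Introduction to Astronomy (3) (Not open to students who have "
    "taken or are taking any astronomy course numbered 250 or higher. For "
    "students not majoring or minoring in a science.) Prerequisite: "
    "MATH 012 or higher."
)
example_4 = (
    "ASTD 380 American Relations with China and Japan: 1740 to Present (3) "
    "(Fulfills the general education requirement in the social sciences.) "
    "A study of American political, economic, and cultural relations with "
    "China and Japan from the American colonial era to modern times…"
)

def first_match(entry):
    """Return the full matched text, or None when the pattern does not match."""
    m = pattern.search(entry)
    return m.group(0) if m else None

results = [first_match(e) for e in (example_2, example_3, example_4)]
for r in results:
    print(r)
```

Examples 2 and 3 are matched in full, because periods inside parentheses are swallowed by the `\([^)]*\)` branch. Example 4 as quoted has no period outside parentheses, so nothing matches. Note that the pattern ends at the first period outside parentheses and does not itself check for the word "Prerequisite".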

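You're looking for everything from the four-letter department code to the first period after "Prerequisite", right? Then say exactly that in the pattern.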

>>IN:
import re

txt = """
ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220.
ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be
taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285
(or ASTD 160).
ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are
taking any astronomy course numbered 250 or higher. For students not majoring or
minoring in a science.) Prerequisite: MATH 012 or higher."""

# re.DOTALL lets "." cross the line breaks inside a wrapped course entry.
pat = re.compile(r"[A-Z]{4}.*?Prerequisites?.*?\.", re.DOTALL)
courses = pat.findall(txt)
for course in courses:
    print(course+"\n")

>>OUT:
ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220.

ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be
taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285
(or ASTD 160).

ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are
taking any astronomy course numbered 250 or higher. For students not majoring or
minoring in a science.) Prerequisite: MATH 012 or higher.
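One caveat, sketched below under the assumption that the pattern is compiled with `re.DOTALL`: when a course has no prerequisite sentence (example 4 in the question), the lazy `.*?` keeps going until it finds "Prerequisite" in a later entry. A possible variation, which is not part of this answer, is to forbid the scan from crossing another course code before "Prerequisite" is reached.

```python
import re

astd_380 = (
    "ASTD 380 American Relations with China and Japan: 1740 to Present (3) "
    "(Fulfills the general education requirement in the social sciences.) "
    "A study of American political, economic, and cultural relations with "
    "China and Japan from the American colonial era to modern times…"
)
astr_100 = (
    "ASTR 100 Introduction to Astronomy (3) (Not open to students who have "
    "taken or are taking any astronomy course numbered 250 or higher. For "
    "students not majoring or minoring in a science.) Prerequisite: "
    "MATH 012 or higher."
)
text = astd_380 + "\n" + astr_100

# The answer's pattern: it runs from ASTD 380 into the next entry.
naive = re.compile(r"[A-Z]{4}.*?Prerequisites?.*?\.", re.DOTALL)

# Variation (an assumption, not the answer's code): each character consumed
# before "Prerequisite" must not be the start of another course code.
tempered = re.compile(
    r"[A-Z]{4} \d{3}(?:(?![A-Z]{4} \d{3}).)*?Prerequisites?.*?\.",
    re.DOTALL,
)

naive_hits = naive.findall(text)
tempered_hits = tempered.findall(text)

print(len(naive_hits), naive_hits[0][:8])
print(tempered_hits)
```

With the tempered version, the course without prerequisites is simply skipped, and only the ASTR 100 entry is returned.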

To keep each regex simple, there is nothing wrong with using two regular expressions:

import re

text = '''\
ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220
ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285 (or ASTD 160).
ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are taking any astronomy course numbered 250 or higher. For students not majoring or minoring in a science.) Prerequisite: MATH 012 or higher.
ASTD 380 American Relations with China and Japan: 1740 to Present (3) (Fulfills the general education requirement in the social sciences.) A study of American political, economic, and cultural relations with China and Japan from the American colonial era to modern times'''

courses = {}
for line in text.splitlines():
    # Course code: four-letter department followed by a three-digit number.
    course = re.match(r'([A-Z]{4}\s+\d{3})', line).group(1)
    # Everything after "Prerequisite:" or "Prerequisites:" on the same line.
    m = re.search(r'Prerequisites?:\s*(.*)', line)
    if m:
        pre = m.group(1)
    else:
        pre = 'None'
    courses[course] = pre

print('COURSE\t\tPREREQUISITE')

for course in sorted(courses):
    print('{}\t{}'.format(course, courses[course]))

Prints:

COURSE      PREREQUISITE
ACCT 221    ACCT 220
ASTD 380    None
ASTD 485    ASTD 284 (or ASTD 150) and 285 (or ASTD 160).
ASTR 100    MATH 012 or higher.
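The loop above assumes one course per line. If the converted text wraps entries across lines, one possible adaptation is to split before each line that starts with a course code and then unwrap each chunk. This is a sketch under that assumption: it relies on `re.split` accepting zero-width matches (Python 3.7+), `parse_courses` is a hypothetical helper, and it would split wrongly if a wrapped line happened to begin with a course code.

```python
import re

wrapped = """\
ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220.
ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be
taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285
(or ASTD 160).
ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are
taking any astronomy course numbered 250 or higher. For students not majoring or
minoring in a science.) Prerequisite: MATH 012 or higher."""

# Zero-width split point before every line that begins with a course code.
entry_start = re.compile(r"^(?=[A-Z]{4} \d{3} )", re.MULTILINE)

def parse_courses(text):
    """Map each course code to its prerequisite text, or None if absent."""
    courses = {}
    for chunk in entry_start.split(text):
        entry = " ".join(chunk.split())  # undo the line wrapping
        if not entry:
            continue  # the split yields an empty first piece
        code = re.match(r"[A-Z]{4} \d{3}", entry).group(0)
        m = re.search(r"Prerequisites?:\s*(.*)", entry)
        courses[code] = m.group(1) if m else None
    return courses

courses = parse_courses(wrapped)
for code in sorted(courses):
    print("{}\t{}".format(code, courses[code]))
```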
