Parsing a text file without delimiters in Python

Published 2024-06-28 20:09:06


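I have a movie script, The Departed, and I want to parse the data by character name. The text file has no delimiters, but character names are written in all caps, e.g. BILLY. The all-caps names are my only identifier. I've read up on regex and other threads, but I'm not sure where to start...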

import re

with open("Departed.txt", "r") as file:
    data = file.read()
pattern = re.compile(r'BILLY')
matches = pattern.finditer(data)
for match in matches:
    print(match)

This still returns the entire script... https://pastebin.com/226VzLWu


3 Answers

Here's a quick way to get it done (you'll still need to clean this up, but I think your answer is in here):

import re
with open('Departed.txt', 'r') as f:
    data = f.read()

# match all words or sequences of words that are all caps
scene_or_character_re = re.compile(r'\b([A-Z][A-Z\W]+)\W')

groupings = scene_or_character_re.split(data)
# groupings is a list of strings, alternating caps, normal, caps, normal

def cleanup_spaces(s):
    '''helper function to replace whitespace with single spaces'''
    return re.sub(r'\s\s+', ' ', s).strip()

# split list into tuples of length two (caps, normal) 
pointer = iter(groupings)
groups = []
for p in pointer:
    if cleanup_spaces(p) == '':
        continue  # skip blank lines
    actor = cleanup_spaces(p)
    # next() also advances the iterator used by the `for` loop;
    # the '' default avoids StopIteration if the list ends on a caps entry
    line = cleanup_spaces(next(pointer, ''))
    groups.append((actor, line))

This gives you `groups`, a list of `(actor, line)` tuples.
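Once the text is in `(actor, line)` pairs, collecting everything one character says is a matter of grouping by the first element. A minimal sketch, assuming pairs shaped like the ones built above; the function name `lines_by_character` and the sample pairs are made up for illustration:

```python
from collections import defaultdict

def lines_by_character(groups):
    """Group (actor, line) tuples into {actor: [line, ...]}, keeping order."""
    by_character = defaultdict(list)
    for actor, line in groups:
        by_character[actor].append(line)
    return dict(by_character)

# Hypothetical sample pairs, not taken from the real parse result.
sample_groups = [
    ("COSTELLO", "Get him three loaves of bread."),
    ("YOUNG COLIN", "Yeah."),
    ("COSTELLO", "And some soup."),
]

result = lines_by_character(sample_groups)
print(result["COSTELLO"])
```

Looking up `result["BILLY"]` on the real data would then return that character's lines, provided the all-caps split has actually isolated the name.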

Python's `re` module already has a built-in `split`, so try:

import re
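# pass maxsplit and flags by keyword; positional use is deprecated in newer Python versions
re.split(r"\W(?=\b[A-Z ]+\b)", str(data), maxsplit=0, flags=re.X)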

My output, based on your comment (I'm using Python 3):

Resulting list: ['Not sure if this helps, but this is some sample text.', 'YOUNG', 'COLIN Yeah.', "COSTELLO tells the Proprietor to takes three loaves of bread and some soup off the shelves and puts them in Colin's bag.", 'COSTELLO Get him three loaves of bread. And a couple of half gallons of milk. And some soup. He goes over to the fridge and puts two half gallons of milk in the bag. Some soup. Costello turns to Colin.']
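The key idea in that pattern is the lookahead: the split happens at a separator character, but the capitalized name after it is only checked, not consumed. A minimal sketch of the same idea with a deliberately simplified pattern (not the one from the answer) and a made-up sample string:

```python
import re

# Hypothetical sample; whitespace is already collapsed to single spaces.
text = "COSTELLO Get him bread. COLIN Yeah."

# Split at any whitespace that is followed by a word of 2+ capital letters.
# The lookahead (?=...) tests for the name without consuming it,
# so each name stays at the start of its own chunk.
speaker_boundary = re.compile(r"\s(?=[A-Z]{2,}\b)")

parts = speaker_boundary.split(text)
print(parts)  # ['COSTELLO Get him bread.', 'COLIN Yeah.']
```

With a two-word name such as YOUNG COLIN this simplified pattern would also split between the two words, which is the same effect visible in the 'YOUNG', 'COLIN Yeah.' entries above.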

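The formatting is a bit messy, and relying on capitalization alone causes problems, because names are also written in all caps when the script mentions a character outside of a speaker heading. The best result I found was to convert each group of 8 spaces to a tab and then break the line after every run of tabs (equivalently, replace each run of 8-space groups with a newline). That resulted in this: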

     THE DEPARTED
 Written by
     William Monahan
  Based on Infernal Affairs
      SCRIPT AS SHOT COMPILED SEPTEMBER 2006
FADE UP ON
    THE SOUTH BOSTON HOUSING PROJECTS. A MAZE OF BUILDINGS
  AGAINST THE HARBOR.
   COSTELLO (V.O.)
       I don't want to be a product of my
       environment. I want my environment
       to be a product...of me.
    YELLOW RIPPLES PAST THE CAMERA AND WHEN IT CLEARS WE SEE
  THROUGH DIESEL SMOKE: A BUSING PROTEST IN PROGRESS. THE
  SCHOOL-BUS, FULL OF BLACK KIDS, IS HIT WITH BRICKS, ROCKS.
  N.B.: (THIS IS NOT SETTING THE LIVE ACTION IN 1974; IT IS A
  HISTORICAL MONTAGE, THE BACKGROUND FOR COSTELLO'S V.O.).
    INT. THE AUTOBODY SHOP. DAY.
    COSTELLO's profile passes in a dark room.
   COSTELLO (V.O.)
       Years ago, we had the Church. That
       was only a way of saying we had
       each other. The Knights of Columbus
       were head-breakers. They took over
       their piece of the city.
    EXT. SOUTHIE. VARIOUS
    The neighborhood. 1980's. We won't be here long. This isn't
  where Costello ends up. It's where he began. Liquor stores
  with shamrocked signs. MEN FISHING near Castle Island.
  Catholic SCHOOLKIDS playing in an asphalted schoolyard.
   COSTELLO (V.O.)
       Twenty years after an Irishman
       couldn't get a job, we had the
       presidency. That's what the niggers
       don't realize. If I got one thing
       against the black chaps it's this.
       No one gives it to you. You have to
       take it.
    INT. LUNCH COUNTER. DAY
    COSTELLO comes in. The shop is one that sells papers,
  sundries, fountain drinks...and fronts a bookie operation.
   YOUNG COSTELLO
   (leaning over cluttered
    counter)
       Don't make me have to come down
       here again.
 (CONTINUED)
  2.
  CONTINUED:
     PROPRIETOR
       Won't happen again, Mr. C.
    The frightened proprietor hands over money. Fifty bucks, a
  hundred, doesn't matter. COSTELLO is never the threatener.
  His demeanor is gentle, philosophical. Almost a shrink's
  probing bedside manner. He has great interest in the world
  as he moves through it. As if he originally came from a
  different world and his survival in this one depends on close
  continual observation and analysis.
    YOUNG COLIN looks up. CLOSE ON his eyes. He is fourteen or
  fifteen, but small for his age. Bookish.
    COSTELLO eyes the proprietor's TEENAGE DAUGHTER, working
  behind the counter. He takes a propane lighter, and,
  strangely, pays for it (the proprietor startled) and waits
  for change. He lights a MORE cigarette with the lighter.
   YOUNG COSTELLO
       Carmen's developing into a fine
       young lady. You should be proud.
       You get your period yet, Carmen?
    The PROPRIETOR is uneasy. COSTELLO turns to YOUNG COLIN
  (about 14) staring at the local hero. Costello reaches up
  above and behind the counter and takes down some cigarettes.
   YOUNG COSTELLO (CONT'D)
       You Johnny Sullivan's kid?
    COLIN nods.
   YOUNG COSTELLO (CONT'D)
       You live with your grandmother?
    COLIN nods.
   YOUNG COLIN
       Yeah.
    COSTELLO tells the Proprietor to takes three loaves of bread
  and some soup off the shelves and puts them in Colin's bag.
   COSTELLO
       Get him three loaves of bread. And
       a couple of half gallons of milk.
       And some soup.
    He goes over to the fridge and puts two half gallons of milk
  in the bag. Some soup. Costello turns to Colin.
       (CONTINUED)

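That conversion is a one-liner: the `re.sub` call near the top of the script below replaces each run of 8-space groups with a newline.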
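As a minimal sketch of that conversion on a made-up fragment (the helper name `split_on_space_runs` and the sample string are illustrative, not from the original answer):

```python
import re

def split_on_space_runs(text, width=8):
    """Replace each run of one or more `width`-space groups with a newline."""
    pattern = "(?: {%d})+" % width
    return re.sub(pattern, "\n", text)

# Hypothetical fragment: a speaker heading and a dialogue line
# joined by 15 spaces, as if the original line breaks were lost.
sample = " " * 3 + "COSTELLO (V.O.)" + " " * 15 + "I don't want"

converted = split_on_space_runs(sample)
for line in converted.split("\n"):
    print(repr(line))
```

Only whole groups of 8 spaces are consumed, so the 7 spaces left over from the 15 remain as the indentation of the dialogue line, while the 3 spaces before the heading are untouched. That leftover indentation is what makes the block types distinguishable afterwards.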

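With that done you can see some of the formatting, and you may be able to tell dialogue lines from directions. This script tries to guess the content type (though it does get a bit harder later in the file):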

from __future__ import print_function
import re

filename = 'Script_DepartedThe.txt'
with open(filename) as f:
    data = f.read()
script = re.sub('(        )+', '\n', data)

leading_space_re = re.compile('^ +')

block_types = {
        7: 'LINE:      ',
        2: 'DIRECTION: ',
        4: 'DIRECTION: ',
        3: 'CHARACTER: ',
        }

for line in script.split('\n'):
    # number of leading spaces designates block type
    match = leading_space_re.match(line)
    count = 0 if match is None else len(match.group())
    current_block = block_types.get(count, 'UNKNOWN:   ')

    print(current_block, count, line)

Output:

UNKNOWN:    0 
UNKNOWN:    5      THE DEPARTED
UNKNOWN:    1  Written by
UNKNOWN:    5      William Monahan
DIRECTION:  2   Based on Infernal Affairs
UNKNOWN:    6       SCRIPT AS SHOT COMPILED SEPTEMBER 2006
UNKNOWN:    0 FADE UP ON
DIRECTION:  4     THE SOUTH BOSTON HOUSING PROJECTS. A MAZE OF BUILDINGS
DIRECTION:  2   AGAINST THE HARBOR.
CHARACTER:  3    COSTELLO (V.O.)
LINE:       7        I don't want to be a product of my
LINE:       7        environment. I want my environment
LINE:       7        to be a product...of me.
DIRECTION:  4     YELLOW RIPPLES PAST THE CAMERA AND WHEN IT CLEARS WE SEE
DIRECTION:  2   THROUGH DIESEL SMOKE: A BUSING PROTEST IN PROGRESS. THE
DIRECTION:  2   SCHOOL-BUS, FULL OF BLACK KIDS, IS HIT WITH BRICKS, ROCKS.
DIRECTION:  2   N.B.: (THIS IS NOT SETTING THE LIVE ACTION IN 1974; IT IS A
DIRECTION:  2   HISTORICAL MONTAGE, THE BACKGROUND FOR COSTELLO'S V.O.).
DIRECTION:  4     INT. THE AUTOBODY SHOP. DAY.
DIRECTION:  4     COSTELLO's profile passes in a dark room.
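Going one step further toward the original question, the indentation classes can be used to attach dialogue to the preceding speaker. The following is a rough sketch under the assumption that speaker headings have 3 leading spaces and dialogue has 7, as in the output above; `collect_dialogue` and the sample lines are hypothetical, and multi-line parentheticals such as "(leaning over cluttered / counter)" are not handled:

```python
import re

CHARACTER_INDENT = 3  # assumed from the sample output above
LINE_INDENT = 7       # assumed from the sample output above
PAREN_RE = re.compile(r"\s*\([^)]*\)")  # strips "(V.O.)", "(CONT'D)", ...

def collect_dialogue(lines):
    """Return [(speaker, speech), ...] based on leading-space counts."""
    dialogue = []
    speaker, words = None, []
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        text = line.strip()
        if indent == LINE_INDENT and speaker is not None:
            words.append(text)  # dialogue belonging to the current speaker
            continue
        if indent == CHARACTER_INDENT and text.startswith("("):
            continue  # one-line parenthetical under a speaker heading
        # Anything else ends the current speech.
        if speaker is not None and words:
            dialogue.append((speaker, " ".join(words)))
        words = []
        if indent == CHARACTER_INDENT and text:
            speaker = PAREN_RE.sub("", text).strip()
        else:
            speaker = None  # direction, scene heading, or unknown block
    if speaker is not None and words:
        dialogue.append((speaker, " ".join(words)))
    return dialogue

sample_lines = [
    " " * 3 + "COSTELLO (V.O.)",
    " " * 7 + "I don't want to be a product of my",
    " " * 7 + "environment.",
    " " * 4 + "INT. THE AUTOBODY SHOP. DAY.",
    " " * 3 + "YOUNG COLIN",
    " " * 7 + "Yeah.",
]

speeches = collect_dialogue(sample_lines)
for who, what in speeches:
    print(who, "->", what)
```

Filtering `speeches` for a name such as BILLY would then give that character's lines, as long as the indentation pattern holds for the rest of the file.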
