Python: retrieve data from blocks of lines containing specific characters and append the related data to separate lines

Posted 2024-05-09 21:05:36


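I am trying to write a program that picks specific information out of a bulk paste, extracts the relevant parts, and then writes that information out as rows.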

Here is some sample data:

1.  Track1  03:01
VOC:PersonA 
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB

2.  Track2  04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA

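I would like output in which the relevant data for each track is grouped on one line, with entries of the same kind joined by semicolons and the different kinds separated by " - ":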

LyrcistA  -  ComposerA  -  ArrangerA; ArrangerB
LyrcistA; LyrcistC  -  ComposerA  -  ArrangerA

Despite my best efforts, I have not got very far:

while True:
    YodobashiData = input("")
    SplitData = YodobashiData.splitlines()
    print(SplitData)

This returns the following:

['1.  Track1  03:01']
['VOC:PersonA ']
['LYR:LyrcistA']
['COM:ComposerA']
['ARR:ArrangerA']
['ARR:ArrangerB']
[]
['2.  Track2  04:18']
['VOC:PersonB']
['VOC:PersonC']
['LYR:LyrcistA']
['LYR:LyrcistC']
['COM:ComposerA']
['ARR:ArrangerA']

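Although all my data is now in separate lists, I don't know how to identify and extract the information I need while skipping the lists I don't need. Also, I seem to need the while loop, because otherwise only the first list is returned and nothing else.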


3 answers

Here is another answer to your question:

data = """
1.  Track1  03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB

2.  Track2  04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA"""

import re
import collections

# Regular expression to pull apart the headline of each entry
headlinePattern = re.compile(r"(\d+)\.\s+(.*?)\s+(\d\d:\d\d)")

def main():
    # break the data into lines
    lines = data.strip().split("\n")

    # while we have more lines...
    while lines:

        # The next line should be a title line
        line = lines.pop(0)
        m = headlinePattern.match(line)
        if not m:
            raise Exception("Unexpected data format")
        id = m.group(1)
        title = m.group(2)
        length = m.group(3)
        people = collections.defaultdict(list)

        # Now read person lines until we hit a blank line or the end of the list
        while lines:
            line = lines.pop(0)
            if not line:
                break
            # Break the line into label and name
            label, name = re.split(r"\W+", line, maxsplit=1)

            # Add this entry to a map of lists, where the map's keys are the label and the
            # map's values are all the people who had that label
            people[label].append(name)

        # Now we have everything for one entry in the data.  Print everything we got.
        print("id:", id, "title:", title, "length:", length)
        print(" - ".join(["; ".join(person) for person in people.values()]))

        # go on to the next entry...

main()

Result:

id: 1 title: Track1 length: 03:01
PersonA - LyrcistA - ComposerA - ArrangerA; ArrangerB
id: 2 title: Track2 length: 04:18
PersonB; PersonC - LyrcistA; LyrcistC - ComposerA - ArrangerA

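If you really just want the line with all the people on it, you can comment out the line that prints the headline information. To read the data from the user instead of the built-in data string, note that data = input("") returns only a single line per call, so a multi-line paste has to be read in full, for example with sys.stdin.read().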
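Reading a whole paste in one go can be sketched as follows. This is a minimal illustration, not part of the answer above: read_pasted_text is a hypothetical helper name, and an in-memory io.StringIO stream stands in for a real paste so the example runs on its own.

```python
import io
import sys


def read_pasted_text(stream=None):
    """Return everything available on the stream as a single string.

    Hypothetical helper: falls back to standard input when no stream is given.
    """
    if stream is None:
        stream = sys.stdin
    return stream.read()


# Demonstration with an in-memory stream standing in for a real paste.
sample = io.StringIO("1.  Track1  03:01\nVOC:PersonA\n\n2.  Track2  04:18\n")
lines = read_pasted_text(sample).splitlines()
print(lines)
```

When reading from a real terminal, the paste has to be terminated with an end-of-file keystroke (Ctrl-D on Unix-like systems, Ctrl-Z followed by Enter on Windows) before read() returns.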

Assuming your data is in a file named tracks.txt, in the format you specified, the following code should work:

import re

with open('tracks.txt') as fp:
    tracklines = fp.read().splitlines()

def split_tracks(lines):
    track = []
    all_tracks = []
    while True:
        try:
            if lines[0] != '':
                track.append(lines.pop(0))
            else:
                all_tracks.append(track)
                track = []
                lines.pop(0)
        except IndexError:
            all_tracks.append(track)
            return all_tracks

def gather_attrs(tracks):
    track_attrs = []
    for track in tracks:
        attrs = {}
        for line in track:
            match = re.match('([A-Z]{3}):', line)
            if match:
                attr = line[:3]
                val = line[4:].strip()
                try:
                    attrs[attr].append(val)
                except KeyError:
                    attrs[attr] = [val]
        track_attrs.append(attrs)
    return track_attrs

if __name__ == '__main__':
    tracks = split_tracks(tracklines)
    attrs = gather_attrs(tracks)
    for track in attrs:
        semicolons = map(lambda va: '; '.join(va), track.values())
        hyphens = ' - '.join(semicolons)
        print(hyphens)

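The only thing you need to change is the colon characters in your data: some of them are ASCII colons (:) and others are Unicode full-width colons (:), and the latter will break the regular expression.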
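If editing the data is not an option, one alternative is to let the pattern accept both kinds of colon. The following is only a sketch under that assumption; CREDIT and split_credit are illustrative names that do not appear in the code above.

```python
import re

# Accept either an ASCII colon or a full-width colon (U+FF1A) after the
# three-letter label, then capture the rest of the line as the name.
CREDIT = re.compile(r'([A-Z]{3})[:\uFF1A]\s*(.*)')


def split_credit(line):
    """Return (label, name) for a credit line, or None for any other line."""
    m = CREDIT.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()


print(split_credit('VOC:PersonA '))
print(split_credit('LYR\uFF1ALyrcistA'))
print(split_credit('1.  Track1  03:01'))
```

Capturing the name with a group also avoids slicing the line at fixed offsets, so the result does not depend on how wide the colon character is.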

Here is a script that does not use regular expressions.

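  • It assumes that header lines, and only header lines, always start with a digit, and that the overall structure of the header lines and the credits lines is consistent. Blank lines are ignored.

  • Extraction and formatting of the track data are handled separately, so it is easier to change the format, or to use the extracted data in some other way.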
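The structural assumption matters most for header lines: splitting a header on whitespace into exactly three parts only works while titles contain no spaces. As a hedged sketch of one way to relax that, the hypothetical helper below (parse_header is not part of the script that follows) takes the first token as the track number and the last token as the duration, leaving everything in between as the title. The title "Track One" is made-up sample input.

```python
def parse_header(line):
    """Split a header such as '1.  Track One  03:01' into its three fields.

    Assumes the first whitespace-separated token is the track number and
    the last one is the duration; whatever lies between them is the title.
    """
    number, rest = line.split(None, 1)       # peel off the leading '1.'
    title, duration = rest.rsplit(None, 1)   # peel off the trailing '03:01'
    return {'id': number.rstrip('.'), 'title': title, 'duration': duration}


print(parse_header('1.  Track One  03:01'))
print(parse_header('2.  Track2  04:18'))
```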

import collections
import unicodedata


data_from_question = """\
1.  Track1  03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB

2.  Track2  04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA
"""


def prepare_data(data):
    # The "colons" in the credits lines are actually 
    # "full width colons".  Replace them (and other such characters)
    # with their normal width equivalents.
    # If full normalisation is undesirable then we could return
    # data.replace('\N{FULLWIDTH COLON}', ':')
    return unicodedata.normalize('NFKC', data)


def is_new_track(line):
    return line[0].isdigit()


def parse_track_header(line):
    id_, title, duration = line.split()
    return {'id': id_.rstrip('.'), 'title': title, 'duration': duration}


def get_credit(line):
    credit, _, name = line.partition(':')
    return credit.strip(), name.strip()


def format_track_heading(track):
    return 'id: {id} title: {title} length: {duration}'.format(**track)


def format_credits(track):
    order = ['ARR', 'COM', 'LYR', 'VOC']
    parts = ['; '.join(track[k]) for k in order]
    return ' - '.join(parts)


def get_data():
    # The data is expected to be a multiline string.
    return data_from_question


def parse_data(data):
    track = None
    for line in filter(None, data.splitlines()):
        if is_new_track(line):
            if track:
                yield track
            track = collections.defaultdict(list)
            header_data = parse_track_header(line)
            track.update(header_data)
        else:
            role, name = get_credit(line)
            track[role].append(name)
    # Emit the final track, unless the input contained no tracks at all.
    if track:
        yield track


def report(tracks):
    for track in tracks:
        print(format_track_heading(track))
        print(format_credits(track))
        print()


def main():
    data = get_data()
    prepared_data = prepare_data(data)
    tracks = parse_data(prepared_data)
    report(tracks)


main()

Output:

id: 1 title: Track1 length: 03:01
ArrangerA; ArrangerB - ComposerA - LyrcistA - PersonA

id: 2 title: Track2 length: 04:18
ArrangerA - ComposerA - LyrcistA; LyrcistC - PersonB; PersonC
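Because formatting is kept apart from parsing, the layout asked for in the question (no VOC entries, roles in the order LYR, COM, ARR, separated by "  -  ") needs only a different formatter. A minimal sketch, assuming each track maps role labels to lists of names; format_requested is a hypothetical name, and the dictionaries below are typed in by hand from the sample data rather than produced by the parser.

```python
def format_requested(track, order=('LYR', 'COM', 'ARR')):
    """Join names within a role with '; ' and the roles with '  -  '.

    Roles that are missing or empty for a track are skipped.
    """
    parts = ['; '.join(track[role]) for role in order if track.get(role)]
    return '  -  '.join(parts)


track1 = {'VOC': ['PersonA'], 'LYR': ['LyrcistA'],
          'COM': ['ComposerA'], 'ARR': ['ArrangerA', 'ArrangerB']}
track2 = {'VOC': ['PersonB', 'PersonC'], 'LYR': ['LyrcistA', 'LyrcistC'],
          'COM': ['ComposerA'], 'ARR': ['ArrangerA']}

print(format_requested(track1))
print(format_requested(track2))
```

Using track.get(role) rather than track[role] in the filter means a defaultdict is not silently given empty entries for roles a track does not have.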
