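Python: retrieve data from blocks of lines containing specific characters and append the related data to a single line

3 answers

User

#1 · edited 2024-05-09 21:05:36

Here is another answer to your question: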

data = """
1.  Track1  03:01
VOC:PersonA
LYR：LyrcistA
COM：ComposerA
ARR：ArrangerA
ARR：ArrangerB

2.  Track2  04:18
VOC:PersonB
VOC:PersonC
LYR：LyrcistA
LYR：LyrcistC
COM：ComposerA
ARR：ArrangerA"""

import re
import collections

# Regular expression to pull apart the headline of each entry
headlinePattern = re.compile(r"(\d+)\.\s+(.*?)\s+(\d\d:\d\d)")

def main():
    # break the data into lines
    lines = data.strip().split("\n")

    # while we have more lines...
    while lines:

        # The next line should be a title line
        line = lines.pop(0)
        m = headlinePattern.match(line)
        if not m:
            raise Exception("Unexpected data format")
        id = m.group(1)
        title = m.group(2)
        length = m.group(3)
        people = collections.defaultdict(list)

        # Now read person lines until we hit a blank line or the end of the list
        while lines:
            line = lines.pop(0)
            if not line:
                break
            # Break the line into label and name.  \W+ matches both the ASCII
            # colon and the fullwidth colon, so either separator works.
            label, name = re.split(r"\W+", line, maxsplit=1)

            # Add this entry to a map of lists, where the map's keys are the label and the
            # map's values are all the people who had that label
            people[label].append(name)

        # Now we have everything for one entry in the data.  Print everything we got.
        print("id:", id, "title:", title, "length:", length)
        print(" - ".join(["; ".join(person) for person in people.values()]))

        # go on to the next entry...

main()

Result:

id: 1 title: Track1 length: 03:01
PersonA - LyrcistA - ComposerA - ArrangerA; ArrangerB
id: 2 title: Track2 length: 04:18
PersonB; PersonC - LyrcistA; LyrcistC - ComposerA - ArrangerA

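If you only want the lines of names, you can comment out the line that prints the heading information. To read the data from a user prompt instead, replace the built-in data with data = input(""). Note that input() returns only a single line, so a multi-line block like the sample needs a different read, such as sys.stdin.read().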
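
A minimal sketch of reading a whole multi-line block in one call, assuming the text is piped or pasted to standard input. `read_data` is a hypothetical helper, not part of the answer above, and the in-memory stream only stands in for real input:

```python
import io
import sys


def read_data(stream=None):
    """Return everything from the stream (standard input by default)."""
    # Look up sys.stdin at call time so redirected input is respected.
    if stream is None:
        stream = sys.stdin
    return stream.read()


# Demonstration with an in-memory stream standing in for standard input.
sample = io.StringIO("1.  Track1  03:01\nVOC:PersonA\n")
data = read_data(sample)
print(data.splitlines())
```

In the script, `data = read_data()` would then take the place of the built-in string.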

User

#2 · edited 2024-05-09 21:05:36

Assuming your data is in a file named tracks.txt in the format you specified, the following code should work:

import re

with open('tracks.txt') as fp:
    tracklines = fp.read().splitlines()

def split_tracks(lines):
    track = []
    all_tracks = []
    while True:
        try:
            if lines[0] != '':
                track.append(lines.pop(0))
            else:
                all_tracks.append(track)
                track = []
                lines.pop(0)
        except IndexError:
            # lines is exhausted: store the last track and stop
            all_tracks.append(track)
            return all_tracks

def gather_attrs(tracks):
    track_attrs = []
    for track in tracks:
        attrs = {}
        for line in track:
            match = re.match('([A-Z]{3}):', line)
            if match:
                attr = line[:3]
                val = line[4:].strip()
                try:
                    attrs[attr].append(val)
                except KeyError:
                    attrs[attr] = [val]
        track_attrs.append(attrs)
    return track_attrs

if __name__ == '__main__':
    tracks = split_tracks(tracklines)
    attrs = gather_attrs(tracks)
    for track in attrs:
        semicolons = map(lambda va: '; '.join(va), track.values())
        hyphens = ' - '.join(semicolons)
        print(hyphens)

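The only thing that needs changing is the colon characters in the data: some are ASCII colons (:) and others are fullwidth Unicode colons (：), which breaks the regular expression.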
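
One way to make that change in code rather than by hand is to replace the fullwidth colons before matching. This is only a sketch: `normalize_colons` and `CREDIT_PATTERN` are hypothetical names, and the pattern is copied from the regular expression used in `gather_attrs`:

```python
import re

FULLWIDTH_COLON = "\N{FULLWIDTH COLON}"
# Same expression as in gather_attrs: three capitals, then an ASCII colon.
CREDIT_PATTERN = re.compile(r"([A-Z]{3}):")


def normalize_colons(lines):
    """Return a copy of lines with fullwidth colons replaced by ASCII colons."""
    return [line.replace(FULLWIDTH_COLON, ":") for line in lines]


sample_lines = ["VOC:PersonA", "LYR\N{FULLWIDTH COLON}LyrcistA"]
cleaned = normalize_colons(sample_lines)
print(cleaned)
print(all(CREDIT_PATTERN.match(line) for line in cleaned))
```

With this helper, calling `tracklines = normalize_colons(tracklines)` before `split_tracks` should let the rest of the script run on the data unchanged.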

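User

#3 · edited 2024-05-09 21:05:36

Here is a script that does not use regular expressions.

It assumes that header lines, and only header lines, always start with a digit, and that the overall structure of the header lines and credit lines is consistent. Blank lines are ignored.
Extraction and formatting of the track data are handled separately, which makes it easier to change the format or to use the extracted data in some other way.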

import collections
import unicodedata


data_from_question = """\
1.  Track1  03:01
VOC:PersonA
LYR：LyrcistA
COM：ComposerA
ARR：ArrangerA
ARR：ArrangerB

2.  Track2  04:18
VOC:PersonB
VOC:PersonC
LYR：LyrcistA
LYR：LyrcistC
COM：ComposerA
ARR：ArrangerA
"""


def prepare_data(data):
    # The "colons" in the credits lines are actually 
    # "full width colons".  Replace them (and other such characters)
    # with their normal width equivalents.
    # If full normalisation is undesirable then we could return
    # data.replace('\N{FULLWIDTH COLON}', ':')
    return unicodedata.normalize('NFKC', data)


def is_new_track(line):
    return line[0].isdigit()


def parse_track_header(line):
    id_, title, duration = line.split()
    return {'id': id_.rstrip('.'), 'title': title, 'duration': duration}


def get_credit(line):
    credit, _, name = line.partition(':')
    return credit.strip(), name.strip()


def format_track_heading(track):
    return 'id: {id} title: {title} length: {duration}'.format(**track)


def format_credits(track):
    order = ['ARR', 'COM', 'LYR', 'VOC']
    parts = ['; '.join(track[k]) for k in order]
    return ' - '.join(parts)


def get_data():
    # The data is expected to be a multiline string.
    return data_from_question


def parse_data(data):
    track = None
    for line in filter(None, data.splitlines()):
        if is_new_track(line):
            if track:
                yield track
            track = collections.defaultdict(list)
            header_data = parse_track_header(line)
            track.update(header_data)
        else:
            role, name = get_credit(line)
            track[role].append(name)
    # Emit the final track, unless the input contained no tracks at all
    if track:
        yield track


def report(tracks):
    for track in tracks:
        print(format_track_heading(track))
        print(format_credits(track))
        print()


def main():
    data = get_data()
    prepared_data = prepare_data(data)
    tracks = parse_data(prepared_data)
    report(tracks)


main()

Output:

id: 1 title: Track1 length: 03:01
ArrangerA; ArrangerB - ComposerA - LyrcistA - PersonA

id: 2 title: Track2 length: 04:18
ArrangerA - ComposerA - LyrcistA; LyrcistC - PersonB; PersonC
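
Because parsing and formatting are separate, the parsed tracks can be rendered in another format without touching the parser. A sketch, assuming each track is a mapping shaped like the ones `parse_data` yields (string fields plus a list of names per role); `format_track_json` is a hypothetical addition, and the example track is built by hand:

```python
import json


def format_track_json(track):
    """Render one parsed track as a JSON object string."""
    # dict() turns a defaultdict into a plain dict; sort_keys keeps output stable.
    return json.dumps(dict(track), sort_keys=True)


# Hand-built example shaped like a parsed track.
example_track = {
    "id": "1",
    "title": "Track1",
    "duration": "03:01",
    "VOC": ["PersonA"],
    "ARR": ["ArrangerA", "ArrangerB"],
}
print(format_track_json(example_track))
```

Swapping this in for `format_track_heading` and `format_credits` inside `report` would be the only change needed.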
