Python regular expressions on a text file

Posted 2024-10-03 21:29:32


Hi everyone, I have a regex question and need some help. My code looks like this:

# -*- coding: utf-8 -*-
import re

WEEKDAYS = ["nedjelja", "utorak", "četvrtak", "ponedjeljak", "subota", "srijeda", "petak"]

with open('natio_geo_channel.xml', 'r') as input_file, \
        open('nat.xml', 'w') as output_file:
    for line in input_file:
        for x in WEEKDAYS:
            line = line.replace("<para>" + x, "<date>")
        line = re.sub(r"<para>\d{0}", "<start>", line)
        line = re.sub(r"<start>\d{2}\.\d{2}\s/\s/", "</start>", line)
        output_file.write(line)

My file looks like this:

<para>nedjelja1. rujna 2013.</para>
    <para>06.00        na hrvatskom Zona gradnje: Izgradnja zelenog Pekinga</para>
    <para>Kineske nevolje sa zagađenjem problem su s globalnim posljedicama. Pratite ekipu zelenih inženjera koji grade energetski učinkoviti Peking.</para>

What I do is first replace the tag and strip out the weekday (nedjelja), and that part works fine. But how do I get this:

<start>06:00<start><title>Zona gradnje</title><sub>Izgradnja zelenog Pekinga</sub>

from this: <para>06.00 na hrvatskom Zona gradnje: Izgradnja zelenog Pekinga</para>

Can you give me some suggestions or ideas?


2 answers

Try this:

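import re

def main():
    line = r'<para>06.00        na hrvatskom Zona gradnje: Izgradnja zelenog Pekinga</para>'
    # Only handle lines that start with <para> followed by a time such as 06.00
    if re.search(r'^<para>\d{2}\.\d{2}', line):
        line_time = re.findall(r'\d{2}\.\d{2}', line)[0]
        # The title runs from the first capital letter up to the first colon
        line_title = line[line.find(re.findall(r'[A-Z]', line)[0]):line.find(':')]
        # The subtitle is everything between the colon and the closing tag
        line_sub = line[line.find(':') + 1:line.find('</')].strip()

        print('<start>' + line_time + '</start><title>' + line_title + '</title><sub>' + line_sub + '</sub>')

main()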

Let me know if this is what you need.

Output:

<start>06.00</start><title>Zona gradnje</title><sub>Izgradnja zelenog Pekinga</sub>

To read the lines from a file:

import re

with open(r'D:\Trading\PythonScholar\input\input.tx', 'r') as file:
    for line in file:
        # Only handle lines that start with <para> followed by a time such as 06.00
        if re.search(r'^<para>\d{2}\.\d{2}', line.strip()):
            line_time = re.findall(r'\d{2}\.\d{2}', line)[0]
            line_title = line[line.find(re.findall(r'[A-Z]', line)[0]):line.find(':')]
            line_sub = line[line.find(':') + 1:line.find('</')].strip()
            print('<start>' + line_time + '</start><title>' + line_title + '</title><sub>' + line_sub + '</sub>')
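The same slicing logic can also be wrapped in a function that writes to an output file instead of printing, which is closer to the loop in the question. The following is only a sketch: `convert_line` and `convert` are hypothetical names, `io.StringIO` stands in for the real input and output files, and lines without a leading time are assumed to pass through unchanged.

```python
import io
import re

# Matches an optional indent, <para>, then a time such as 06.00
TIME_RE = re.compile(r'^\s*<para>(\d{2}\.\d{2})')


def convert_line(line):
    """Return the converted line, or the line unchanged if it does not fit."""
    match = TIME_RE.search(line)
    capital = re.search(r'[A-Z]', line)
    colon = line.find(':')
    # Leave the line alone unless it has a time, a capital letter and a later colon
    if match is None or capital is None or colon < capital.start():
        return line
    title = line[capital.start():colon]
    sub = line[colon + 1:line.find('</')].strip()
    return '<start>{}</start><title>{}</title><sub>{}</sub>\n'.format(
        match.group(1), title, sub)


def convert(input_file, output_file):
    """Copy input_file to output_file, converting programme lines on the way."""
    for line in input_file:
        output_file.write(convert_line(line))


# In-memory stand-ins for the real files (an assumption made for this demo)
source = io.StringIO(
    '<para>nedjelja1. rujna 2013.</para>\n'
    '    <para>06.00        na hrvatskom Zona gradnje: Izgradnja zelenog Pekinga</para>\n'
)
result = io.StringIO()
convert(source, result)
print(result.getvalue())
```

With real files, `convert` could be called with the two handles opened in a `with open(...)` statement, as in the question.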

Hope this helps.

To convert this:

<para>06.00        na hrvatskom Zona gradnje: Izgradnja zelenog Pekinga</para>

into this:

<start>06:00<start><title>Zona gradnje</title><sub>Izgradnja zelenog Pekinga</sub>

do the following:

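# Raw strings keep \S, \s and the \1..\3 backreferences intact; `line` avoids shadowing the built-in str
line = re.sub(r".*?>(\S+)(?:\s+\S+){2}\s+(.*?):\s*(.*)<.*",
    r"<start>\1<start><title>\2</title><sub>\3</sub>", line)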
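A variation with named groups may be easier to read and adjust. This sketch makes two assumptions that the thread does not confirm: that the time should come out with a colon (`06:00`, as in the desired output), and that the first tag should be closed with `</start>`. It also assumes exactly two filler words (here "na hrvatskom") sit between the time and the title. `PARA_RE`, `REPLACEMENT` and `to_programme` are hypothetical names.

```python
import re

PARA_RE = re.compile(
    r'<para>(?P<hh>\d{2})\.(?P<mm>\d{2})'  # time such as 06.00
    r'(?:\s+\S+){2}\s+'                    # two filler words, e.g. "na hrvatskom"
    r'(?P<title>[^:]+):\s*'                # title, up to the first colon
    r'(?P<sub>.*?)</para>'                 # subtitle, up to the closing tag
)

# \g<name> refers to a named group; the dot in the time becomes a colon
REPLACEMENT = (
    r'<start>\g<hh>:\g<mm></start>'
    r'<title>\g<title></title><sub>\g<sub></sub>'
)


def to_programme(line):
    """Rewrite a matching <para> line; other lines are returned unchanged."""
    return PARA_RE.sub(REPLACEMENT, line)


sample = '<para>06.00        na hrvatskom Zona gradnje: Izgradnja zelenog Pekinga</para>'
print(to_programme(sample))
```

Because `re.sub` leaves non-matching input untouched, `to_programme` can be applied to every line of the file without a separate `re.search` check.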
