Cleaning a tab-delimited file with unescaped newlines

    JOB REF Comment V2  Other
1   3   45  This was a small job    NULL    sdnsdf
2   4   456 This was a large job and I have
    to go onto a new line, but I didn't properly escape so it's on the next row whoops! NULL    NULL
3   7   354 NULL    NULL    NULL

# dat <- readLines("the-Dirty-Tab-Delimited-File.txt")
dat <- c("\tJOB\tREF\tComment\tV2\tOther",
         "1\t3\t45\tThis was a small job\tNULL\tsdnsdf",
         "2\t4\t456\tThis was a large job and I have\t\t",
         "\t\"to go onto a new line, but I didn't properly escape so it's on the next row whoops!\"\tNULL\tNULL\t\t",
         "3\t7\t354\tNULL\tNULL\tNULL")

140338 28855 WA 2 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL 1 NULL NULL NULL NULL NULL NULL NULL NULL 1000 NULL NULL NULL NULL NULL NULL YNNNNNNN (Some text with two newlines) The remainder of the text beneath two newlines NULL NULL NULL 3534a NULL email NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL

2 answers

User

Answer 1 · edited 2024-06-28 11:37:12

No regular expressions needed.

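with open("filename", "r") as data:
    # Map line number -> list of tab-separated fields, line ending removed
    datadict = {}
    for count, linedata in enumerate(data):
        datadict[count] = linedata.rstrip('\r\n').split('\t')

# Assumes each broken row is followed by exactly one continuation line laid out
# like the sample: stray text in item #2, then the two remaining values.
extra_line_numbers = []
for count in datadict:
    if count == 0:  # skip the header line
        continue
    if not datadict[count][1].isdigit():  # item #2 isn't a number: continuation line
        # glue the stray text back onto the previous row's comment field
        datadict[count-1][3] = datadict[count-1][3] + ' ' + datadict[count][1]
        datadict[count-1][4:6] = (datadict[count][2], datadict[count][3])
        extra_line_numbers.append(count)

for x in extra_line_numbers:
    del datadict[x]

with open("newfile", 'w') as data:
    data.writelines(['\t'.join(x) + '\n' for x in datadict.values()])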
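
The same digit check can be applied without fixed column positions. The following is a minimal sketch, not the answerer's code: `merge_continuations` is a hypothetical helper, and it assumes that the first line is a header and that every real row has a number in its second tab-separated field, as in the sample data. It joins any number of continuation lines onto the preceding row with a single space.

```python
def merge_continuations(lines):
    """Join lines whose second tab-separated field is not a number
    onto the previous row.

    Assumption (not from the original answer): line 0 is a header and
    every real data row has a numeric second field.
    """
    merged = []
    for i, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        fields = line.split("\t")
        is_row = len(fields) > 1 and fields[1].isdigit()
        if i == 0 or is_row or not merged:
            merged.append(line)
        else:
            # Continuation: drop the trailing empty fields of the broken row
            # and the padding tabs of the continuation, then glue with a space.
            merged[-1] = merged[-1].rstrip("\t") + " " + line.strip("\t")
    return merged


# Shortened version of the sample data from the question
sample = [
    "\tJOB\tREF\tComment\tV2\tOther\n",
    "1\t3\t45\tThis was a small job\tNULL\tsdnsdf\n",
    "2\t4\t456\tThis was a large job and I have\t\t\n",
    "\tto go onto a new line\tNULL\tNULL\t\t\n",
    "3\t7\t354\tNULL\tNULL\tNULL\n",
]
for row in merge_continuations(sample):
    print(row)
```

Stripping the padding tabs is a choice that fits this sample; a file whose rows legitimately end in empty fields would need a stricter rule.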

User

Answer 2 · edited 2024-06-28 11:37:12

Here is my answer, written in Python.

import re

# This pattern should match correct data lines and should not
# match "continuation" lines (lines added by the unquoted newline).
# This pattern means: start of line, then a number, then white space,
# then another number, then more white space, then another number.

# This program won't work right if this pattern isn't correct.
pat = re.compile(r"^\d+\s+\d+\s+\d+")  # raw string so \d and \s reach the regex engine intact

def collect_lines(iterable):
    itr = iter(iterable)  # get an iterator

    # First, loop until we find a valid line.
    # This will skip the first line with the "header" info.
    for line in itr:
        if pat.match(line):
            # found a valid line; hold it as cur
            cur = line
            break
    else:
        # no valid line anywhere in the input; nothing to yield
        return
    for line in itr:
        # Look at the line after cur.  Is it a valid line?
        if pat.match(line):
            # Line after cur is valid!
            yield cur  # output cur
            cur = line  # hold new line as new cur
        else:
            # Line after cur is not valid; append to cur but do not output yet.
            cur = cur.rstrip('\r\n') + line
    yield cur

data = """\
   JOB  REF Comment V2  Other
@@@1   3   45  This was a small job    NULL    sdnsdf
@@@2   4   456 This was a large job and I have to go onto a new line, 
@@@    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
@@@3   7   354 NULL    NULL    NULL
"""

lines = data.split('@@@')
for line in collect_lines(lines):
    print(">>>{}<<<".format(line))

For your real program, pass the open input file to collect_lines and write each yielded line to the output file.
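
A minimal sketch of running `collect_lines` over a real file, offered as an assumption rather than the answerer's original code: `clean_file` is a hypothetical wrapper, and the compact `collect_lines` below restates the generator from this answer so the block runs on its own. The file names in the usage note are placeholders.

```python
import re

# Assumption carried over from the answer: a valid row starts with
# three whitespace-separated numbers.
pat = re.compile(r"^\d+\s+\d+\s+\d+")


def collect_lines(iterable):
    """Yield complete rows, gluing continuation lines onto the row before them."""
    cur = None
    for line in iterable:
        if pat.match(line):
            if cur is not None:
                yield cur
            cur = line
        elif cur is not None:
            # Continuation line: drop cur's line ending before appending.
            cur = cur.rstrip("\r\n") + line
        # Lines before the first valid row (such as the header) are skipped.
    if cur is not None:
        yield cur


def clean_file(src_path, dst_path):
    """Read src_path, merge broken rows, and write the result to dst_path."""
    with open(src_path, "r") as f_in, open(dst_path, "w") as f_out:
        for line in collect_lines(f_in):
            f_out.write(line)
```

It could then be called as `clean_file("the-Dirty-Tab-Delimited-File.txt", "newfile")`. Because the file object is iterated lazily, the whole file never has to be held in memory.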

Edit: I reworked this and added more comments. I think I also fixed the problem you were seeing.

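When I joined a line onto cur, I was not first stripping the newline from the end of cur. So the joined line was still a split line, and writing it to the file did not really fix anything. Try it now.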
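
The difference shows up on two short strings. This is an illustrative sketch with made-up values, where `cur` and `line` stand for a broken row and its continuation:

```python
cur = "2\t4\t456\tThis was a large job and I have\n"
line = "\tto go onto a new line\tNULL\tNULL\n"

# Without stripping, the newline stays in the middle: still two physical lines.
broken = cur + line

# Stripping cur's line ending first yields one physical line.
fixed = cur.rstrip("\r\n") + line

print(len(broken.splitlines()))  # 2
print(len(fixed.splitlines()))   # 1
```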

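I also reworked the test data so that the test lines end in newlines. My original test split the input on newlines, so the resulting lines did not contain any. Now each line ends with a newline.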
