Separating joined words at capital letters

Posted 2024-06-14 09:09:20


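Using Python, I have to write a script that basically "cleans up" a data text file. So far I have removed all unwanted characters or replaced them with acceptable ones (for example, a dash - can be replaced with a space). Now I've reached the point where I have to separate words that are joined together. Here are the first 15 lines of the text file: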

AccessibleComputing  Computer accessibility
AfghanistanHistory  History of Afghanistan
AfghanistanGeography  Geography of Afghanistan
AfghanistanPeople  Demographics of Afghanistan
AfghanistanCommunications  Communications in Afghanistan
AfghanistanMilitary  Afghan Armed Forces
AfghanistanTransportations  Transport in Afghanistan
AfghanistanTransnationalIssues  Foreign relations of Afghanistan
AssistiveTechnology  Assistive technology
AmoeboidTaxa  Amoeba
AsWeMayThink  As We May Think
AlbaniaHistory  History of Albania
AlbaniaPeople  Demographics of Albania
AlbaniaEconomy  Economy of Albania
AlbaniaGovernment  Politics of Albania

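What I need to do is separate the joined words at the point where a capital letter occurs. For example, I want the first line to look like this: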

Accessible Computing  Computer accessibility

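The script must take a file as input and write the result to an output file. This is what I have so far, and it doesn't work at all! (I'm not sure whether I'm even on the right track.)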

import re

input_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt",'r')
output_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt",'w')

for line in input_file:
    if line.contains('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'):
        newline = line.

output_file.write(newline)

input_file.close()
output_file.close()

3 answers

This isn't the best approach, but it's simple.

from string import ascii_uppercase  # Python 3 name; Python 2 called it `uppercase`

s = 'AccessibleComputing Computer accessibility'

>>> ' '.join(''.join(' ' + c if n and c in ascii_uppercase else c
...                  for n, c in enumerate(word))
...          for word in s.split())
'Accessible Computing Computer accessibility'

By the way, you should do your file reading and writing like this:

f_in = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt"
f_out = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt"

def func(line):
    processed_line = ... # your line processing function
    return processed_line

with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        for line in fin:  # iterate lazily instead of loading everything with readlines()
            fout.write(func(line))
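
As a rough illustration of how the two pieces of this answer could fit together, here is a minimal sketch that plugs the per-character approach into the read/write loop. The names `split_words` and `clean` are hypothetical, and the sketch assumes words are separated by plain spaces. It uses in-memory files so it can be run without touching the disk.

```python
import io
from string import ascii_uppercase


def split_words(line):
    """Insert a space before each capital letter that does not start a word."""
    # Split on single spaces (not bare split()) so the double-space column
    # separator and the trailing newline pass through unchanged.
    return ' '.join(
        ''.join(' ' + c if n and c in ascii_uppercase else c
                for n, c in enumerate(word))
        for word in line.split(' ')
    )


def clean(fin, fout):
    """Copy fin to fout line by line, splitting joined words on the way."""
    for line in fin:
        fout.write(split_words(line))


# Demo with in-memory files; real code would pass objects returned by open().
src = io.StringIO('AccessibleComputing  Computer accessibility\n'
                  'AsWeMayThink  As We May Think\n')
dst = io.StringIO()
clean(src, dst)
print(dst.getvalue(), end='')
```

Because `clean` only needs file-like objects, the same function should work with `with open(f_in) as fin, open(f_out, 'w') as fout: clean(fin, fout)`.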

You can do:

re.sub(r'(?P<end>[a-z])(?P<start>[A-Z])', r'\g<end> \g<start>', line)

This inserts a space between every adjacent lowercase-uppercase letter pair (assuming only English characters).
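
To make the behaviour of that substitution concrete, here is a small sketch that wraps it in a function and runs it on lines from the question. The function names are hypothetical. The second variant is an alternative formulation using zero-width lookarounds, not something this answer proposes; it should give the same result without capturing groups.

```python
import re


def split_on_case_change(line):
    # Same idea as the substitution above, with numbered groups for brevity.
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', line)


def split_with_lookaround(line):
    # Alternative: match the empty position between a lowercase and an
    # uppercase letter and replace that position with a space.
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', line)


sample = 'AfghanistanTransnationalIssues  Foreign relations of Afghanistan'
print(split_on_case_change(sample))
print(split_with_lookaround('AsWeMayThink  As We May Think'))
```

Both variants leave existing whitespace alone, so the two-space gap between the columns survives. A run of capitals such as `HTMLParser` is left unchanged, because it contains no lowercase letter directly followed by an uppercase one.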

I suggest splitting the words with the following regular expression:

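import re

input_file = 'input.txt'
output_file = 'output.txt'

# A capitalized word, or failing that any run of non-whitespace characters.
# Compiled once, outside the loop.
p = re.compile(r'[A-Z][a-z]+|\S+')

with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        for line in f_in:
            matches = p.findall(line)
            # Text mode already translates '\n' to the platform line ending,
            # so writing os.linesep would produce '\r\r\n' on Windows.
            f_out.write(' '.join(matches) + '\n')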

Assuming input.txt contains the text pasted in the question, output.txt will contain:

Accessible Computing Computer accessibility
Afghanistan History History of Afghanistan
Afghanistan Geography Geography of Afghanistan
Afghanistan People Demographics of Afghanistan
Afghanistan Communications Communications in Afghanistan
Afghanistan Military Afghan Armed Forces
Afghanistan Transportations Transport in Afghanistan
Afghanistan Transnational Issues Foreign relations of Afghanistan
Assistive Technology Assistive technology
Amoeboid Taxa Amoeba
As We May Think As We May Think
Albania History History of Albania
Albania People Demographics of Albania
Albania Economy Economy of Albania
Albania Government Politics of Albania
...
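
Note that the output above has a single space between the two columns, whereas the sample in the question separates them with two. If that separator matters, one possible adjustment is to apply the same pattern to each column separately. This is only a sketch: the function name `split_keep_columns` is hypothetical, and it assumes the columns are always separated by exactly two spaces, as in the sample.

```python
import re

# Same pattern as in the answer above.
WORD = re.compile(r'[A-Z][a-z]+|\S+')


def split_keep_columns(line, sep='  '):
    """Split joined words inside each column but keep the column separator."""
    columns = line.rstrip('\n').split(sep)
    return sep.join(' '.join(WORD.findall(col)) for col in columns)


print(split_keep_columns('AccessibleComputing  Computer accessibility\n'))
print(split_keep_columns('AsWeMayThink  As We May Think\n'))
```

The function strips the trailing newline, so the caller would need to add `'\n'` back when writing each line to the output file.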
