Splitting concatenated words at capital letters

AccessibleComputing Computer accessibility
AfghanistanHistory History of Afghanistan
AfghanistanGeography Geography of Afghanistan
AfghanistanPeople Demographics of Afghanistan
AfghanistanCommunications Communications in Afghanistan
AfghanistanMilitary Afghan Armed Forces
AfghanistanTransportations Transport in Afghanistan
AfghanistanTransnationalIssues Foreign relations of Afghanistan
AssistiveTechnology Assistive technology
AmoeboidTaxa Amoeba
AsWeMayThink As We May Think
AlbaniaHistory History of Albania
AlbaniaPeople Demographics of Albania
AlbaniaEconomy Economy of Albania
AlbaniaGovernment Politics of Albania

import re

input_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt",'r')
output_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt",'w')

for line in input_file:
    if line.contains('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'):
        newline = line.
        output_file.write(newline)

input_file.close()
output_file.close()

3 Answers

User

Answer 1 · edited 2024-06-14 09:09:20

This isn't the best way, but it's simple.

>>> from string import ascii_uppercase  # Python 3 name; string.uppercase exists only in Python 2
>>> s = 'AccessibleComputing Computer accessibility'
>>> ' '.join(''.join(' ' + c if n and c in ascii_uppercase else c
...                  for n, c in enumerate(word))
...          for word in s.split())
'Accessible Computing Computer accessibility'

By the way, you should do your file reading and writing like this:

f_in = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt"
f_out = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt"

def func(line):
    processed_line = ...  # your line processing goes here
    return processed_line

# Both files are closed automatically when the with blocks exit
with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        # Iterate over the file lazily instead of loading it all with readlines()
        for line in fin:
            fout.write(func(line))
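For what it's worth, here is a minimal sketch of how the two pieces of this answer could fit together. The names `split_words` and `process_file` are made up for illustration, and the demo writes to a temporary directory instead of the asker's real paths. Note that `line.split()` discards the trailing newline, so it is added back on write.

```python
import os
import tempfile
from string import ascii_uppercase


def split_words(line):
    """Insert a space before each uppercase letter that is not the first
    character of a whitespace-separated word (hypothetical helper)."""
    return ' '.join(
        ''.join(' ' + c if n and c in ascii_uppercase else c
                for n, c in enumerate(word))
        for word in line.split()
    )


def process_file(path_in, path_out):
    """Read path_in line by line and write the processed lines to path_out."""
    with open(path_in, 'r') as fin, open(path_out, 'w') as fout:
        for line in fin:
            # split() dropped the newline, so restore it here
            fout.write(split_words(line) + '\n')


# Demo on throwaway files (assumed sample data taken from the question)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w') as f:
        f.write('AccessibleComputing Computer accessibility\n'
                'AsWeMayThink As We May Think\n')
    process_file(src, dst)
    with open(dst) as f:
        result = f.read()

print(result)
```

One caveat with this character-based approach: a word written entirely in capitals, such as an acronym, gets a space before every letter after the first.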

User

Answer 2 · edited 2024-06-14 09:09:20

You can do:

re.sub(r'(?P<end>[a-z])(?P<start>[A-Z])', r'\g<end> \g<start>', line)

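This inserts a space between every lowercase letter that is immediately followed by an uppercase letter (assuming the text contains only English characters).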
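A quick way to try the substitution on a few lines from the question is sketched below. The names `CAMEL_BOUNDARY` and `split_camel` are invented for this example. Because the pattern only touches lowercase-uppercase pairs, the rest of the line, including its trailing newline, passes through unchanged.

```python
import re

# A lowercase letter immediately followed by an uppercase letter
CAMEL_BOUNDARY = re.compile(r'(?P<end>[a-z])(?P<start>[A-Z])')


def split_camel(line):
    """Put a space between each lowercase-uppercase pair (hypothetical helper)."""
    return CAMEL_BOUNDARY.sub(r'\g<end> \g<start>', line)


samples = [
    'AccessibleComputing Computer accessibility\n',
    'AfghanistanTransnationalIssues Foreign relations of Afghanistan\n',
    'AsWeMayThink As We May Think\n',
]

outputs = [split_camel(s) for s in samples]
for out in outputs:
    print(out, end='')
```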

User

Answer 3 · edited 2024-06-14 09:09:20

I suggest splitting the words with the following regular expression:

import re

input_file = 'input.txt'
output_file = 'output.txt'

# A capitalized word, or else any run of non-whitespace characters
p = re.compile(r'[A-Z][a-z]+|\S+')

with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        for line in f_in:
            matches = p.findall(line)
            # Text mode translates '\n' to the platform line ending on write
            f_out.write(' '.join(matches) + '\n')

Assuming input.txt contains the text pasted in the question, output.txt will contain:

Accessible Computing Computer accessibility
Afghanistan History History of Afghanistan
Afghanistan Geography Geography of Afghanistan
Afghanistan People Demographics of Afghanistan
Afghanistan Communications Communications in Afghanistan
Afghanistan Military Afghan Armed Forces
Afghanistan Transportations Transport in Afghanistan
Afghanistan Transnational Issues Foreign relations of Afghanistan
Assistive Technology Assistive technology
Amoeboid Taxa Amoeba
As We May Think As We May Think
Albania History History of Albania
Albania People Demographics of Albania
Albania Economy Economy of Albania
Albania Government Politics of Albania
...
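It may help to see how this pattern behaves on tokens that are not plain capitalized words. The sketch below uses a made-up helper, `split_with_findall`, and inputs that are not from the question's data: a token that begins with a lowercase letter or contains a run of capitals is consumed whole by the `\S+` branch and therefore stays unsplit.

```python
import re

# Same pattern as in the answer: a capitalized word, or any non-whitespace run
WORD = re.compile(r'[A-Z][a-z]+|\S+')


def split_with_findall(line):
    """Join all pattern matches with single spaces (hypothetical helper).
    Whitespace, including the trailing newline, is not part of any match."""
    return ' '.join(WORD.findall(line))


normal = split_with_findall('AmoeboidTaxa Amoeba\n')
lower_start = split_with_findall('accessibleComputing')
acronym = split_with_findall('HTMLParser')

print(normal)
print(lower_start)
print(acronym)
```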
