Posted 2024-09-29 00:13:30
User
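Hi, I have a document, which I linked here, that contains line breaks. These line breaks don't seem to be created by '\n', though, because I can't get rid of them with strip() or even line[:-2]. I'd like to know how to remove some of the line breaks, mainly on lines that run over on the page, such as: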
Wimer John, gauger & cooper, 232 N Broad, h 1511 Callowhill
If it helps, this is pytesseract OCR text.
Thank you,
Cameron
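It looks to me like each "record" in the file is separated by two line breaks, where "line break" means a DOS-style line ending, that is, a carriage return ('\r') followed by a line feed ('\n'). So we should first split the byte stream on '\r\n\r\n' to get one element per record.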
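One way to check which line endings a file really uses is to read it as raw bytes and count them. This is only a minimal sketch: `count_line_endings` is a hypothetical helper name, and the sample bytes are made up to stand in for the real file.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF, and lone CR in raw bytes."""
    crlf = data.count(b'\r\n')
    # A lone LF or CR is any occurrence that is not part of a CRLF pair
    lone_lf = data.count(b'\n') - crlf
    lone_cr = data.count(b'\r') - crlf
    return {'crlf': crlf, 'lf': lone_lf, 'cr': lone_cr}


# Made-up sample standing in for the OCR output (not the real file)
sample = b'Wimer John, gauger & cooper,\r\n232 N Broad\r\n\r\nWimer John A., sexton\n'
counts = count_line_endings(sample)
print(counts)
```

For the real file, the bytes could come from `open('1128.txt', 'rb').read()`.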
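Then we can use replace() to replace the unwanted line breaks embedded within records (which must be unpaired). It looks to me like some of the single embedded line breaks are preceded by a dash, so it is probably worth replacing '-\r\n' with the empty string to rejoin hyphen-wrapped pieces of text; after that, we should replace any remaining unpaired line breaks with a single space.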
So we have:
    import re

    # newline='' keeps the '\r\n' endings as they are instead of translating them to '\n'
    with open('1128.txt', encoding='utf-8', newline='') as f:
        records = re.split(r'\r\n\r\n', f.read())

    # str.replace() only matches literal text, so the '-\s*\r\n' pattern needs re.sub()
    records = [re.sub(r'-\s*\r\n', '', r).replace('\r\n', ' ') for r in records]

    for record in records:
        print(record)

    ## VVIL
    ## 1076
    ## VVIN
    ##
    ## WILSTACH WILLIAM P. & CO. (Wgflliam P. TVz'Zsz‘ac7z Q» C/mrles Scott), saddlery hardware, 38 N 3d
    ## Wilston John (c), nightwork, 917 S 9th
    ## Wilt Abraham, carter, 915 Coates
    ## Wilt Abraham, gentleman, 416 N 3d
    ## Wilt Alpheus, sash 3: doors, 425 N Front, h 1114 Columbia av
    ## Wilt Charles, blacksmith, N 40th 11 Lancaster av
    ## Wilt Charles, flour & feed store, 1306 South
    ## Wilt Conrad, butcher, stall 33 Kater Market, h N W Wharton & Church
    ## Wilt George, carter, 1135 Brown
    ## Wilt George A., despatcher, Reading av & Richmond, h 1114 E Columbia av
    ## Wilt Henry, tinsmith, 888 N 2d, h 1007 Olive
    ## Wilt Jacob, cloak manuf., 230 Crown
    ## Wilt Jacob, shoemaker, 819 St John
    ## Wilt Jacob J., shipjoiner, 1037 Sarah
    ## Wilt James A., dealer in fancy goods, 230 Crown
    ## Wilt James G., machinist, Innes ab Allen
    ## Wilt John F., clerk, 528 N 2d, h 1114 Columbia av
    ## Wilt Joseph, chandler, 2327 Coates
    ## Wilt Joseph L., sheetiron worker, Lancaster av,
    ## Wilt Paul, heaters, 425 York av
    ## Wilt William, laborer, r 2325 Coates
    ## Wilt William, livery stable, 914 Brown, h 10th 13 Brown
    ## Wilt William, contractor, 719 N 10th
    ## Wilt William, laborer, Gordon n Cedar
    ## Wiltbank Daffy, washerw., 3 Price’s ct
    ## Wiltbank Elizabeth, widow John, 1105 Arch
    ## Wiltbank Elizabeth M., widow, 1521 Locust
    ## Wiltbank Samuel P. , broker, 1807 Delancey pl
    ## Wiltbank W. White, 1521 Locust
    ## Wiltberger A., druggist, 233 N 2d, h 329 N 5th
    ## Wiltberger D. S., com. mer., 220 Chestnut, h 329 N 5th _
    ## Wiltberger Harry A., accountant, Market n 40th
    ## Wiltberger I. P.. clerk, 309 Branch
    ## Wiltberger Jacob H., hardware, 225 N 2d, h 711 Wallace
    ## Wiltberger Richard, tavern, 119 Callowhill
    ## Wiltberger Theodore M., Market n 40th
    ## Wiltberger Theodore P., clerk, Market n 40th
    ## \Vilter George, weaver, S E Dauphin & Amber
    ## Wilthew Charlton, puddler, 1368 Beach
    ## VVimer Albert, clerk, 1224 S 6th
    ## Wimer Annie M., dressmaker, 34 N 8th
    ## Wimer Augustus, beamer, 13 Cresson, Myk
    ## Wimer Daniel C., carver, 1402 Mervine
    ## Wimer Elizabeth B., dry goods, 1511 Callowhill
    ## Wimer Hannah, wid. Thomas, 1041 Buttonwood
    ## Wimer John, gauger & cooper, 232 N Broad, h 1511 Callowhill
    ## Wimer John A., sexton, 210 Bache
    ## \Vimer John C., cooper, 34 N 8th
    ## W'imer Joseph, collector, 1224 S 6th
    ## Wimer Margaret, widow Andrew, 720 S 3d
    ## \Vimer Wesley P., cooper, 1511 Callowhill
    ## Wimer William W., bookkeeper P R R 13th & Market, h 1805 Callowhill
    ## Wimley George H., ship chandler, 512 & 514 S Del av, h 244 Crown
    ## Wimley John, shoemaker, r 303 Brown
    ## Wimley William, baker, 244 Crown
    ## \Vimpfheimer Augustus, salesman, 400 Callowhill
    ## Wimpfheimer Caroline, widow Abraham, hair dresses & silk nets, 402 N 2d
    ## Wimpfheimer David, manuf. vinegar,_431 N 3d
    ## W'impfheimer Jacob, leather, 318 New
    ## Wimpfheimer Jacob & Co. (Jacob lVi-mpflzeimcr), importer, 400 Callowhill
    ## Wimpfheimer Joseph, jeweller, 310 N 3d
    ## Wimpfheimer Maxwell, bookkeeper, 431 N 3d, h 469 N 4th
    ## Wims Mary S., widow George, Dauphin E Carroll
    ## Winans Elihu M., tinsmith, 2044 Ridge av
    ## Winans George, painter, 2044 Ridge av
    ## Winans Randolph, printer, 2044 Ridge av
    ## Winberg William H., gentleman, 1428 Marshall
    ## Winberger Charles, fringes, 120 Coates
    ## WINCH ALDEN, newspaper ag’t, 320 Chestnut, h Arch ab 13th
    ## Winch C., spike ma.nuf., Beach ab Warren
    ## Winchell William E., sailmaker, 7 Grover
    ## Winchester Augustus, gents’ furnishing goods, 706 Chestnut, h 734 S 9th
    ## Winchester & C0. (Augustus Wizzcizestcr .5, Wm. S. Marti72.), gents’ fur-’g store, 706 Chestnut
    ## Winchester James, weaver, Hope bel Putnam
    ## Winchester John, carpenter, Ridge av, Rox
    ## Winchester John, weaver, 1612 Philip
    ## Winchester John, weaver, 135 Thompson
    ## Winchester John, grocer, 301 Thompson
    ## Winchester Margaret, wid Robert, 324 Dean
    ## Winchester Robert, machinist, 135 Thompson
    ## W'inchester Samuel, merchant, 236 Market, h 258 S 10th
    ## Winchester William, weaver, 135 Thompson
    ## Winchester William W., bookkeeper, 307 Branch, h 2101 Oxford
    ## Windel Hannah, teacher, N 41st 11 Market
    ## Vllinder Ernest, carpenter, 1124 Sophia
    ## Winder Frederick, tailor, 1157 Passyunk rd
    ## Winder Harman, hotel, 926 N Front
    ## Winder John, driver, Daniel pl
    ## Winder John B., gentleman, Herman, Gtn
    ## Winder Joseph, hotel, 76 Frankford
    ## Winder Robert, carman, 906 N 12th
    ## Winder Sebastian, shoemaker, Ne1son’s ct
    ## Winder W. H., mer. 314-}; Walnut, h 415 S 15th
    ## Winderly Charles, shoemaker, York n Trenton av
    ## Winderoth Wyant, shoemaker, Champion pl
    ## Winderstein Frederick, shoemaker, r 1213 Apple
    ## Windevender David, shipjoiner, 1021 Ross
    ## Windish Frederick, tailor, 1129 Charlotte
    ## Vllindle Benjamin, file manuf. r 70 N 2d
    ## Windle George, salesm. 633 Market, h 1210 S 10th
    ## Windle William, superintendent, 1210 S 10th
    ## Windlerwin Julius, bootfitter, 1225 N 2d
    ## Windles Richard, carpenter, Oxford n Hedge
    ## Windner John, brickmaker, 138 Diamond
    ## Windorf Christian, dealer, r 832 Carpenter
    ## Windorf Frank, dealer, r 832 Carpenter
    ## Windrim James H., architect, 1518 Sansom
    ## Winebaker Wilhelmina, wid Charles, 320 Willow
    ## Wineberg Samuel, beef butcher, stalls 10 cl: 30 Girard av Market, h 944 St John
    ## Winebrener (K: Co. (Harry C. Wiiwbrevzer @Freclerick L. Pleis), coal dealers 3d & Thompson
    ## Winebrener David S., hardware, 49 N 3d, h 1627 Vine
    ## Winebrener David, merchant, 241 S 18th
    ## Winebrener Harry C., coal dealer, 3d 6: Thompson, h 241 S 18th
    ## Wineburg John H., tanner, 535 N Front
    ## Winegar Francis, cabinetmaker, 117 W'alnut, h 235 Shippen
    ## Winegardener John, barkeeper, 9th cl: Arch, h 5th & Master
    ## Winegar-dner Adam, laborer, r Hope 11 Canal
    ## Winegar-dner Andreas, tailor, 1723 N 3d
    ## Winegartner Anton, gentleman, 1409 Randolph
    ## Winehold Benjamin, driver, 1214 S 4th
    ## Winemore John IL, salesman, 16 S 2d, h 1110 S 2d
    ## Winfiller Andreas, butcher, 1410 Franklin
    ## Winfield Charles, shipjoiner, 120 China
    ## Window Shades and Curtain Goods, \‘Vholcsalc and Retail;
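The same steps can also be wrapped in a function that takes a string, which makes them easy to try on a small sample before touching the real file. This is a sketch rather than the answer's own code: `unwrap_records` is a hypothetical name, and the sample is adapted from two entries in the directory, with line breaks and one hyphen wrap inserted by hand for illustration.

```python
import re


def unwrap_records(text: str) -> list:
    """Split on blank CRLF lines, then undo line wrapping inside each record."""
    records = re.split(r'\r\n\r\n', text)
    cleaned = []
    for record in records:
        # Rejoin words that were hyphenated across a line break
        record = re.sub(r'-[ \t]*\r\n', '', record)
        # Any remaining single line break becomes a space
        record = record.replace('\r\n', ' ')
        cleaned.append(record)
    return cleaned


# Hand-made sample: the wraps below are assumptions, not copied from the file
sample = (
    'Wimer John, gauger & cooper, 232 N Broad, h 1511\r\n'
    'Callowhill\r\n\r\n'
    'Wimpfheimer Caroline, widow Abraham, hair dres-\r\n'
    'ses & silk nets, 402 N 2d'
)
for record in unwrap_records(sample):
    print(record)
```

Note that dropping every hyphen at a line end also merges words whose hyphen is genuine, so the result may still need a manual pass.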
Use the strip() method of str. With no arguments, it removes all kinds of leading and trailing whitespace.
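A small illustration of what strip() does and does not remove, using a made-up sample string: it trims whitespace, including '\r' and '\n', from both ends, but leaves line breaks inside the string alone.

```python
# Made-up sample: leading spaces, one embedded CRLF, one trailing CRLF
line = '  Wimer John, gauger & cooper,\r\n232 N Broad, h 1511 Callowhill\r\n'

stripped = line.strip()
print(repr(stripped))

# The leading spaces and trailing '\r\n' are gone; the embedded '\r\n' remains
print('\r\n' in stripped)
```

This may be why strip() appeared to do nothing in the question: the breaks that matter sit inside the record, not at its ends.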
Maybe split on \r\n?
    >>> file = open("1128.txt")
    >>> required_stuff = file.read().split('\r\n')
    >>> print required_stuff[:10]
    ['VVIL', '', '1076', '', 'VVIN', '', ' ', '', 'WILSTACH WILLIAM P. & CO. (Wgflliam P.', "TVz'Zsz\xe2\x80\x98ac7z Q\xc2\xbb C/mrles Scott), saddlery hardware,"]
    >>> file.close()
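Splitting on '\r\n' alone gives one element per physical line, with an empty string wherever a blank line separated two records, as the list above shows. A possible follow-up, sketched here on an abbreviated sample, is to group the lines between empty strings back into records. `group_lines` is a hypothetical helper, not part of the answer.

```python
def group_lines(lines):
    """Join consecutive non-blank lines; a blank line ends the current record."""
    records, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            records.append(' '.join(current))
            current = []
    if current:
        # Flush the last record if the input did not end with a blank line
        records.append(' '.join(current))
    return records


# Abbreviated sample in the same shape as the file (not the real contents)
text = ('VVIL\r\n\r\n'
        'WILSTACH WILLIAM P. & CO.\r\nsaddlery hardware, 38 N 3d\r\n\r\n'
        'Wilt Abraham, carter, 915 Coates')
lines = text.split('\r\n')
records = group_lines(lines)
print(records)
```

Unlike splitting on '\r\n\r\n', this also copes with separator lines that contain only spaces.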