How can I condense my text-cleaning steps into a single Python function?

import string
import pandas as pd

data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech@NYCityScape"]

df = pd.DataFrame(data, columns = ['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# applying remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)

# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')

# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')

# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()

# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]

print(df)

                     co_name co_name_transform
0            West Georgia Co        westgeorgi
1    W.B. Carell Clockmakers        wbcarellcl
2     Spine & Orthopedic LLC        spineortho
3  LRHS Saint Jose's Grocery        lrhsstjose
4       Optitech@NYCityScape        optitechny

3 Answers

User

Reply 1 · Edited 2024-09-28 17:19:33

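Another solution, similar to the previous one, but it keeps the substitutions in a "to_replace" dictionary so you can add more items to replace. Also, the previous solution does not limit the output to the first 10 rows.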

import pandas as pd

data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery"] + ["Optitech@NYCityScape"] * 9

df = pd.DataFrame(data, columns = ['co_name'])

# regex pattern -> replacement
to_replace = {'[^A-Za-z0-9-]+': '', 'Saint': 'st'}

cleaned = df['co_name']
for pattern in to_replace:
    # regex=True is needed in pandas 2.0+, where str.replace matches literally by default
    cleaned = cleaned.str.replace(pattern, to_replace[pattern], regex=True)

# lowercase once, after all substitutions, so that 'Saint' can still match
cleaned = cleaned.str.lower()

# first 10 rows (this slices rows; use .str[0:10] for the first 10 characters)
print(cleaned[0:10])

Result:

0          westgeorgiaco
1    wbcarellclockmakers
2     spineorthopedicllc
3     lrhsstjosesgrocery
4    optitechnycityscape
5    optitechnycityscape
6    optitechnycityscape
7    optitechnycityscape
8    optitechnycityscape
9    optitechnycityscape
Name: co_name, dtype: object

The previous solution (not limited to the first 10 rows):

df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]

Result:

0     westgeorgi
1     wbcarellcl
2     spineortho
3     lrhsstjose
4     optitechny
5     optitechny
6     optitechny
7     optitechny
8     optitechny
9     optitechny
10    optitechny
11    optitechny
12    optitechny
Name: co_name_transform, dtype: object
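
Since the question asks for a single function, the dictionary-driven approach could be wrapped up roughly as follows. This is only a sketch: the names clean_names, substitutions and length are made up for illustration, and it assumes every dictionary key is meant to be a regular expression.

```python
import pandas as pd


def clean_names(names, substitutions, length=10):
    """Return a cleaned copy of a Series of company names.

    `substitutions` maps a regex pattern to its replacement. Entries are
    applied in insertion order, before lowercasing.
    """
    cleaned = names
    for pattern, replacement in substitutions.items():
        cleaned = cleaned.str.replace(pattern, replacement, regex=True)
    # Lowercase once at the end, then keep the first `length` characters.
    return cleaned.str.lower().str[:length]


df = pd.DataFrame({"co_name": ["West Georgia Co", "LRHS Saint Jose's Grocery"]})

# Standardize 'Saint', then strip everything except letters, digits and hyphens.
substitutions = {"Saint": "st", "[^A-Za-z0-9-]+": ""}
df["co_name_transform"] = clean_names(df["co_name"], substitutions)
print(df)
```

Lowercasing after the loop matters here: if each pass lowercased the text, a later case-sensitive pattern such as 'Saint' would no longer match.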

User

Reply 2 · Edited 2024-09-28 17:19:33

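No function is needed for this. Try the following one-liner.

# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]

The final output is: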

                     co_name co_name_transform
0            West Georgia Co        westgeorgi
1    W.B. Carell Clockmakers        wbcarellcl
2     Spine & Orthopedic LLC        spineortho
3  LRHS Saint Jose's Grocery        lrhsstjose
4       Optitech@NYCityScape        optitechny

User

Reply 3 · Edited 2024-09-28 17:19:33

You can perform all of the steps inside a function passed to the apply method:

import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
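
If a named function is preferred over a lambda, the same steps might be written like this. It is a minimal sketch: clean_name, NON_ALNUM and the length parameter are illustrative names, not part of the original answer.

```python
import re

# Matches runs of non-word characters and underscores (punctuation, whitespace).
NON_ALNUM = re.compile(r"[\W_]+")


def clean_name(text, length=10):
    """Standardize 'Saint', strip punctuation and whitespace, lowercase, truncate."""
    text = text.replace("Saint", "st")
    text = NON_ALNUM.sub("", text)
    return text.lower()[:length]


names = ["W.B. Carell Clockmakers", "LRHS Saint Jose's Grocery"]
print([clean_name(name) for name in names])
```

With the DataFrame from the question it would be applied as df['co_name'].apply(clean_name).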
