Removing single quotes while keeping apostrophes (Python, NLTK)

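import nltk

raw = open('file_name.txt', 'r').read()
output = open('output_filename.csv', 'w')
txt = raw.lower()

pattern = r'''(?x)
    ([A-Z]\.)+           # abbreviations, e.g. U.S.A.
  | \w+(-\w+)*           # words with optional internal hyphens
  | \.\.\.               # ellipsis
  | [][.,;"'?():-_`]     # punctuation marks as separate tokens
'''
tokenized = nltk.regexp_tokenize(txt, pattern)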

txt = txt.replace("\n", " ")  # formats the text so that a line break counts as a space
txt = txt.replace("”", " ")   # replaces stray quotation marks with a space
txt = txt.replace("“", " ")   # replaces stray quotation marks with a space
txt = txt.replace(" ’", " ")  # replaces a right-leaning apostrophe with a space if it follows a space (which now includes line breaks)
txt = txt.replace(" ‘", " ")  # replaces a left-leaning apostrophe with a space if it follows a space

1 answer

User

#1 · posted 2024-06-01 12:46:46

Rather than replacing punctuation marks one by one, split the text on spaces and then strip punctuation from the beginning and end of each word:

>>> import string
>>> phrase = "'This has punctuation, and it's hard to remove!'"
>>> [word.strip(string.punctuation) for word in phrase.split(" ")]
['This', 'has', 'punctuation', 'and', "it's", 'hard', 'to', 'remove']

This keeps apostrophes and hyphens inside words, while removing any punctuation at the beginning or end of a word.
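One caveat: `string.punctuation` contains only ASCII characters, so the curly quotes from the question (“ ” ‘ ’) are not stripped by the snippet above. The following is a minimal sketch, assuming you want those treated as edge punctuation as well. The name `EDGE_CHARS` and the sample sentence are made up for illustration.

```python
import string

# ASCII punctuation plus the curly quotes mentioned in the question.
# EDGE_CHARS is a hypothetical name, not part of any library.
EDGE_CHARS = string.punctuation + "“”‘’"

phrase = "“It’s a well-known fact,” she said. ‘Really?’"

# split() with no argument also splits on newlines and collapses runs of spaces
words = [word.strip(EDGE_CHARS) for word in phrase.split()]
print(words)
# ['It’s', 'a', 'well-known', 'fact', 'she', 'said', 'Really']
```

Because `strip` only touches the ends of each word, the curly apostrophe inside "It’s" and the hyphen inside "well-known" survive.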

Note that standalone punctuation is replaced by the empty string "":

>>> phrase = "This is - no doubt - punctuated"
>>> [word.strip(string.punctuation) for word in phrase.split(" ")]
['This', 'is', '', 'no', 'doubt', '', 'punctuated']

These are easy to filter out, because the empty string evaluates to False:

filtered = [f for f in txt if f and f.lower() not in stopwords]
                            # ^ excludes empty string
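Putting the pieces together, the split, strip and filter steps might be wrapped up roughly as follows. This is only a sketch: the function name `clean_tokens` and the tiny `STOPWORDS` set are placeholders, since the post does not show how `stopwords` was built (NLTK's stopword corpus would be one option).

```python
import string

# Placeholder stopword set for illustration only; not the asker's actual list.
STOPWORDS = {"is", "no", "the", "a"}

def clean_tokens(text, stopwords=STOPWORDS):
    """Split on whitespace, strip edge punctuation, drop empties and stopwords."""
    stripped = (word.strip(string.punctuation) for word in text.split())
    # `if w` discards the empty strings left behind by standalone punctuation
    return [w for w in stripped if w and w.lower() not in stopwords]

print(clean_tokens("This is - no doubt - punctuated"))
# ['This', 'doubt', 'punctuated']
```

Stopwords are compared in lowercase here so that the original casing of the kept words is preserved.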
