How do I get unigrams (single words) from a list in Python?

2 Answers

User

Answer 1 · edited 2024-09-28 17:25:52

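If you need to remove elements from a list based on a condition, you can use filter() or a list comprehension.

You already have the right idea for detecting non-unigram words: " " in word.

Basically, if you want to build the new list with a for loop, you can write something like this: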

new_list = []
for word in words:
    if " " in word:  # contains a space, so it is not a unigram
        new_list.append(word)

Thanks to Python's syntax, this can be written more simply:

new_list = [word for word in words if " " in word]

Or:

new_list = list(filter(lambda word: " " in word, words))

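Both return the list of non-unigram words, as described in the question title (even though the example in the question returns the unigram words...).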
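To make the two directions concrete, here is a minimal sketch that runs both forms on a sample list (the list matches the one used in the other answer; the names non_unigrams, non_unigrams_filtered and unigrams are only illustrative). Changing `in` to `not in` keeps the unigrams instead:

```python
words = ['water vapor', 'evaporation', 'carbon dioxide', 'sunlight', 'green plants']

# Keep the entries that contain a space (the non-unigrams).
non_unigrams = [word for word in words if " " in word]

# Same result with filter(); list() is needed in Python 3 because filter() is lazy.
non_unigrams_filtered = list(filter(lambda word: " " in word, words))

# Invert the test to keep the single-word entries instead.
unigrams = [word for word in words if " " not in word]

print(non_unigrams)  # ['water vapor', 'carbon dioxide', 'green plants']
print(unigrams)      # ['evaporation', 'sunlight']
```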

User

Answer 2 · edited 2024-09-28 17:25:52

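Aren't the strings that contain only one word, such as 'evaporation' and 'sunlight', the unigrams? It seems to me that you want to keep the unigrams, not remove them.

You can do that with a list comprehension: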

list1 = ['water vapor','evaporation','carbon dioxide','sunlight','green plants']
unigrams = [word for word in list1 if ' ' not in word]

>>> print(unigrams)
['evaporation', 'sunlight']

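This assumes that words are separated by one or more spaces. For n-grams with n > 1 that may be too simplistic, because other whitespace characters can separate words as well, e.g. tabs, newlines, various Unicode whitespace code points, and so on. You can use a regular expression instead: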

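import re

unigram_pattern = re.compile(r'^\S+$')
unigrams = [word for word in list1 if unigram_pattern.match(word)]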

The pattern ^\S+$ matches only strings that consist entirely of non-whitespace characters, from the start of the string to its end.
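As a small illustration of why the whitespace-aware pattern matters, the sketch below compares the plain space test with the regex on a few made-up strings (the sample data and the names space_check and regex_check are assumptions for this example). Entries separated by a tab or a newline slip past the space test but not past the pattern:

```python
import re

# Made-up sample data: one real unigram plus three multi-word entries
# separated by a space, a tab and a newline respectively.
samples = ['sunlight', 'water vapor', 'carbon\tdioxide', 'green\nplants']
unigram_pattern = re.compile(r'^\S+$')

# The plain space test only notices the literal space character.
space_check = [word for word in samples if ' ' not in word]

# The regex rejects any entry containing a whitespace character.
regex_check = [word for word in samples if unigram_pattern.match(word)]

print(space_check)  # ['sunlight', 'carbon\tdioxide', 'green\nplants']
print(regex_check)  # ['sunlight']
```

One caveat: $ also matches just before a trailing newline, so 'sunlight\n' would still pass this pattern. If that matters, re.fullmatch(r'\S+', word) is stricter.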

If you need to support Unicode whitespace, you can specify the unicode flag when compiling the pattern:

list1.extend([u'punctuation\u2008space', u'NO-BREAK\u00a0SPACE'])
unigram_pattern = re.compile(r'^\S+$', re.UNICODE)
unigrams = [word for word in list1 if unigram_pattern.match(word)]

>>> print(unigrams)
['evaporation', 'sunlight']

Now it also filters out strings that contain Unicode whitespace, such as the no-break space (U+00A0) and the punctuation space (U+2008).
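On Python 3, patterns applied to str are Unicode-aware by default, so the flag is optional there. The following is a minimal sketch under that assumption; it uses re.fullmatch instead of the anchored pattern, and the name ascii_only is made up for the comparison. Passing re.ASCII shows what happens when Unicode whitespace is not recognized:

```python
import re

list1 = ['water vapor', 'evaporation', 'punctuation\u2008space',
         'NO-BREAK\u00a0SPACE', 'sunlight']

# Default (Unicode) matching in Python 3: U+2008 and U+00A0 count as whitespace.
unigrams = [word for word in list1 if re.fullmatch(r'\S+', word)]

# With re.ASCII only [ \t\n\r\f\v] count as whitespace, so the two
# Unicode-separated entries are wrongly kept as "unigrams".
ascii_only = [word for word in list1 if re.fullmatch(r'\S+', word, re.ASCII)]

print(unigrams)         # ['evaporation', 'sunlight']
print(len(ascii_only))  # 4
```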
