How do I count POS tags with PySpark and NLTK?

Posted 2024-09-30 01:32:11


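I have some text, or a large file, and I need to count the number of POS tags in it using NLTK and PySpark. I couldn't find a way to import the text file, so I tried a short string instead, but that failed.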

The counting step needs to be done in PySpark.

##textfile = sc.textfile('') 
##or 
##textstring = """This is just a bunch of words to use for this example.  John gave ##them to me last night but Kim took them to work.  Hi Stacy.  ###'''URL:http://example.com'''"""

tstring = sc.parallelize(List(textstring)).collect()

TOKEN_RE = re.compile(r"\b[\w']+\b")

dropURL=text.filter(lambda x: "URL" not in x)

words = dropURL.flatMap(lambda dropURL: dropURL.split(" "))

nltkwords = words.flatMap(lambda words: nltk.tag.pos_tag(nltk.regexp_tokenize(words, TOKEN_RE)))
#word_counts =nltkwords.map(lambda nltkwords: (ntlkwords,1))


nltkwords.take(50)

1 answer
User
Answer 1 · posted 2024-09-30 01:32:11

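Here is an example using a test string. I think you just missed the step of splitting the string on spaces. Without it, the filter sees the whole line as a single element and drops all of it, because "URL" appears in that line.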

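import re

import nltk
from pyspark import SparkContext

# Assumption: reuse the running SparkContext (the pyspark shell already provides `sc`).
sc = SparkContext.getOrCreate()

textstring = """This is just a bunch of words to use for this example.  John gave ##them to me last night but Kim took them to work.  Hi Stacy.  ###'''URL:http://example.com'''"""

TOKEN_RE = re.compile(r"\b[\w']+\b")
# Split on spaces first so that each RDD element is a single space-separated piece.
text = sc.parallelize(textstring.split(' '))
# Drop only the pieces that contain "URL", not the whole string.
dropURL = text.filter(lambda x: "URL" not in x)

words = dropURL.flatMap(lambda piece: piece.split(" "))

# Tokenize and tag each piece; every piece is tagged on its own, without sentence context.
nltkwords = words.flatMap(lambda word: nltk.tag.pos_tag(nltk.regexp_tokenize(word, TOKEN_RE)))

nltkwords.collect()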
# [('This', 'DT'), ('is', 'VBZ'), ('just', 'RB'), ('a', 'DT'), ('bunch', 'NN'), ('of', 'IN'), ('words', 'NNS'), ('to', 'TO'), ('use', 'NN'), ('for', 'IN'), ('this', 'DT'), ('example', 'NN'), ('John', 'NNP'), ('gave', 'VBD'), ('them', 'PRP'), ('to', 'TO'), ('me', 'PRP'), ('last', 'JJ'), ('night', 'NN'), ('but', 'CC'), ('Kim', 'NNP'), ('took', 'VBD'), ('them', 'PRP'), ('to', 'TO'), ('work', 'NN'), ('Hi', 'NN'), ('Stacy', 'NN')]
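
To see why the split matters, here is a minimal sketch in plain Python. The lists only stand in for RDD elements (an assumption for illustration, no Spark needed), and the shortened sample line is made up for the example.

```python
# Plain Python lists stand in for RDD elements here (illustration only).
line = "Hi Stacy. URL:http://example.com"

# One element holding the whole line: the filter rejects all of it.
whole_line = [line]
kept_whole = [x for x in whole_line if "URL" not in x]

# One element per space-separated piece: only the URL piece is rejected.
pieces = line.split(" ")
kept_pieces = [x for x in pieces if "URL" not in x]

print(kept_whole)   # []
print(kept_pieces)  # ['Hi', 'Stacy.']
```

The same logic applies to `filter` on an RDD: it keeps or drops whole elements, so what counts as an element decides how much text is lost.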

To count how many times each POS tag occurs, you can use reduceByKey:

word_counts = nltkwords.map(lambda x: (x[1], 1)).reduceByKey(lambda x, y: x + y)

word_counts.collect()
# [('NNS', 1), ('TO', 3), ('CC', 1), ('DT', 3), ('JJ', 1), ('VBZ', 1), ('RB', 1), ('NN', 7), ('VBD', 2), ('PRP', 3), ('IN', 2), ('NNP', 2)]
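
The question also mentions reading from a text file, which the example above does not cover. One possible approach is sketched below; it is not the only way to do it. The names `tag_counts` and `count_tags_in_file` are hypothetical helpers introduced here. Note that the SparkContext method is spelled `textFile`, not `textfile`. Unlike the code above, this sketch tags all tokens of a line together and uses `re.findall` in place of `nltk.regexp_tokenize`, so the resulting tags may differ from the output shown earlier.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b[\w']+\b")


def tag_counts(lines, tagger):
    """Count POS tags over an iterable of text lines.

    `tagger` is any callable mapping a list of tokens to (token, tag)
    pairs, for example nltk.pos_tag.
    """
    counts = Counter()
    for line in lines:
        # Drop URL-bearing pieces before tokenizing, as in the answer above.
        pieces = [p for p in line.split(" ") if "URL" not in p]
        tokens = TOKEN_RE.findall(" ".join(pieces))
        counts.update(tag for _, tag in tagger(tokens))
    return counts


def count_tags_in_file(sc, path):
    """Hypothetical wrapper: read `path` with Spark and count POS tags."""
    import nltk  # imported here so the helper above works without NLTK

    # Each partition produces local (tag, count) pairs ...
    per_partition = sc.textFile(path).mapPartitions(
        lambda lines: tag_counts(lines, nltk.pos_tag).items()
    )
    # ... which reduceByKey then sums across partitions.
    return dict(per_partition.reduceByKey(lambda a, b: a + b).collect())
```

A call might look like `count_tags_in_file(sc, "path/to/file.txt")`, where the path is a placeholder. NLTK and its tagger model need to be available on every worker node for this to run.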
