Spark encoding error with UTF-8 text in an HDInsight cluster: 'ascii' codec can't encode characters in position: ordinal not in range(128)


Working with a Hebrew-character UTF-8 TSV file using Spark on a Linux HDInsight cluster, I get an encoding error. Any suggestions?

Here is my PySpark notebook code:

from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2])))

# Infer the schema and create a table
transactionsTable = sqlContext.createDataFrame(transactions)
transactionsTable.registerTempTable("transactionsTempTable")

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)

for name in names.collect():
  print(name)

The error:

'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)

Hebrew text file contents:

id  name    age
1   גיא     37
2   maor    32
3   danny   55

When I try the same thing with an English file, it works fine:

English text file contents:

id  name    age
1   guy     37
2   maor    32
3   danny   55

Output:

name: guy
name: maor
name: danny

1 Answer

If you run the following code with the Hebrew text:

from pyspark.sql import *

path = "/people.txt"
transactionsText = sc.textFile(path)

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.split("\t"))

transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))

transactions.collect()

you will notice that the names come back as a list of unicode objects:

[Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'),
 Row(id=u'2', name=u'maor', age=u'32'),
 Row(id=u'3', name=u'danny', age=u'55')]

Now we create a DataFrame from the transactions RDD and register it as a temporary table:

table_name = "transactionsTempTable"

# Infer the schema and create a table       
transactionsDf = sqlContext.createDataFrame(transactions)
transactionsDf.registerTempTable(table_name)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM {}".format(table_name))

results.collect()

You will notice that all the strings in the PySpark DataFrame returned by sqlContext.sql(...) are of Python unicode type:

[Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]
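
This is the crux of the original problem: in Python 2, str() and any implicit mixing of str and unicode fall back to the ascii codec, which cannot represent Hebrew characters. A minimal sketch of the failure mode (assuming a Python 2 interpreter, as the unicode builtin above implies):

# str() implicitly encodes unicode with the ascii codec in Python 2
name = u'\u05d2\u05d9\u05d0'  # the Hebrew name returned by the query above

unicode(name)                 # safe: stays a unicode object
try:
    str(name)                 # attempts an implicit ascii encode
except UnicodeEncodeError as e:
    print(e)                  # 'ascii' codec can't encode characters in
                              # position 0-2: ordinal not in range(128)

This is why replacing str(...) with unicode(...) in the parsing step avoids the error.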

Running:

%%sql
SELECT * FROM transactionsTempTable

yields the expected result:

name: גיא
name: maor
name: danny

Note that if you want to do any processing on these names, you should work with them as unicode strings. From this article:

When you’re dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode strings as they abstract characters in a manner that’s appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc, you should be dealing with byte str as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters.
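
As a short sketch of that rule in Python 2 (the local file names here are hypothetical, not part of the original question):

# Decode bytes to unicode at the input boundary, manipulate text as unicode,
# and encode back to bytes only at the output boundary (Python 2).
with open('people_he.txt', 'rb') as f:   # hypothetical local copy of the TSV
    raw = f.read()                       # byte str, as stored on disk
text = raw.decode('utf-8')               # unicode for text manipulation

for line in text.splitlines()[1:]:       # skip the header row
    name = line.split(u'\t')[1]
    print(u'name has {} characters'.format(len(name)))  # correct character count

with open('names_out.txt', 'wb') as f:   # hypothetical output file
    f.write((name + u'\n').encode('utf-8'))  # back to bytes for the disk

len() on a byte str would instead count UTF-8 bytes, which is why Hebrew names look "longer" than they are when handled as bytes.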
