Spark encoding error with UTF-8 text in an HDInsight cluster: 'ascii' codec can't encode characters in position: ordinal not in range(128)


Working with a Hebrew-character UTF-8 TSV file using Spark on a Linux HDInsight cluster, I get an encoding error. Any suggestions?

Here is my PySpark notebook code:

from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2])))

# Infer the schema and create a table
transactionsTable = sqlContext.createDataFrame(transactions)
transactionsTable.registerTempTable("transactionsTempTable")

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)

for name in names.collect():
  print(name)

The error:

'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)

Hebrew text file contents:

id  name    age
1   גיא     37
2   maor    32
3   danny   55

When I try the same thing with an English file, it works fine:

English text file contents:

id  name    age
1   guy     37
2   maor    32
3   danny   55

Output:

name: guy
name: maor
name: danny

1 Answer

If you run the following code with the Hebrew text:

from pyspark.sql import *

path = "/people.txt"
transactionsText = sc.textFile(path)

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.split("\t"))

transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))

transactions.collect()

you will notice that the names come back as a list of unicode objects:

[Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'),
 Row(id=u'2', name=u'maor', age=u'32'),
 Row(id=u'3', name=u'danny', age=u'55')]

Now we create a DataFrame from the transactions RDD and register it as a temporary table:

table_name = "transactionsTempTable"

# Infer the schema and create a table       
transactionsDf = sqlContext.createDataFrame(transactions)
transactionsDf.registerTempTable(table_name)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM {}".format(table_name))

results.collect()

You will notice that all the strings in the PySpark DataFrame returned by sqlContext.sql(...) are of Python unicode type:

[Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]
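
This is the crux of the original problem: in Python 2, str() and any implicit mixing of str and unicode fall back to the ascii codec, which cannot represent Hebrew characters. A minimal sketch of the failure mode (assuming a Python 2 interpreter, as the unicode builtin above implies):

# str() implicitly encodes unicode with the ascii codec in Python 2
name = u'\u05d2\u05d9\u05d0'  # the Hebrew name returned by the query above

unicode(name)                 # safe: stays a unicode object
try:
    str(name)                 # attempts an implicit ascii encode
except UnicodeEncodeError as e:
    print(e)                  # 'ascii' codec can't encode characters in
                              # position 0-2: ordinal not in range(128)

This is why replacing str(...) with unicode(...) in the parsing step avoids the error.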

Running:

%%sql
SELECT * FROM transactionsTempTable

yields the expected result:

name: גיא
name: maor
name: danny

Note that if you want to do any processing on these names, you should work with them as unicode strings. From this article:

When you’re dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode strings as they abstract characters in a manner that’s appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc, you should be dealing with byte str as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters.
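
As a short sketch of that rule in Python 2 (the local file names here are hypothetical, not part of the original question):

# Decode bytes to unicode at the input boundary, manipulate text as unicode,
# and encode back to bytes only at the output boundary (Python 2).
with open('people_he.txt', 'rb') as f:   # hypothetical local copy of the TSV
    raw = f.read()                       # byte str, as stored on disk
text = raw.decode('utf-8')               # unicode for text manipulation

for line in text.splitlines()[1:]:       # skip the header row
    name = line.split(u'\t')[1]
    print(u'name has {} characters'.format(len(name)))  # correct character count

with open('names_out.txt', 'wb') as f:   # hypothetical output file
    f.write((name + u'\n').encode('utf-8'))  # back to bytes for the disk

len() on a byte str would instead count UTF-8 bytes, which is why Hebrew names look "longer" than they are when handled as bytes.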
