Java Hadoop: reduce output file is never created for large data

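I am writing an application in Java on Hadoop 1.1.1 (Ubuntu) that compares strings to find the longest common substring. I have successfully run both the map and reduce phases on small data sets. Whenever I increase the size of the input, my reduce output never appears in my target output directory. It does not complain at all, which makes this all the stranger. I am running everything in Eclipse, and I have one mapper and one reducer.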

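My reducer finds the longest common substring in a collection of strings, then emits that substring as the key and the index of each string that contains it as a value. Here is a short example.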

Input data

0: ALPHAA

1: ALPHAB

2: ALZHA

Emitted output

Key: ALPHA  Value: 0

Key: ALPHA  Value: 1

Key: AL  Value: 0

Key: AL  Value: 1

Key: AL  Value: 2

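The first two input strings share "ALPHA" as a common substring, while all three input strings share "AL". I end up indexing the substrings and writing them to a database once the process completes.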
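To make the example concrete, here is a minimal sketch of finding the longest common substring of two strings with the standard dynamic-programming table. The reducer code itself is not shown in the question, so this is only an illustration of the idea, not the actual implementation; the class and method names (`LcsSketch`, `longestCommonSubstring`) are hypothetical.

```java
class LcsSketch {
    // Returns one longest common substring of a and b (the first one found
    // when several share the maximum length). Hypothetical helper.
    static String longestCommonSubstring(String a, String b) {
        // dp[i][j] = length of the longest common suffix of a[0..i) and b[0..j)
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        int best = 0;    // length of the best match so far
        int endInA = 0;  // exclusive end index of that match in a
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    if (dp[i][j] > best) {
                        best = dp[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return a.substring(endInA - best, endInA);
    }

    public static void main(String[] args) {
        // Strings 0 and 1 from the example above
        System.out.println(longestCommonSubstring("ALPHAA", "ALPHAB")); // prints ALPHA
    }
}
```

The table costs O(|a| * |b|) time and memory per pair, which may matter once the input grows to millions of records.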

One other observation: I can see intermediate files being created in my output directory, but the reduced data never ends up in the output file.

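I have pasted the Hadoop output log below. It claims that the reducer produced many output records, yet they seem to have disappeared. Any suggestions would be appreciated.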

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool     for the same.
No job jar file set.  User classes may not be found. See JobConf(Class) or     JobConf#setJar(String).
Total input paths to process : 1
Running job: job_local_0001
setsid exited with exit code 0
 Using ResourceCalculatorPlugin :     org.apache.hadoop.util.LinuxResourceCalculatorPlugin@411fd5a3
Snappy native library not loaded
io.sort.mb = 100
data buffer = 79691776/99614720
record buffer = 262144/327680
 map 0% reduce 0%
Spilling map output: record full = true
bufstart = 0; bufend = 22852573; bufvoid = 99614720
kvstart = 0; kvend = 262144; length = 327680
Finished spill 0
Starting flush of map output
Finished spill 1
Merging 2 sorted segments
Down to the last merge-pass, with 2 segments left of total size: 28981648 bytes

Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

Task attempt_local_0001_m_000000_0 done.
 Using ResourceCalculatorPlugin :     org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3aff2f16

Merging 1 sorted segments
Down to the last merge-pass, with 1 segments left of total size: 28981646 bytes

 map 100% reduce 0%
reduce > reduce
 map 100% reduce 66%
reduce > reduce
 map 100% reduce 67%
reduce > reduce
reduce > reduce
 map 100% reduce 68%
reduce > reduce
reduce > reduce
reduce > reduce
 map 100% reduce 69%
reduce > reduce
reduce > reduce
 map 100% reduce 70%
reduce > reduce
job_local_0001
Job complete: job_local_0001
Counters: 22
  File Output Format Counters 
    Bytes Written=14754916
  FileSystemCounters
    FILE_BYTES_READ=61475617
    HDFS_BYTES_READ=97361881
    FILE_BYTES_WRITTEN=116018418
    HDFS_BYTES_WRITTEN=116746326
  File Input Format Counters 
    Bytes Read=46366176
  Map-Reduce Framework
    Reduce input groups=27774
    Map output materialized bytes=28981650
    Combine output records=0
    Map input records=4629524
    Reduce shuffle bytes=0
    Physical memory (bytes) snapshot=0
    Reduce output records=832559
    Spilled Records=651304
    Map output bytes=28289481
    CPU time spent (ms)=0
    Total committed heap usage (bytes)=2578972672
    Virtual memory (bytes) snapshot=0
    Combine input records=0
    Map output records=325652
    SPLIT_RAW_BYTES=136
    Reduce input records=27774
reduce > reduce
reduce > reduce

1 answer

  1. Answer #1

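    I put the reduce() and map() logic inside a try-catch block, where the catch block increments a counter whose group is "Exception" and whose name is the exception message. This gives me a quick way (by looking at the counter list) to see which exceptions, if any, were thrown.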
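The pattern described in this answer can be sketched as follows. To keep the example runnable without a Hadoop installation, the counters are replaced by a plain in-memory map; `ExceptionCounterSketch`, `reduceLogic`, and the "empty value" failure are all hypothetical stand-ins. In a real job the `increment` call would presumably be replaced by `context.getCounter("Exception", name).increment(1)` (new `mapreduce` API) or `reporter.incrCounter("Exception", name, 1)` (old `mapred` API).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

class ExceptionCounterSketch {
    // Stand-in for Hadoop counters: group name -> counter name -> count.
    static final Map<String, Map<String, Long>> COUNTERS = new TreeMap<>();

    static void increment(String group, String name) {
        COUNTERS.computeIfAbsent(group, g -> new TreeMap<>()).merge(name, 1L, Long::sum);
    }

    static long get(String group, String name) {
        return COUNTERS.getOrDefault(group, Collections.<String, Long>emptyMap())
                       .getOrDefault(name, 0L);
    }

    // Hypothetical reduce body that fails on an empty value,
    // simulating a bug that only shows up on larger inputs.
    static void reduceLogic(String key, Iterable<String> values) {
        for (String v : values) {
            if (v.isEmpty()) {
                throw new IllegalArgumentException("empty value");
            }
            // real work on (key, v) would go here
        }
    }

    // Wrapper: exceptions are counted by message instead of being lost.
    static void reduce(String key, Iterable<String> values) {
        try {
            reduceLogic(key, values);
        } catch (Exception e) {
            // Fall back to the class name when the exception has no message.
            String name = e.getMessage() != null ? e.getMessage() : e.getClass().getSimpleName();
            increment("Exception", name);
        }
    }

    public static void main(String[] args) {
        reduce("ALPHA", Arrays.asList("0", "1"));
        reduce("AL", Arrays.asList("0", "", "2"));
        System.out.println(COUNTERS); // lists every exception message with its count
    }
}
```

With this in place, any group named "Exception" in the job's counter dump at the end of the log points directly at the failing code path.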