
UTF-8 byte order mark causes file reading errors in Java

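I'm trying to read CSV files using Java. Some of the files may have a byte order mark (BOM) at the beginning, but not all of them. When it is present, the byte order mark is read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?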
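
The symptom described above can be reproduced without any file on disk. The following is a minimal sketch, assuming a hypothetical two-line CSV sample held in memory; the names `BomProblemDemo`, `firstLine` and `sampleWithBom` are illustrative and not part of the original question. It relies on the fact that Java's UTF-8 decoder passes a leading BOM through as the character U+FEFF instead of dropping it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomProblemDemo {
    /** Decodes the bytes as UTF-8 and returns the first line, as a plain reader sees it. */
    static String firstLine(byte[] data) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical sample: UTF-8 BOM (EF BB BF) followed by a small CSV. */
    static byte[] sampleWithBom() {
        byte[] text = "id,name\n1,Ada\n".getBytes(StandardCharsets.UTF_8);
        byte[] data = new byte[text.length + 3];
        data[0] = (byte) 0xEF;
        data[1] = (byte) 0xBB;
        data[2] = (byte) 0xBF;
        System.arraycopy(text, 0, data, 3, text.length);
        return data;
    }

    public static void main(String[] args) {
        String header = firstLine(sampleWithBom());
        // The header looks like "id,name" when printed, but carries an invisible first char.
        System.out.println("equals \"id,name\": " + "id,name".equals(header));
        System.out.println("starts with U+FEFF: " + (header.charAt(0) == '\uFEFF'));
        System.out.println("length: " + header.length());
    }
}
```

The comparison fails because the decoded header is one character longer than the expected string, even though the two look identical when printed.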


6 answers

  1. # Answer 1

    To simply remove the BOM from your file, I recommend using Apache Commons IO:

    public BOMInputStream(InputStream delegate,
                  boolean include)
    Constructs a new BOM InputStream that detects a ByteOrderMark.UTF_8 and optionally includes it.
    Parameters:
    delegate - the InputStream to delegate to
    include - true to include the UTF-8 BOM or false to exclude it
    

    Set include to false and the BOM will be excluded from the stream.

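  2. # Answer 2

    EDIT: I've published a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream


    Here is a class I wrote a while ago; I only edited the package name before pasting it. Nothing special, it is quite similar to the solutions posted in Sun's bug database. Add it to your code and you're fine.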

    /* ____________________________________________________________________________
     * 
     * File:    UnicodeBOMInputStream.java
     * Author:  Gregory Pakosz.
     * Date:    02 - November - 2005    
     * ____________________________________________________________________________
     */
    package com.stackoverflow.answer;
    
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;
    
    /**
     * The <code>UnicodeBOMInputStream</code> class wraps any
     * <code>InputStream</code> and detects the presence of any Unicode BOM
     * (Byte Order Mark) at its beginning, as defined by
     * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
     * 
     * <p>The
     * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
     * defines 5 types of BOMs:<ul>
     * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
     * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
     * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
     * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
     * <li><pre>EF BB BF     = UTF-8</pre></li>
     * </ul></p>
     * 
     * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
     * or not.
     * </p>
     * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
     * wrapped <code>InputStream</code> object.</p>
     */
    public class UnicodeBOMInputStream extends InputStream
    {
      /**
       * Type safe enumeration class that describes the different types of Unicode
       * BOMs.
       */
      public static final class BOM
      {
        /**
         * NONE.
         */
        public static final BOM NONE = new BOM(new byte[]{},"NONE");
    
        /**
         * UTF-8 BOM (EF BB BF).
         */
        public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                           (byte)0xBB,
                                                           (byte)0xBF},
                                                "UTF-8");
    
        /**
         * UTF-16, little-endian (FF FE).
         */
        public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                                (byte)0xFE},
                                                    "UTF-16 little-endian");
    
        /**
         * UTF-16, big-endian (FE FF).
         */
        public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                                (byte)0xFF},
                                                    "UTF-16 big-endian");
    
        /**
         * UTF-32, little-endian (FF FE 00 00).
         */
        public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                                (byte)0xFE,
                                                                (byte)0x00,
                                                                (byte)0x00},
                                                    "UTF-32 little-endian");
    
        /**
         * UTF-32, big-endian (00 00 FE FF).
         */
        public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                                (byte)0x00,
                                                                (byte)0xFE,
                                                                (byte)0xFF},
                                                    "UTF-32 big-endian");
    
        /**
         * Returns a <code>String</code> representation of this <code>BOM</code>
         * value.
         */
        public final String toString()
        {
          return description;
        }
    
        /**
         * Returns the bytes corresponding to this <code>BOM</code> value.
         */
        public final byte[] getBytes()
        {
          final int     length = bytes.length;
          final byte[]  result = new byte[length];
    
          // Make a defensive copy
          System.arraycopy(bytes,0,result,0,length);
    
          return result;
        }
    
        private BOM(final byte bom[], final String description)
        {
          assert(bom != null)               : "invalid BOM: null is not allowed";
          assert(description != null)       : "invalid description: null is not allowed";
          assert(description.length() != 0) : "invalid description: empty string is not allowed";
    
          this.bytes          = bom;
          this.description  = description;
        }
    
                final byte    bytes[];
        private final String  description;
    
      } // BOM
    
      /**
       * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
       * specified <code>InputStream</code>.
       * 
       * @param inputStream an <code>InputStream</code>.
       * 
       * @throws NullPointerException when <code>inputStream</code> is
       * <code>null</code>.
       * @throws IOException on reading from the specified <code>InputStream</code>
       * when trying to detect the Unicode BOM.
       */
      public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                          IOException
    
      {
        if (inputStream == null)
          throw new NullPointerException("invalid input stream: null is not allowed");
    
        in = new PushbackInputStream(inputStream,4);
    
        final byte  bom[] = new byte[4];
        final int   read  = in.read(bom);
    
        switch(read)
        {
          case 4:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE) &&
                (bom[2] == (byte)0x00) &&
                (bom[3] == (byte)0x00))
            {
              this.bom = BOM.UTF_32_LE;
              break;
            }
            else
            if ((bom[0] == (byte)0x00) &&
                (bom[1] == (byte)0x00) &&
                (bom[2] == (byte)0xFE) &&
                (bom[3] == (byte)0xFF))
            {
              this.bom = BOM.UTF_32_BE;
              break;
            }
    
          case 3:
            if ((bom[0] == (byte)0xEF) &&
                (bom[1] == (byte)0xBB) &&
                (bom[2] == (byte)0xBF))
            {
              this.bom = BOM.UTF_8;
              break;
            }
    
          case 2:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE))
            {
              this.bom = BOM.UTF_16_LE;
              break;
            }
            else
            if ((bom[0] == (byte)0xFE) &&
                (bom[1] == (byte)0xFF))
            {
              this.bom = BOM.UTF_16_BE;
              break;
            }
    
          default:
            this.bom = BOM.NONE;
            break;
        }
    
        if (read > 0)
          in.unread(bom,0,read);
      }
    
      /**
       * Returns the <code>BOM</code> that was detected in the wrapped
       * <code>InputStream</code> object.
       * 
       * @return a <code>BOM</code> value.
       */
      public final BOM getBOM()
      {
        // BOM type is immutable.
        return bom;
      }
    
      /**
       * Skips the <code>BOM</code> that was found in the wrapped
       * <code>InputStream</code> object.
       * 
       * @return this <code>UnicodeBOMInputStream</code>.
       * 
       * @throws IOException when trying to skip the BOM from the wrapped
       * <code>InputStream</code> object.
       */
      public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
      {
        if (!skipped)
        {
          in.skip(bom.bytes.length);
          skipped = true;
        }
        return this;
      }
    
      /**
       * {@inheritDoc}
       */
      public int read() throws IOException
      {
        return in.read();
      }
    
      /**
       * {@inheritDoc}
       */
      public int read(final byte b[]) throws  IOException,
                                              NullPointerException
      {
        return in.read(b,0,b.length);
      }
    
      /**
       * {@inheritDoc}
       */
      public int read(final byte b[],
                      final int off,
                      final int len) throws IOException,
                                            NullPointerException
      {
        return in.read(b,off,len);
      }
    
      /**
       * {@inheritDoc}
       */
      public long skip(final long n) throws IOException
      {
        return in.skip(n);
      }
    
      /**
       * {@inheritDoc}
       */
      public int available() throws IOException
      {
        return in.available();
      }
    
      /**
       * {@inheritDoc}
       */
      public void close() throws IOException
      {
        in.close();
      }
    
      /**
       * {@inheritDoc}
       */
      public synchronized void mark(final int readlimit)
      {
        in.mark(readlimit);
      }
    
      /**
       * {@inheritDoc}
       */
      public synchronized void reset() throws IOException
      {
        in.reset();
      }
    
      /**
       * {@inheritDoc}
       */
      public boolean markSupported() 
      {
        return in.markSupported();
      }
    
      private final PushbackInputStream in;
      private final BOM                 bom;
      private       boolean             skipped = false;
    
    } // UnicodeBOMInputStream
    

    And you use it like this:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    
    public final class UnicodeBOMInputStreamUsage
    {
      public static void main(final String[] args) throws Exception
      {
        FileInputStream fis = new FileInputStream("test/offending_bom.txt");
        UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);
    
        System.out.println("detected BOM: " + ubis.getBOM());
    
        System.out.print("Reading the content of the file without skipping the BOM: ");
        InputStreamReader isr = new InputStreamReader(ubis);
        BufferedReader br = new BufferedReader(isr);
    
        System.out.println(br.readLine());
    
        br.close();
        isr.close();
        ubis.close();
        fis.close();
    
        fis = new FileInputStream("test/offending_bom.txt");
        ubis = new UnicodeBOMInputStream(fis);
        isr = new InputStreamReader(ubis);
        br = new BufferedReader(isr);
    
        ubis.skipBOM();
    
        System.out.print("Reading the content of the file after skipping the BOM: ");
        System.out.println(br.readLine());
    
        br.close();
        isr.close();
        ubis.close();
        fis.close();
      }
    
    } // UnicodeBOMInputStreamUsage
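
If only the UTF-8 BOM matters, as in the question, the same peek-and-unread idea can be condensed considerably. This is a sketch rather than a replacement for the class above: `Utf8BomStripper`, `strip` and `decode` are made-up names, and the code deliberately ignores UTF-16 and UTF-32 marks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8BomStripper {
    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if there is one. */
    static InputStream strip(InputStream source) throws IOException {
        PushbackInputStream in = new PushbackInputStream(source, 3);
        byte[] head = new byte[3];
        int n = 0;
        // A single read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!hasBom && n > 0) {
            in.unread(head, 0, n); // not a BOM: push the bytes back untouched
        }
        return in;
    }

    /** Demo helper: strips the BOM from an in-memory array and decodes the rest as UTF-8. */
    static String decode(byte[] data) {
        try (InputStream in = strip(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        System.out.println("decoded equals \"id\": " + decode(withBom).equals("id"));
    }
}
```

For a real file, the result of `strip(new FileInputStream(...))` would be wrapped in an `InputStreamReader` with the UTF-8 charset, in the same way as in the usage class above.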
    
  3. # Answer 3

    The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc)

    BOMInputStream bomIn = new BOMInputStream(in);
    int firstNonBOMByte = bomIn.read(); // Skips BOM
    if (bomIn.hasBOM()) {
        // has a UTF-8 BOM
    }
    

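    If you also need to detect different encodings, it can distinguish between various byte order marks, e.g. UTF-8 vs. UTF-16 big and little endian; see the javadoc mentioned above for details. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There is probably a more streamlined way to do this if you need all of this functionality, maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect which encoding some bytes are in, but if the stream starts with a BOM, that obviously helps.

    EDIT: If you need to detect the BOM in UTF-16, UTF-32, etc., the constructor should be: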

    new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
            ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)
    

    Upvote @martin charlesworth's comment :)

  4. # Answer 4

    A simpler solution:

    import java.io.IOException;
    import java.io.Reader;
    
    public class BOMSkipper
    {
        public static void skip(Reader reader) throws IOException
        {
            reader.mark(1);
            char[] possibleBOM = new char[1];
            reader.read(possibleBOM);
    
            if (possibleBOM[0] != '\ufeff')
            {
                reader.reset();
            }
        }
    }
    

    Usage example:

    BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
    BOMSkipper.skip(input);
    //Now UTF prefix not present:
    input.readLine();
    ...
    

    It works with all 5 UTF encodings.
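
The reason this approach does not depend on the encoding is that the check happens after decoding, where the BOM always shows up as the single character U+FEFF. Below is a small self-contained sketch that exercises the same mark/reset idea on in-memory data. `BomSkipDemo`, `skipBom` and `firstLine` are illustrative names, and the sample only covers the three charsets available in `StandardCharsets` that keep the BOM as a character, so it is not an exhaustive check of the claim above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSkipDemo {
    /** Consumes one leading U+FEFF if present; the reader must support mark/reset. */
    static void skipBom(Reader reader) throws IOException {
        reader.mark(1);
        if (reader.read() != '\uFEFF') {
            reader.reset(); // first char was real data (or end of stream): rewind
        }
    }

    /** Decodes the bytes with the given charset, skips a BOM, and returns the first line. */
    static String firstLine(byte[] data, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            skipBom(reader);
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            // Encoding U+FEFF with each charset yields that charset's BOM bytes.
            byte[] withBom = "\uFEFFid,name\n1,Ada\n".getBytes(cs);
            System.out.println(cs + " header matches: "
                    + "id,name".equals(firstLine(withBom, cs)));
        }
    }
}
```

The reader passed to `skipBom` has to support `mark`, which is why a `BufferedReader` is used here, as in the usage example above.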

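  5. # Answer 5

    The Google Data API has a UnicodeReader which automatically detects the encoding.

    You can use it instead of InputStreamReader. Here is a slightly compacted extract of its source, which is fairly straightforward: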

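    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.PushbackInputStream;
    import java.io.Reader;
    
    public class UnicodeReader extends Reader {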
        private static final int BOM_SIZE = 4;
        private final InputStreamReader reader;
    
        /**
         * Construct UnicodeReader
         * @param in Input stream.
         * @param defaultEncoding Default encoding to be used if BOM is not found,
         * or <code>null</code> to use system default encoding.
         * @throws IOException If an I/O error occurs.
         */
        public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
            byte bom[] = new byte[BOM_SIZE];
            String encoding;
            int unread;
            PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
            int n = pushbackStream.read(bom, 0, bom.length);
    
            // Read ahead four bytes and check for BOM marks.
            if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
                encoding = "UTF-8";
                unread = n - 3;
            } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
                encoding = "UTF-32BE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
                // Must be tested before UTF-16LE, whose BOM (FF FE) is a prefix of this one.
                encoding = "UTF-32LE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
                encoding = "UTF-16BE";
                unread = n - 2;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
                encoding = "UTF-16LE";
                unread = n - 2;
            } else {
                encoding = defaultEncoding;
                unread = n;
            }
    
            // Unread bytes if necessary and skip BOM marks.
            if (unread > 0) {
                pushbackStream.unread(bom, (n - unread), unread);
            } else if (unread < -1) {
                pushbackStream.unread(bom, 0, 0);
            }
    
            // Use given encoding.
            if (encoding == null) {
                reader = new InputStreamReader(pushbackStream);
            } else {
                reader = new InputStreamReader(pushbackStream, encoding);
            }
        }
    
        public String getEncoding() {
            return reader.getEncoding();
        }
    
        public int read(char[] cbuf, int off, int len) throws IOException {
            return reader.read(cbuf, off, len);
        }
    
        public void close() throws IOException {
            reader.close();
        }
    }
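
The detection step on its own, reading ahead up to four bytes and mapping them to an encoding name, can be sketched as a small table lookup. This is an illustrative reduction, not code from the Google Data API: `BomSniffer` and `detect` are hypothetical names. The table lists the 4-byte marks before the 2-byte ones because FF FE (UTF-16LE) is a prefix of FF FE 00 00 (UTF-32LE).

```java
final class BomSniffer {
    // Longest marks first, so that UTF-32LE is not mistaken for UTF-16LE.
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };

    /** Returns the encoding name implied by a leading BOM, or null if there is none. */
    static String detect(byte[] head) {
        for (int i = 0; i < BOMS.length; i++) {
            if (startsWith(head, BOMS[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int[] bom) {
        if (data.length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            // Mask to compare the signed byte as an unsigned value.
            if ((data[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        byte[] plain = {'a', 'b'};
        System.out.println("with UTF-8 BOM: " + detect(utf8));
        System.out.println("without BOM: " + detect(plain));
    }
}
```

One caveat of any such lookup: a UTF-16LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by the first four bytes alone.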
    
  6. # Answer 6

    @rescdsk already mentioned the BOMInputStream from the Apache Commons IO library, but I did not see it mention how to get an InputStream without the BOM.

    Here is how I did it in Scala:

     import java.io._
     import org.apache.commons.io.input.BOMInputStream
     val file = new File(path_to_xml_file_with_BOM)
     val fileInpStream = new FileInputStream(file)   
     val bomIn = new BOMInputStream(fileInpStream, 
             false); // false means don't include BOM