Class StringExtractor

  • Direct Known Subclasses:
    WPStringExtractor

    public class StringExtractor
    extends Object
    StringExtractor uses a set of heuristics to extract as much human-readable text as possible from a binary stream. This is useful for binary document formats (e.g., MS Office files) that often or always store the document text as ASCII characters intermixed with binary parts. When such a document cannot be parsed with the appropriate library (e.g., Apache POI), a StringExtractor may still produce some meaningful content and can thus serve as a fallback.

    The output of StringExtractor is suited to text indexing but less so to human consumption: any formatting will most likely be lost, and some unwanted characters will inevitably slip through.
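
    The kind of heuristic described above can be illustrated with a minimal sketch. It is not the actual StringExtractor implementation, whose rules are not documented here. It assumes one simple rule in the spirit of the Unix strings tool: keep runs of printable ASCII bytes that reach a minimum length and drop everything else. The class name PrintableRunSketch, the method extractPrintableRuns and the threshold MIN_RUN_LENGTH are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PrintableRunSketch {

    // Hypothetical threshold: shorter runs are treated as binary noise.
    public static final int MIN_RUN_LENGTH = 4;

    /**
     * Collects runs of printable ASCII bytes (0x20..0x7E) that are at least
     * MIN_RUN_LENGTH long and joins them with single spaces.
     * The stream is read to its end but not closed.
     */
    public static String extractPrintableRuns(InputStream stream) throws IOException {
        StringBuilder result = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = stream.read()) != -1) {
            if (b >= 0x20 && b <= 0x7E) {
                run.append((char) b);   // printable byte: extend the current run
            } else {
                flush(run, result);     // binary byte: the current run ends here
            }
        }
        flush(run, result);             // handle a run that reaches end of stream
        return result.toString();
    }

    // Appends the run to the result if it is long enough, then resets it.
    private static void flush(StringBuilder run, StringBuilder result) {
        if (run.length() >= MIN_RUN_LENGTH) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(run);
        }
        run.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {
            0x00, 0x01, 'H', 'e', 'l', 'l', 'o', 0x02,
            'a', 'b', (byte) 0xFF, 'w', 'o', 'r', 'l', 'd', 0x00
        };
        // Prints "Hello world"; the two-byte run "ab" is below the threshold.
        System.out.println(extractPrintableRuns(new ByteArrayInputStream(data)));
    }
}
```

    Even this simple rule shows why the output suits indexing better than reading: layout is discarded, and any binary sequence that happens to look like text passes the filter.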

    • Field Detail

      • COMMON_FONT_NAMES

        public static final String[] COMMON_FONT_NAMES
    • Constructor Detail

      • StringExtractor

        public StringExtractor()
    • Method Detail

      • extract

        public String extract(InputStream stream)
                       throws IOException
        Extract all human-readable text from an InputStream.
        Parameters:
        stream - The InputStream to read the bytes from. The stream will be fully consumed but not closed.
        Returns:
        The resulting, heuristically determined text. A String is always returned, although it can be empty.
        Throws:
        IOException - If reading bytes from the InputStream fails.
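
        The fallback role described in the class overview could be wired up roughly as follows. This is a sketch under stated assumptions: the TextSource interface and the extractWithFallback helper are hypothetical and not part of this API. TextSource mirrors the documented signature of extract(InputStream), so in real use the fallback argument could plausibly be new StringExtractor()::extract. Because extract consumes the stream without closing it, the caller closes it here with try-with-resources, and the document is held as a byte array so that it can be read a second time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class FallbackExtractionSketch {

    /** Hypothetical abstraction with the same shape as extract(InputStream). */
    @FunctionalInterface
    public interface TextSource {
        String extract(InputStream stream) throws IOException;
    }

    /**
     * Tries the primary parser first. If it throws, the same bytes are read
     * again by the fallback. Each attempt gets a fresh stream, which the
     * caller (this method) closes.
     */
    public static String extractWithFallback(byte[] document,
                                             TextSource primary,
                                             TextSource fallback) throws IOException {
        try (InputStream in = new ByteArrayInputStream(document)) {
            return primary.extract(in);
        } catch (IOException | RuntimeException e) {
            // The primary parser could not handle the document; fall through.
        }
        try (InputStream in = new ByteArrayInputStream(document)) {
            return fallback.extract(in);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "plain bytes".getBytes(StandardCharsets.US_ASCII);

        // Stand-in for a format-specific parser that rejects the document.
        TextSource failingParser = in -> {
            throw new IOException("unsupported format");
        };
        // Stand-in for the heuristic extractor (readAllBytes needs Java 9+).
        TextSource heuristic = in -> new String(in.readAllBytes(), StandardCharsets.US_ASCII);

        // Prints "plain bytes", produced by the fallback.
        System.out.println(extractWithFallback(document, failingParser, heuristic));
    }
}
```

        Since a String is always returned, a caller that wants to tell "nothing recovered" apart from real content would check the result with isEmpty() rather than for null.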