Class StringExtractor
- java.lang.Object
  - com.logicaldoc.core.parser.wordperfect.StringExtractor

Direct Known Subclasses:
WPStringExtractor
public class StringExtractor extends Object
StringExtractor uses a set of heuristics to extract as much human-readable text as possible from a binary stream. This is useful for binary document formats that often or always store the document text as ASCII characters intermixed with binary parts (e.g. MS Office files). When such a document cannot be parsed with the appropriate library (e.g. Apache POI), a StringExtractor may still produce some meaningful content and can thus serve as a fallback.

The output of StringExtractor is suited to text indexing but less so to human consumption: any formatting will most likely be lost, and some unwanted characters slipping through cannot be prevented.
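The heuristics themselves are not documented on this page. Purely as an illustration of the general idea, the sketch below keeps runs of printable ASCII bytes that reach a minimum length and drops everything else. The class name `PrintableRunSketch`, the threshold `MIN_RUN_LENGTH` and the single-space separator are assumptions made for this example; they are not taken from StringExtractor and the real class may work quite differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration only; not the LogicalDOC implementation. */
final class PrintableRunSketch {

    /** Assumed threshold: shorter printable runs are treated as binary noise. */
    private static final int MIN_RUN_LENGTH = 4;

    /** Reads the stream to its end (without closing it) and returns the kept runs, space-separated. */
    static String extract(InputStream stream) throws IOException {
        StringBuilder result = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = stream.read()) != -1) {
            if (b >= 0x20 && b <= 0x7E) {
                // Printable ASCII: extend the current run.
                run.append((char) b);
            } else {
                // Any other byte ends the current run.
                flush(run, result);
            }
        }
        flush(run, result);
        return result.toString();
    }

    /** Appends the run to the result if it is long enough, then resets it. */
    private static void flush(StringBuilder run, StringBuilder result) {
        if (run.length() >= MIN_RUN_LENGTH) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(run);
        }
        run.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x00, 0x01, 'H', 'e', 'l', 'l', 'o', (byte) 0xFF, 'a', 'b', 0x02, 'w', 'o', 'r', 'l', 'd'};
        System.out.println(extract(new ByteArrayInputStream(data))); // prints "Hello world"
    }
}
```

In this sketch the two-byte run `ab` is dropped as noise. A short threshold lets more binary debris through, while a long one loses short words, which is the kind of trade-off that makes such output better suited to indexing than to reading.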
Field Summary
Modifier and Type    Field                Description
static String[]      COMMON_FONT_NAMES
Constructor Summary
Constructor          Description
StringExtractor()
Method Summary
Modifier and Type    Method                         Description
String               extract(InputStream stream)    Extract all human-readable text from an InputStream.
Field Detail
COMMON_FONT_NAMES
public static final String[] COMMON_FONT_NAMES
Method Detail
extract
public String extract(InputStream stream) throws IOException
Extract all human-readable text from an InputStream.
Parameters:
stream - The InputStream to read the bytes from. The stream will be fully consumed but not closed.
Returns:
The resulting, heuristically determined text. A String is always returned, although it can be empty.
Throws:
IOException - If reading from the InputStream fails.
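Because extract consumes the stream without closing it, closing remains the caller's job. A minimal sketch of one way to handle that with try-with-resources follows. The interface `TextExtractor`, the class `ExtractUsageSketch` and the helper `extractAndClose` are hypothetical names introduced only so the example compiles without the library; with LogicalDOC on the classpath, `new StringExtractor()::extract` could presumably be passed in place of the stand-in lambda.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical usage sketch; the names here are not part of the documented API. */
final class ExtractUsageSketch {

    /** Stand-in with the same shape as StringExtractor.extract(InputStream). */
    interface TextExtractor {
        String extract(InputStream stream) throws IOException;
    }

    /** Runs the extractor, then closes the stream whether extraction succeeded or failed. */
    static String extractAndClose(TextExtractor extractor, InputStream source) throws IOException {
        try (InputStream in = source) {
            return extractor.extract(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in extractor that simply drains the stream as ASCII text.
        TextExtractor drain = in -> new String(in.readAllBytes(), StandardCharsets.US_ASCII);
        InputStream source = new ByteArrayInputStream("sample".getBytes(StandardCharsets.US_ASCII));
        System.out.println(extractAndClose(drain, source)); // prints "sample"
    }
}
```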