All Known Implementing Classes: AbiWordParser, AbstractParser, CatchAllParser, DOCParser, DummyParser, EpubParser, HTMLParser, KOfficeParser, MarkdownParser, OpenOfficeParser, PDFParser, PPTParser, RarParser, RTFParser, SevenZipParser, TarParser, TXTParser, XLSParser, XMLParser, ZABWParser, ZipParser

public interface Parser

A Parser parses content in order to extract the text and other metadata it contains. Concrete implementations of this interface are designed to process specific file types.
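As an illustration of how an implementation for one file type might look, here is a minimal sketch. It does not use the real library types: `SketchParser`, `SketchParsingException` and `PlainTextSketchParser` are hypothetical stand-ins, only the two five-argument parse methods are reproduced, and expressing the File variant as a default method that delegates to the stream variant is an assumption made for brevity, not necessarily how the actual implementations are organised.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for the library's ParsingException.
class SketchParsingException extends Exception {
    SketchParsingException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical, reduced version of the Parser interface.
interface SketchParser {
    String parse(InputStream input, String filename, String encoding, Locale locale, String tenant)
            throws SketchParsingException;

    // File variant: opens the file and delegates to the stream variant (an assumption of this sketch).
    default String parse(File file, String filename, String encoding, Locale locale, String tenant)
            throws SketchParsingException {
        try (InputStream input = new FileInputStream(file)) {
            return parse(input, filename, encoding, locale, tenant);
        } catch (IOException e) {
            throw new SketchParsingException("Cannot read " + file, e);
        }
    }
}

// A parser for one specific file type: plain text decoded with the given encoding.
class PlainTextSketchParser implements SketchParser {
    @Override
    public String parse(InputStream input, String filename, String encoding, Locale locale, String tenant)
            throws SketchParsingException {
        try {
            // Fall back to UTF-8 when no encoding is supplied (an assumption of this sketch).
            Charset charset = encoding != null ? Charset.forName(encoding) : StandardCharsets.UTF_8;
            return new String(input.readAllBytes(), charset);
        } catch (IOException | IllegalArgumentException e) {
            // Charset.forName reports unknown or malformed names via IllegalArgumentException subclasses.
            throw new SketchParsingException("Cannot parse " + filename, e);
        }
    }
}
```

With these stand-ins, a caller holding a file would invoke `new PlainTextSketchParser().parse(file, "notes.txt", "UTF-8", Locale.ENGLISH, "default")`, while a caller holding a stream would pass the stream as the first argument instead.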

Author: Michael Scholz

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int | countPages(File file, String filename) | Same as countPages(InputStream, String), but use this when you have a file rather than a stream. |
| int | countPages(InputStream input, String filename) | Counts the number of pages of the given binary document. |
| String | parse(File file, String filename, String encoding, Locale locale, String tenant) | Same as parse(InputStream, String, String, Locale, String), but use this when you have a file rather than a stream. |
| String | parse(File file, String filename, String encoding, Locale locale, String tenant, Document document, String fileVersion) | Same as parse(InputStream, ParseParameters), but use this when you have a file rather than a stream. |
| String | parse(InputStream input, ParseParameters parameterObject) | Extracts the text content of the given binary document. |
| String | parse(InputStream input, String filename, String encoding, Locale locale, String tenant) | Extracts the text content of the given binary document. |

Method Details
- parse
  
  String parse(File file, String filename, String encoding, Locale locale, String tenant) throws ParsingException
  
  Same as parse(InputStream, String, String, Locale, String), but use this when you have a file rather than a stream.
  
  Parameters:
  
  file - the file
  
  filename - name of the file
  
  encoding - character encoding
  
  locale - the locale
  
  tenant - name of the tenant
  
  Returns:
  
  the extracted text
  
  Throws:
  
  ParsingException - if an error occurs during parsing
- parse
  
  String parse(File file, String filename, String encoding, Locale locale, String tenant, Document document, String fileVersion) throws ParsingException
  
  Same as parse(InputStream, ParseParameters), but use this when you have a file rather than a stream.
  
  Parameters:
  
  file - the file
  
  filename - name of the file
  
  encoding - character encoding
  
  locale - the locale
  
  tenant - name of the tenant
  
  document - the document the file belongs to (optional)
  
  fileVersion - the file version being processed (optional)
  
  Returns:
  
  the extracted text
  
  Throws:
  
  ParsingException - if an error occurs during parsing
- parse
  
  String parse(InputStream input, ParseParameters parameterObject) throws ParsingException
  
  Extracts the text content of the given binary document. The content type and character encoding (if available and applicable) are given as arguments.
  
  The implementation can choose either to read and parse the given document immediately or to return a reader that does it incrementally. The only constraint is that the implementation must close the given stream, at the latest, when the returned reader is closed. The caller, on the other hand, is responsible for closing the returned reader.
  
  The implementation should throw an exception only on transient errors, i.e. when it can expect to extract the text content of the same binary successfully at another time. An effort should be made to recover from syntax errors and other similar problems.
  
  This method should be thread-safe: it may be invoked simultaneously by different threads to extract the text content of different documents. The returned reader, on the other hand, does not need to be thread-safe.
  
  Parsing must complete within the number of seconds specified by the parser.timeout configuration property.
  
  If a timeout occurs, the parser.timeout.retain configuration property determines whether the text extracted so far is retained.
  
  Parameters:
  
  input - binary content from which to extract the text
  
  parameterObject - the parameters
  
  Returns:
  
  the extracted text
  
  Throws:
  
  ParsingException - if an error occurs during parsing
- parse
  
  String parse(InputStream input, String filename, String encoding, Locale locale, String tenant) throws ParsingException
  
  Extracts the text content of the given binary document. The content type and character encoding (if available and applicable) are given as arguments.
  
  The implementation can choose either to read and parse the given document immediately or to return a reader that does it incrementally. The only constraint is that the implementation must close the given stream, at the latest, when the returned reader is closed. The caller, on the other hand, is responsible for closing the returned reader.
  
  The implementation should throw an exception only on transient errors, i.e. when it can expect to extract the text content of the same binary successfully at another time. An effort should be made to recover from syntax errors and other similar problems.
  
  This method should be thread-safe: it may be invoked simultaneously by different threads to extract the text content of different documents. The returned reader, on the other hand, does not need to be thread-safe.
  
  Parsing must complete within the number of seconds specified by the parser.timeout configuration property.
  
  If a timeout occurs, the parser.timeout.retain configuration property determines whether the text extracted so far is retained.
  
  Parameters:
  
  input - binary content from which to extract the text
  
  filename - name of the file
  
  encoding - character encoding
  
  locale - the locale
  
  tenant - name of the tenant
  
  Returns:
  
  the extracted text
  
  Throws:
  
  ParsingException - if an error occurs during parsing
- countPages
  
  int countPages(InputStream input, String filename)
  
  Counts the number of pages of the given binary document.
  
  Parameters:
  
  input - binary content of the document whose pages are to be counted
  
  filename - name of the file
  
  Returns:
  
  the number of pages
- countPages
  
  int countPages(File file, String filename)
  
  Same as countPages(InputStream, String), but use this when you have a file rather than a stream.
  
  Parameters:
  
  file - the file
  
  filename - name of the file
  
  Returns:
  
  the number of pages
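
The timeout behaviour described for the parse methods (parser.timeout and parser.timeout.retain) can be pictured with the following sketch. It is only an illustration under stated assumptions, not the library's implementation: `TimeoutParseSketch` and `parseWithTimeout` are hypothetical names, the "document" is a list of text chunks with an artificial pause per chunk, and the timeout is passed in milliseconds (rather than read in seconds from configuration) so that the example runs quickly.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutParseSketch {

    // Extracts text chunk by chunk into a shared buffer so that partial output
    // is still available if the time limit is reached.
    // pauseMillis simulates slow extraction of each chunk.
    static String parseWithTimeout(List<String> chunks, long pauseMillis,
                                   long timeoutMillis, boolean retainOnTimeout) {
        StringBuffer extracted = new StringBuffer(); // synchronized, shared with the worker thread
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> task = executor.submit(() -> {
            for (String chunk : chunks) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // stop extracting once cancelled
                }
                extracted.append(chunk);
            }
        });
        try {
            task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return extracted.toString(); // finished within the limit
        } catch (TimeoutException e) {
            task.cancel(true); // interrupt the worker
            // Mirrors the retain switch: keep or drop what was extracted so far.
            return retainOnTimeout ? extracted.toString() : "";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            task.cancel(true);
            return "";
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Running the extraction on a separate thread and collecting output incrementally is what makes a "retain" option possible at all: a parser that only produces its result at the very end has nothing to keep when the limit is hit.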

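The thread-safety requirement stated for the parse methods is easiest to meet when an implementation keeps no mutable state between calls. The sketch below shows that idea with hypothetical names (`ConcurrentParseSketch`, `extract`, `parseAll`); the "parsing" is reduced to decoding UTF-8 bytes and trimming whitespace, which is purely an assumption for the sake of a short example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentParseSketch {

    // Stateless extraction: all working data lives in local variables,
    // so simultaneous calls on different documents cannot interfere.
    static String extract(byte[] content) {
        return new String(content, StandardCharsets.UTF_8).trim();
    }

    // Extracts several documents in parallel and returns the texts in input order.
    static List<String> parseAll(List<byte[]> documents, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (byte[] document : documents) {
                tasks.add(() -> extract(document));
            }
            List<String> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the submitted tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```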