Class AbstractParser

java.lang.Object
com.logicaldoc.core.parser.AbstractParser
All Implemented Interfaces:
Parser
Direct Known Subclasses:
AbiWordParser, CatchAllParser, DummyParser, EpubParser, HTMLParser, KOfficeParser, OpenOfficeParser, PDFParser, PPTParser, RarParser, RTFParser, SevenZipParser, TarParser, TXTParser, XLSParser, XMLParser, ZABWParser, ZipParser

public abstract class AbstractParser extends Object implements Parser
Abstract implementation of a Parser
Since:
3.5
Author:
Marco Meschieri - LogicalDOC
  • Constructor Details

    • AbstractParser

      public AbstractParser()
  • Method Details

    • parse

      public String parse(File file, String filename, String encoding, Locale locale, String tenant) throws ParsingException
      Description copied from interface: Parser
      Same as Parser.parse(InputStream, String, String, Locale, String), but use this when you have a file rather than a stream.
      Specified by:
      parse in interface Parser
      Parameters:
      file - the file
      filename - name of the file
      encoding - character encoding
      locale - the locale
      tenant - name of the tenant
      Returns:
      the extracted text
      Throws:
      ParsingException - error in the parsing
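
      The relationship between the file-based and stream-based overloads can be illustrated with a minimal sketch. It is not the LogicalDOC source: FileDelegatingParserSketch, plainText() and roundTrip() are hypothetical names, and IOException stands in for ParsingException so that the example compiles without the library. It assumes the file overload simply opens the file and delegates to the stream overload.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Locale;

// Hypothetical stand-in, NOT the LogicalDOC class: it only mirrors the shape of
// the two five-argument parse overloads so the sketch compiles on its own.
abstract class FileDelegatingParserSketch {

    // Stream-based extraction, supplied by a concrete subclass.
    abstract String parse(InputStream input, String filename, String encoding,
            Locale locale, String tenant) throws IOException;

    // File-based convenience overload (assumed behaviour): open the file,
    // delegate to the stream overload, and always close the stream.
    String parse(File file, String filename, String encoding, Locale locale, String tenant)
            throws IOException {
        try (InputStream input = Files.newInputStream(file.toPath())) {
            return parse(input, filename, encoding, locale, tenant);
        }
    }

    // Assumed plain-text subclass: decodes all bytes with the given encoding.
    static FileDelegatingParserSketch plainText() {
        return new FileDelegatingParserSketch() {
            @Override
            String parse(InputStream input, String filename, String encoding,
                    Locale locale, String tenant) throws IOException {
                return new String(input.readAllBytes(), encoding);
            }
        };
    }

    // Demo helper: writes the text to a temporary file and parses it back.
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("sketch", ".txt");
            try {
                Files.write(file.toPath(), text.getBytes(StandardCharsets.UTF_8));
                return plainText().parse(file, file.getName(), "UTF-8",
                        Locale.ENGLISH, "default");
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello parser"));
    }
}
```

      The try-with-resources block is the design point here: the caller hands over a File, so the overload owns the stream it opens and is responsible for closing it.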
    • parse

      public String parse(File file, String filename, String encoding, Locale locale, String tenant, Document document, String fileVersion) throws ParsingException
      Description copied from interface: Parser
      Same as Parser.parse(InputStream, ParseParameters), but use this when you have a file rather than a stream.
      Specified by:
      parse in interface Parser
      Parameters:
      file - the file
      filename - name of the file
      encoding - character encoding
      locale - the locale
      tenant - name of the tenant
      document - the document the file belongs to (optional)
      fileVersion - the file version being processed (optional)
      Returns:
      the extracted text
      Throws:
      ParsingException - error in the parsing
    • parse

      public String parse(InputStream input, String filename, String encoding, Locale locale, String tenant) throws ParsingException
      Description copied from interface: Parser
      Extracts the text content of the given binary document. The content type and character encoding (if available and applicable) are given as arguments.

      The implementation may either read and parse the given document immediately or return a reader that does so incrementally. The only constraint is that the implementation must close the given stream, at the latest, when the returned reader is closed. The caller, on the other hand, is responsible for closing the returned reader.

      The implementation should only throw an exception on transient errors, i.e. when it can expect to be able to successfully extract the text content of the same binary at another time. An effort should be made to recover from syntax errors and other similar problems.

      This method should be thread-safe, i.e. different threads may invoke it simultaneously to extract the text content of different documents. The returned reader, on the other hand, does not need to be thread-safe.

      Parsing must complete within the number of seconds specified in the parser.timeout configuration property.

      If a timeout occurs, the value of the parser.timeout.retain configuration property determines whether the text extracted so far is retained or discarded.

      Specified by:
      parse in interface Parser
      Parameters:
      input - binary content from which to extract the text
      filename - name of the file
      encoding - character encoding
      locale - the locale
      tenant - name of the tenant
      Returns:
      the extracted text
      Throws:
      ParsingException - error in the parsing
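
      The timeout behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the LogicalDOC implementation: TimeoutParseSketch, Extraction and parseWithTimeout are hypothetical names, the limit is given in milliseconds instead of seconds to keep the demo short, and returning an empty string when the partial text is not retained is an assumption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, NOT LogicalDOC code: one possible way to enforce a time
// limit on extraction while optionally keeping the text produced so far.
final class TimeoutParseSketch {

    // A unit of extraction work that appends text to a shared buffer as it goes.
    interface Extraction {
        void run(StringBuffer out) throws Exception;
    }

    // timeoutMillis plays the role of parser.timeout (which is in seconds);
    // retain plays the role of parser.timeout.retain.
    static String parseWithTimeout(Extraction work, long timeoutMillis, boolean retain) {
        StringBuffer out = new StringBuffer(); // synchronized, readable from the caller thread
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> task = executor.submit(() -> {
            work.run(out);
            return null;
        });
        try {
            task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return out.toString();
        } catch (TimeoutException e) {
            task.cancel(true); // interrupt the worker
            // Assumption: an empty result stands for "partial text discarded".
            return retain ? out.toString() : "";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Extraction slow = out -> {
            out.append("page1 ");
            Thread.sleep(5000); // simulates a document that takes too long
            out.append("page2");
        };
        System.out.println("[" + parseWithTimeout(slow, 200, true) + "]");
        System.out.println("[" + parseWithTimeout(slow, 200, false) + "]");
    }
}
```

      Running the extraction on a separate thread lets the caller stop waiting once the limit is reached, and the shared buffer is what makes retaining partial output possible.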
    • parse

      public String parse(InputStream input, ParseParameters parameters) throws ParsingException
      Description copied from interface: Parser
      Extracts the text content of the given binary document. The content type and character encoding (if available and applicable) are given as arguments.

      The implementation may either read and parse the given document immediately or return a reader that does so incrementally. The only constraint is that the implementation must close the given stream, at the latest, when the returned reader is closed. The caller, on the other hand, is responsible for closing the returned reader.

      The implementation should only throw an exception on transient errors, i.e. when it can expect to be able to successfully extract the text content of the same binary at another time. An effort should be made to recover from syntax errors and other similar problems.

      This method should be thread-safe, i.e. different threads may invoke it simultaneously to extract the text content of different documents. The returned reader, on the other hand, does not need to be thread-safe.

      Parsing must complete within the number of seconds specified in the parser.timeout configuration property.

      If a timeout occurs, the value of the parser.timeout.retain configuration property determines whether the text extracted so far is retained or discarded.

      Specified by:
      parse in interface Parser
      Parameters:
      input - binary content from which to extract the text
      parameters - the parameters
      Returns:
      the extracted text
      Throws:
      ParsingException - error in the parsing
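
      The thread-safety requirement above is easiest to meet with a parse method that keeps no mutable state between calls. The following sketch shows that idea only. ThreadSafeParseSketch and parseConcurrently are hypothetical names, the single-argument parse is a simplified stand-in for the real signature, and trimming UTF-8 text is a placeholder for actual extraction.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, NOT LogicalDOC code: a stateless parse method that can
// be invoked by several threads at once on different documents.
final class ThreadSafeParseSketch {

    // No instance or static fields are touched: everything a call needs lives
    // in local variables, so concurrent invocations cannot interfere.
    static String parse(InputStream input) throws IOException {
        return new String(input.readAllBytes(), StandardCharsets.UTF_8).trim();
    }

    // Parses every document on a small thread pool and returns the results
    // in the same order as the inputs.
    static List<String> parseConcurrently(List<String> documents) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String doc : documents) {
                byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
                Callable<String> job = () -> parse(new ByteArrayInputStream(bytes));
                pending.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("concurrent parsing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseConcurrently(List.of(" a ", "b", " c")));
    }
}
```

      Each task gets its own input stream, which matches the contract: the method is shared between threads, but a stream or reader is not.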
    • countPages

      public int countPages(InputStream input, String filename)
      Description copied from interface: Parser
      Counts the number of pages of the given binary document.
      Specified by:
      countPages in interface Parser
      Parameters:
      input - binary content from which to extract the text
      filename - name of the file
      Returns:
      the number of pages
    • countPages

      public int countPages(File file, String filename)
      Description copied from interface: Parser
      Same as Parser.countPages(InputStream, String), but use this when you have a file rather than a stream.
      Specified by:
      countPages in interface Parser
      Parameters:
      file - the file
      filename - name of the file
      Returns:
      the number of pages