java.lang.Object
- com.logicaldoc.core.parser.AbstractParser
- - com.logicaldoc.core.parser.RTFParser
- - - com.logicaldoc.core.parser.DOCParser

All Implemented Interfaces:

Parser
```
public class DOCParser
extends RTFParser
```
Parses an MS Word (*.doc, *.dot) file to extract the text it contains. This class uses the external HWPF library provided by the Apache Jakarta POI project. Although that library can also extract the document author and version, those features are not used because the library is known to be buggy. What matters is getting the text content; extracting the author, date, etc. is not essential.

Since:

3.5

Author:

Michael Scholz, Sebastian Stein, Alessandro Gasparini - LogicalDOC

Constructor Summary

Constructors
Constructor Description

DOCParser()

Method Summary

Modifier and Type	Method	Description
`int`	`countPages(InputStream input, String filename)`	Counts the number of pages of the given binary document.
`String`	`parse(InputStream input, String filename, String encoding, Locale locale, String tenant, Document document, String fileVersion)`	Extracts the text content of the given binary document.

Methods inherited from class com.logicaldoc.core.parser.RTFParser
countPages, internalParse

Methods inherited from class com.logicaldoc.core.parser.AbstractParser
parse, parse, parse

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - DOCParser
```
public DOCParser()
```
- Method Detail
  - parse
```
public String parse(InputStream input,
                    String filename,
                    String encoding,
                    Locale locale,
                    String tenant,
                    Document document,
                    String fileVersion)
```
    Description copied from interface: Parser
    
    Extracts the text content of the given binary document. The content type and character encoding (if available and applicable) are given as arguments.
    The implementation can choose either to read and parse the given document immediately or to return a reader that does it incrementally. The only constraint is that the implementation must close the given stream, at the latest, when the returned reader is closed. The caller, on the other hand, is responsible for closing the returned reader.
    
    The implementation should only throw an exception on transient errors, i.e. when it can expect to be able to successfully extract the text content of the same binary at another time. An effort should be made to recover from syntax errors and other similar problems.
    
    This method should be thread-safe, i.e. it is possible that this method is invoked simultaneously by different threads to extract the text content of different documents. On the other hand the returned reader does not need to be thread-safe.
    
    Parsing must complete within the number of seconds specified by the parser.timeout configuration property.
    
    Specified by:
    
    parse in interface Parser
    
    Overrides:
    
    parse in class AbstractParser
    
    Parameters:
    
    input - binary content from which to extract the text
    
    filename - name of the file
    
    encoding - character encoding
    
    locale - the locale
    
    tenant - name of the tenant
    
    document - the document the file belongs to (optional)
    
    fileVersion - the file version being processed (optional)
    
    Returns:
    
    the extracted text
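    
    The timeout requirement described above can be illustrated with a minimal sketch. It does not use any LogicalDOC class: `TextExtractor` and `parseWithTimeout` are hypothetical names, the timeout is passed in directly rather than read from parser.timeout, and returning an empty string on timeout or failure is an assumed policy. How LogicalDOC itself enforces the limit is not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedParseSketch {

    /** Hypothetical stand-in for a parser call; not a LogicalDOC type. */
    interface TextExtractor {
        String extract(InputStream input, String filename) throws IOException;
    }

    /**
     * Runs the extraction on a worker thread and gives up after timeoutSeconds.
     * Returns an empty string on timeout or failure (assumed policy).
     */
    static String parseWithTimeout(TextExtractor extractor, InputStream input,
                                   String filename, long timeoutSeconds) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(() -> extractor.extract(input, filename));
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker that is still parsing
            return "";
        } catch (ExecutionException e) {
            return ""; // the extraction itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "";
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Toy extractor: decodes the stream as UTF-8 text.
        TextExtractor plainText = (in, name) ->
                new String(in.readAllBytes(), StandardCharsets.UTF_8);
        InputStream input = new ByteArrayInputStream(
                "sample text".getBytes(StandardCharsets.UTF_8));
        System.out.println(parseWithTimeout(plainText, input, "sample.doc", 5));
    }
}
```

    Each call uses its own executor and its own stream, so several threads can invoke `parseWithTimeout` at the same time on different documents, which matches the thread-safety expectation stated above.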
  - countPages
```
public int countPages(InputStream input,
                      String filename)
```
    Description copied from interface: Parser
    
    Counts the number of pages of the given binary document.
    
    Specified by:
    
    countPages in interface Parser
    
    Overrides:
    
    countPages in class RTFParser
    
    Parameters:
    
    input - binary content from which to extract the text
    
    filename - name of the file
    
    Returns:
    
    the number of pages
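    
    A caller-side sketch of a page-count call follows. It is only an illustration: `PageCounter` is a hypothetical interface that mirrors the `countPages(InputStream, String)` signature, the declared `IOException` is an assumption, and falling back to one page on failure is an assumed policy. This page does not say whether `countPages` closes the stream, so the sketch closes it in the caller with try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class CountPagesSketch {

    /** Hypothetical stand-in mirroring countPages(InputStream, String); not a LogicalDOC type. */
    interface PageCounter {
        int countPages(InputStream input, String filename) throws IOException;
    }

    /** Counts pages and falls back to 1 if counting fails (assumed policy). */
    static int safeCountPages(PageCounter counter, byte[] content, String filename) {
        // try-with-resources closes the stream even if counting throws
        try (InputStream input = new ByteArrayInputStream(content)) {
            return Math.max(counter.countPages(input, filename), 1);
        } catch (IOException e) {
            return 1;
        }
    }

    public static void main(String[] args) {
        // Toy counter: treats each form feed character as a page break.
        PageCounter formFeedCounter = (input, filename) -> {
            int pages = 1;
            int b;
            while ((b = input.read()) != -1) {
                if (b == '\f') {
                    pages++;
                }
            }
            return pages;
        };
        byte[] content = "page one\fpage two\fpage three"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(safeCountPages(formFeedCounter, content, "report.doc")); // prints 3
    }
}
```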
