Sunday, September 7, 2008

Extracting text from documents using Java

To extract the text from an HTML document, i.e. to pull out everything except the HTML tags, use Swing's HTMLEditorKit:
URL url = new URL("http://localhost:8080/index.jsp");  
EditorKit kit = new HTMLEditorKit();  
Document document = kit.createDefaultDocument();  
kit.read(url.openStream(), document, 0);  // parse the HTML into the document, starting at offset 0  
System.out.println(document.getText(0, document.getLength()));
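The same approach can be tried without a running server by parsing HTML from a string. The following is only a minimal sketch: the class name HtmlTextExtractor, the method extractText and the sample markup are made up for illustration, and setting the "IgnoreCharsetDirective" property is an extra step not in the snippet above, added on the assumption that the page may declare its own charset in a meta tag.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the original post.
public class HtmlTextExtractor {

    /** Parses HTML from the reader and returns the document text without tags. */
    public static String extractText(Reader html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document document = kit.createDefaultDocument();
        // Assumption: ignore charset declarations inside the page so the parser
        // does not abort with a ChangedCharSetException.
        document.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(html, document, 0);
        return document.getText(0, document.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Sample markup invented for this example.
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)).trim());
    }
}
```

Passing a Reader instead of a URL stream keeps the extraction logic independent of where the HTML comes from; for a remote page, wrapping url.openStream() in an InputStreamReader should work the same way.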
For PDF text extraction, use PDFBox:
URL url = new URL("http://localhost:8080/Document.pdf");  
PDDocument document = PDDocument.load(url.openStream());  
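PDFTextStripper pdfStripper = new PDFTextStripper();  
pdfStripper.setSortByPosition(false);  
pdfStripper.setStartPage(1); // first page to extract (1-based)  
pdfStripper.setEndPage(3);   // last page to extract (inclusive)  
System.out.println(pdfStripper.getText(document)); // the extraction itself  
document.close(); // release the resources held by the PDF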
For text extraction from MS Office documents, use Apache POI. Sample code for extracting text from a .doc file is as follows:
POIFSFileSystem doc = new POIFSFileSystem(new FileInputStream("c:/Resume.doc"));
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText()); // the full text of the Word document
A good article on the same can be found at
