Sunday, September 7, 2008

Extracting text from documents using Java

To extract the text from an HTM/HTML document, that is, everything except the HTML tags, use the Swing HTMLEditorKit:
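// Needs java.net.URL, java.io.InputStream, javax.swing.text.Document, javax.swing.text.EditorKit, javax.swing.text.html.HTMLEditorKit
URL url = new URL("http://localhost:8080/index.jsp");
EditorKit kit = new HTMLEditorKit();
Document document = kit.createDefaultDocument();
try (InputStream in = url.openStream()) {
    kit.read(in, document, 0);  // parse the HTML into the document model
}
System.out.println(document.getText(0, document.getLength()));  // document text without the tags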
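The same approach can be wrapped in a small helper that works on an HTML string instead of a URL. This is only a sketch: the class name HtmlTextExtractor and the method extractText are made up for illustration, and the IgnoreCharsetDirective property is set on the assumption that a page with a charset meta tag could otherwise make the parser throw a ChangedCharSetException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper, not part of any library.
final class HtmlTextExtractor {

    // Returns the text content of the given HTML markup, without the tags.
    static String extractText(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document document = kit.createDefaultDocument();
        // Assumption: ignore <meta charset=...> directives, since the input
        // is already a decoded String and needs no re-reading.
        document.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try (Reader reader = new StringReader(html)) {
            kit.read(reader, document, 0);  // parse the markup into the document
            return document.getText(0, document.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            // Should not happen: the range 0..getLength() is always valid.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling HtmlTextExtractor.extractText("<p>Hello <b>world</b></p>") should give back the words without the markup; the result may also contain line breaks that the document model inserts around block elements, so trim it if needed.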
For PDF text extraction, use PDFBox from www.pdfbox.org:
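// PDDocument.load is the older PDFBox API; PDFBox 3.x loads documents through Loader.loadPDF instead
URL url = new URL("http://localhost:8080/Document.pdf");
PDDocument document = PDDocument.load(url.openStream());
try {
    PDFTextStripper pdfStripper = new PDFTextStripper();
    pdfStripper.setSortByPosition(false);  // keep the order in which the text is stored in the PDF
    pdfStripper.setStartPage(1);  // first page to extract (1-based)
    pdfStripper.setEndPage(3);    // last page to extract (inclusive)
    System.out.println(pdfStripper.getText(document));
} finally {
    document.close();  // release the document even if extraction fails
}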
For Microsoft Office documents, use Apache POI (http://poi.apache.org/). Sample code for extracting text from a .doc file:
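// POIFSFileSystem is in org.apache.poi.poifs.filesystem, WordExtractor in org.apache.poi.hwpf.extractor
POIFSFileSystem doc = new POIFSFileSystem(new FileInputStream("c:/Resume.doc"));
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());  // full text of the Word document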
A good article on the subject is available at http://www.informit.com/guides/content.aspx?g=java&seqNum=354
