
Sunday, September 7, 2008

Extracting text from documents using Java

To extract the text from an HTML document, that is, everything except the HTML tags, use Swing's HTMLEditorKit:
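import java.net.URL;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

URL url = new URL("http://localhost:8080/index.jsp");
EditorKit kit = new HTMLEditorKit();
Document document = kit.createDefaultDocument();
kit.read(url.openStream(), document, 0);
System.out.println(document.getText(0, document.getLength()));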
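The same approach can be tried without a running web server by parsing HTML from a string. The following is only a minimal sketch: the class name HtmlTextExtractor and the method extractText are made up for illustration and do not come from any library. It also sets the "IgnoreCharsetDirective" document property, on the assumption that the pages you parse may carry a meta charset declaration, which can otherwise make the Swing parser throw a ChangedCharSetException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

public class HtmlTextExtractor {

    // Hypothetical helper (not a library API): returns the text of an HTML string
    // with the markup removed.
    public static String extractText(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document document = kit.createDefaultDocument();
        // Assumption: input may contain <meta ... charset=...>; ignoring the directive
        // keeps the parser from aborting with ChangedCharSetException.
        document.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try (Reader reader = new StringReader(html)) {
            kit.read(reader, document, 0);
        }
        return document.getText(0, document.getLength());
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Reading from a Reader rather than a raw InputStream also lets you choose the character encoding yourself, for example by wrapping url.openStream() in an InputStreamReader with an explicit charset.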
For PDF text extraction, use PDFBox from www.pdfbox.org:
URL url = new URL("http://localhost:8080/Document.pdf");  
PDDocument document = PDDocument.load(url.openStream());  
PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setSortByPosition(false);
pdfStripper.setStartPage(1); // first page to extract (1-based)
pdfStripper.setEndPage(3);   // last page to extract (inclusive)
System.out.println(pdfStripper.getText(document));  
document.close();
For Microsoft Office documents, use Apache POI (http://poi.apache.org/). Sample code for extracting text from a .doc file:
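import java.io.FileInputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
POIFSFileSystem doc = new POIFSFileSystem(new FileInputStream("c:/Resume.doc"));
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());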
A good article on this topic can be found at http://www.informit.com/guides/content.aspx?g=java&seqNum=354