Java Geeks: PDFBox

Sunday, September 7, 2008

Extracting text from documents using Java

To extract the text from an HTM/HTML document, i.e. everything except the HTML tags, you can use the Swing HTML parser that ships with the JDK:

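import java.net.URL;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

URL url = new URL("http://localhost:8080/index.jsp");
EditorKit kit = new HTMLEditorKit();
Document document = kit.createDefaultDocument();
kit.read(url.openStream(), document, 0); // parse the page into the document model
System.out.println(document.getText(0, document.getLength())); // text only, no tags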
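
The same approach can be wrapped in a small reusable method. The sketch below is only an illustration: the class name HtmlTextExtractor and the method extractText are made up for this example, and it reads from a Reader so it can be tried without a running server. It also sets the "IgnoreCharsetDirective" property, on the assumption that your pages may carry a <meta> charset declaration, which can otherwise stop the parser with a ChangedCharSetException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

public class HtmlTextExtractor {

    // Hypothetical helper (not part of the JDK): returns the text content
    // of the HTML supplied by 'in', with all tags removed.
    public static String extractText(Reader in) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document document = kit.createDefaultDocument();
        // Ignore <meta> charset declarations instead of aborting the parse.
        document.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(in, document, 0);
        return document.getText(0, document.getLength());
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        // trim() drops the leading/trailing line breaks the document model may add.
        System.out.println(extractText(new StringReader(html)).trim());
    }
}
```

To read from a URL as in the snippet above, pass new InputStreamReader(url.openStream()) to extractText.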

For PDF text extraction, use PDFBox from www.pdfbox.org:

URL url = new URL("http://localhost:8080/Document.pdf");
PDDocument document = PDDocument.load(url.openStream());
PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setSortByPosition(false);
pdfStripper.setStartPage(1); // first page to extract (pages are numbered from 1)
pdfStripper.setEndPage(3);   // last page to extract (inclusive)
System.out.println(pdfStripper.getText(document));
document.close();

For Microsoft Office documents, use Apache POI (http://poi.apache.org/). Sample code for extracting text from a .doc file:

POIFSFileSystem doc = new POIFSFileSystem(new FileInputStream("c:/Resume.doc"));
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());

A good article on the same topic can be found at http://www.informit.com/guides/content.aspx?g=java&seqNum=354
