To extract the text from an HTML document, i.e. everything except the HTML tags, use Swing's HTMLEditorKit:
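URL url = new URL("http://localhost:8080/index.jsp");
EditorKit kit = new HTMLEditorKit();
Document document = kit.createDefaultDocument();
// Ignore <meta> charset directives; otherwise the parser may throw ChangedCharSetException
document.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
kit.read(url.openStream(), document, 0);
// Print the document text with all tags stripped
System.out.println(document.getText(0, document.getLength()));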
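For reference, here is a self-contained sketch of the same approach that can be run without a web server. It parses HTML from a string rather than a URL. The class name HtmlTextExtractor and the helper extractText are hypothetical names chosen for illustration and are not part of the original snippet. The exact whitespace in the result depends on how the Swing parser lays out block elements, so the sketch trims it before printing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

public class HtmlTextExtractor {

    // Hypothetical helper (not from the original post): parses HTML from any
    // Reader and returns the document text with the tags removed.
    public static String extractText(Reader html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document document = kit.createDefaultDocument();
        // The Reader has already decoded the characters, so charset directives
        // in <meta> tags are ignored to avoid a ChangedCharSetException.
        document.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(html, document, 0);
        return document.getText(0, document.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Sample markup, assumed for illustration only.
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)).trim());
    }
}
```

Taking a Reader instead of a URL keeps the helper easy to test; for a live page, an InputStreamReader over url.openStream() with an explicit charset could be passed in instead.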
For PDF text extraction, use PDFBox from www.pdfbox.org (now an Apache project):
URL url = new URL("http://localhost:8080/Document.pdf");
PDDocument document = PDDocument.load(url.openStream());
PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setSortByPosition(false);
pdfStripper.setStartPage(1); // first page to extract (1-based)
pdfStripper.setEndPage(3); // last page to extract (inclusive)
System.out.println(pdfStripper.getText(document));
document.close();
For text extraction from MS Office documents, use Apache POI (http://poi.apache.org/).
Sample code for extracting text from a .doc file:
POIFSFileSystem doc = new POIFSFileSystem(new FileInputStream("c:/Resume.doc"));
WordExtractor extractor = new WordExtractor( doc );
System.out.println(extractor.getText());
A good article on the same topic can be found at http://www.informit.com/guides/content.aspx?g=java&seqNum=354