Jsoup is to HTML what XML parsers are to XML. It parses HTML, including messy real-world HTML. Its jQuery-like selector syntax is easy to use and flexible enough to get the desired result. In this tutorial, we will go through many Jsoup examples.
Table of Contents
- What all you can achieve with Jsoup?
- Runtime Dependencies
- Main classes you should know
- Loading a Document
- Get title from HTML
- Get Fav icon of HTML page
- Get all links in HTML page
- Get all images in HTML page
- Get meta information of URL
- Get form attributes in html page
- Update attributes/content of elements
- Sanitize untrusted HTML (to prevent XSS)
What all you can achieve with Jsoup?
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. With it you can:
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML
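As a small illustration of the "real world HTML" point, here is a minimal sketch that parses a fragment with unclosed tags. The class name LenientParseDemo and the sample markup are made up for this example; only Jsoup.parse() and select() come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

final class LenientParseDemo {

    // Parses the given markup and returns every <p> element jsoup recovers from it.
    static Elements paragraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p");
    }

    public static void main(String[] args) {
        // Hypothetical input: neither <p> is closed and there is no <html> or <body> wrapper.
        Elements paragraphs = paragraphs("<p>One<p>Two");
        System.out.println(paragraphs.size());        // prints 2
        System.out.println(paragraphs.get(0).text()); // prints One
        System.out.println(paragraphs.get(1).text()); // prints Two
    }
}
```

The parser closes the first paragraph when it meets the second opening tag, which is what a browser would do with the same input.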
Runtime Dependencies
You can include Jsoup in your project using the Maven dependency below.

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.2</version>
    </dependency>

Or you can download jsoup-1.8.2.jar directly from the jsoup.org website and add it to the project's lib folder.
Main classes you should know
Though the library contains many classes, you will mostly be dealing with the three classes below. Let's look at them.
org.jsoup.Jsoup
The Jsoup class is the entry point for any program. It provides methods for loading and parsing HTML documents from a variety of sources.
Some important methods of Jsoup class are given below:
| Method | Description |
| --- | --- |
| static Connection connect(String url) | Creates and returns a connection to the given URL. |
| static Document parse(File in, String charsetName) | Parses the file, read with the specified charset, into a document. |
| static Document parse(String html) | Parses the given HTML code into a document. |
| static String clean(String bodyHtml, Whitelist whitelist) | Returns safe HTML from input HTML, by parsing the input HTML and filtering it through a white-list of permitted tags and attributes. |
org.jsoup.nodes.Document
This class represents an HTML document loaded through the Jsoup library. Use it for operations that apply to the whole HTML document.
The important methods of the Document class are listed at https://jsoup.org/apidocs/org/jsoup/nodes/Document.html.
org.jsoup.nodes.Element
An HTML element consists of a tag name, attributes, and child nodes. Using the Element class, you can extract data, traverse the node graph, and manipulate the HTML.
The important methods of the Element class are listed at https://jsoup.org/apidocs/org/jsoup/nodes/Element.html.
Now let's look at some examples of working with HTML documents using the Jsoup APIs.
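To make the three parts of an element concrete, the following sketch reads the tag name, an attribute, and the children of a small list, then appends a new child. The class name ElementTour and the sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ElementTour {

    // Builds a tiny document in memory and returns its <ul> element.
    static Element sampleList() {
        Document doc = Jsoup.parse(
                "<ul id=\"langs\" class=\"plain\"><li>Java</li><li>Kotlin</li></ul>");
        return doc.getElementById("langs");
    }

    public static void main(String[] args) {
        Element list = sampleList();

        System.out.println(list.tagName());         // tag name: ul
        System.out.println(list.attr("class"));     // attribute value: plain
        System.out.println(list.children().size()); // child elements: 2

        // Traverse the child elements.
        for (Element item : list.children()) {
            System.out.println(item.text());
        }

        // Manipulate the tree: append a new <li> and set its text.
        list.appendElement("li").text("Scala");
        System.out.println(list.children().size()); // now 3
    }
}
```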
Loading a Document
Load a document from URL
Use Jsoup.connect() method to load HTML from a URL.
    try {
        // connect() needs an absolute http or https URL
        Document document = Jsoup.connect("https://howtodoinjava.com").get();
        System.out.println(document.title());
    } catch (IOException e) {
        e.printStackTrace();
    }
Load a document from File
Use Jsoup.parse() method to load HTML from a file.
    try {
        Document document = Jsoup.parse(new File("c:/temp/demo.html"), "utf-8");
        System.out.println(document.title());
    } catch (IOException e) {
        e.printStackTrace();
    }
Load a document from String
Use Jsoup.parse() method to load HTML from a string.
    String html = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";
    // Parsing from a String does no I/O, so there is no IOException to catch here
    Document document = Jsoup.parse(html);
    System.out.println(document.title());
Get title from HTML
As shown above, call the document.title() method to get the title of an HTML page.

    try {
        Document document = Jsoup.parse(new File("C:/Users/xyz/Desktop/howtodoinjava.html"), "utf-8");
        System.out.println(document.title());
    } catch (IOException e) {
        e.printStackTrace();
    }
Get Fav icon of HTML page
Assuming that the favicon is referenced in the <head> section of the HTML document, either by a link whose href contains .ico or .png or by a meta tag with itemprop=image, you can use the code below.

    String favImage = "Not Found";
    try {
        Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
        // First choice: a <link> in <head> pointing at an .ico or .png file
        Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
        if (element == null) {
            // Fallback: a <meta itemprop="image"> tag
            element = document.head().select("meta[itemprop=image]").first();
            if (element != null) {
                favImage = element.attr("content");
            }
        } else {
            favImage = element.attr("href");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(favImage);
Get all links in HTML page
To get all links present in a webpage, use the code below.

    try {
        Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
        Elements links = document.select("a[href]");
        for (Element link : links) {
            System.out.println("link : " + link.attr("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
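Pages often contain relative links such as /about. If you need absolute URLs, one option is to give the parser a base URI and read each link with absUrl(). The sketch below assumes the HTML is already in a string; the class name AbsoluteLinks and the example.com addresses are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class AbsoluteLinks {

    // Returns every a[href] target resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl() resolves relative hrefs against the document's base URI
            result.add(link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup with one relative and one absolute link.
        String html = "<a href=\"/about\">About</a>"
                + "<a href=\"https://example.org/x\">X</a>";
        for (String url : absoluteLinks(html, "https://example.com/")) {
            System.out.println(url);
        }
    }
}
```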
Get all images in HTML page
To get all images displayed in a webpage, use the code below.

    try {
        Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
        // Case-insensitive match on common image extensions in the src attribute
        Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
        for (Element image : images) {
            System.out.println("src : " + image.attr("src"));
            System.out.println("height : " + image.attr("height"));
            System.out.println("width : " + image.attr("width"));
            System.out.println("alt : " + image.attr("alt"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
Get meta information of URL
Meta information is what search engines, like Google, use to determine the content of a webpage for indexing purposes. It is present as meta tags in the HEAD section of the HTML page. To get meta information about a webpage, use the code below.

    try {
        Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
        // Note: get(0) throws IndexOutOfBoundsException and first() returns null
        // when the page has no matching meta tag
        String description = document.select("meta[name=description]").get(0).attr("content");
        System.out.println("Meta description : " + description);
        String keywords = document.select("meta[name=keywords]").first().attr("content");
        System.out.println("Meta keyword : " + keywords);
    } catch (IOException e) {
        e.printStackTrace();
    }
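Not every page declares both tags, so it can be worth guarding the lookup. Here is a minimal sketch of a null-safe helper; the class MetaReader, the method metaContent, and the fallback parameter are hypothetical names, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class MetaReader {

    // Returns the content of <meta name="..."> or the fallback when the tag is absent.
    // Assumes name is a plain token such as "description" (no quotes or brackets).
    static String metaContent(Document doc, String name, String fallback) {
        Element meta = doc.select("meta[name=" + name + "]").first();
        return meta == null ? fallback : meta.attr("content");
    }

    public static void main(String[] args) {
        // Hypothetical page that declares a description but no keywords.
        Document doc = Jsoup.parse("<html><head>"
                + "<meta name=\"description\" content=\"Jsoup tutorial\">"
                + "</head><body></body></html>");

        System.out.println(metaContent(doc, "description", "n/a")); // prints Jsoup tutorial
        System.out.println(metaContent(doc, "keywords", "n/a"));    // prints n/a
    }
}
```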
Get form attributes in html page
Getting the form input elements in a webpage is simple. Find the FORM element using its unique id, then find all INPUT elements present in that form.

    // Jsoup.parse(File, String) throws IOException; catch or declare it in the enclosing method
    Document doc = Jsoup.parse(new File("c:/temp/howtodoinjava.com"), "utf-8");
    Element formElement = doc.getElementById("loginForm");
    if (formElement != null) { // getElementById returns null when no element has that id
        Elements inputElements = formElement.getElementsByTag("input");
        for (Element inputElement : inputElements) {
            String key = inputElement.attr("name");
            String value = inputElement.attr("value");
            System.out.println("Param name: " + key + " \nParam value: " + value);
        }
    }
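If you want the parameters as data rather than printed output, one possible variation collects them into a map. This is only a sketch: the class FormInputs, the method formInputs, and the sample form are assumptions, and inputs without a name attribute are skipped by choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class FormInputs {

    // Collects name/value pairs of the <input> elements inside the form with the given id.
    // Returns an empty map when no element has that id.
    static Map<String, String> formInputs(String html, String formId) {
        Document doc = Jsoup.parse(html);
        Map<String, String> params = new LinkedHashMap<>(); // keeps document order
        Element form = doc.getElementById(formId);
        if (form == null) {
            return params;
        }
        for (Element input : form.getElementsByTag("input")) {
            String name = input.attr("name");
            if (!name.isEmpty()) {
                // attr() returns an empty string when the attribute is missing
                params.put(name, input.attr("value"));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical login form.
        String html = "<form id=\"loginForm\">"
                + "<input type=\"text\" name=\"user\" value=\"bob\">"
                + "<input type=\"password\" name=\"pass\">"
                + "<input type=\"submit\" value=\"Log in\">"
                + "</form>";
        System.out.println(formInputs(html, "loginForm"));
    }
}
```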
Update attributes/content of elements
Once you have found the desired elements using the approaches above, you can use the Jsoup APIs to update their attributes or inner HTML. For example, I want to add rel="nofollow" to all links present inside the document.

    try {
        Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
        Elements links = document.select("a[href]");
        // Sets the attribute on every matched link in the in-memory document
        links.attr("rel", "nofollow");
    } catch (IOException e) {
        e.printStackTrace();
    }
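The same idea can be tried on an in-memory string, which also makes it easy to see an inner HTML update. In this sketch the class UpdateElements, the element id note, and the sample markup are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class UpdateElements {

    // Adds rel="nofollow" to every link that has an href and replaces the
    // inner HTML of the element with id "note", if such an element exists.
    static Document update(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        Element note = doc.getElementById("note");
        if (note != null) {
            note.html("<em>updated</em>");
        }
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical markup: two real links and one named anchor without href.
        String html = "<p id=\"note\">old</p>"
                + "<a href=\"/a\">A</a><a href=\"/b\">B</a>"
                + "<a name=\"anchor\">no href</a>";
        Document doc = update(html);
        // Print the modified body to see the result.
        System.out.println(doc.body().html());
    }
}
```

The anchor without an href is left untouched because the selector a[href] does not match it.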
Sanitize untrusted HTML (to prevent XSS)
Suppose your application displays HTML snippets submitted by users, for example HTML content typed into a comment box. This can lead to a very serious problem if you display that HTML directly without cleaning it first. A user can put a malicious script in it and redirect your users to another, harmful website.
To clean this HTML, Jsoup provides the Jsoup.clean() method. It takes the HTML content as a String and returns clean HTML. To perform this task, Jsoup uses a whitelist sanitizer. The jsoup whitelist sanitizer works by parsing the input HTML (in a safe, sand-boxed environment), and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output.
It does not use regular expressions, which are inappropriate for this task.
The cleaner is useful not only for avoiding XSS, but also for limiting the range of elements the user can provide: you may be OK with textual elements such as a and strong, but not with structural elements such as div or table.

    String dirtyHTML = "<p><a href='https://howtodoinjava.com/' onclick='sendCookiesToMe()'>Link</a></p>";
    String cleanHTML = Jsoup.clean(dirtyHTML, Whitelist.basic());
    System.out.println(cleanHTML);

Output:

    <p><a href="https://howtodoinjava.com/" rel="nofollow">Link</a></p>
That's all for this easy yet powerful and useful library. Drop me your questions in the comments section.
Happy Learning !!
How can I read forms or elements from HTML files and then put that HTML into a JSP file on the fly? Let me explain more.
a1.html
something in here
——————
a2.html
something more in here
——————
a3.jsp
a form
——————
a4.jsp/html
Put the info here after the login has been processed by a servlet, which will read a1.html and a2.html and set them as the body of this page.
I want to filter the hyperlinks while parsing a URL. For example, I load the page with Jsoup.connect("https://www.yahoo.com").get(); and get all hyperlinks with Elements links = doc.select("a[href]"); I want to ignore URLs related to Facebook, Twitter, and Google. Please help me with the logic in Java with Jsoup.
I know it is a late answer but here you go
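A minimal sketch of one way to do this, assuming that "related to" means the link's host is one of those domains or a subdomain of them. The class LinkFilter, the BLOCKED list, and the sample markup are all hypothetical; for a live page you would build the Document with Jsoup.connect(...).get() instead of parsing a string.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class LinkFilter {

    // Hypothetical block list; adjust to the domains you want to ignore.
    private static final List<String> BLOCKED =
            Arrays.asList("facebook.com", "twitter.com", "google.com");

    // True when the URL's host is a blocked domain or one of its subdomains.
    static boolean isBlocked(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            for (String domain : BLOCKED) {
                if (host.equals(domain) || host.endsWith("." + domain)) {
                    return true;
                }
            }
            return false;
        } catch (URISyntaxException e) {
            return false; // keep links whose URL cannot be parsed
        }
    }

    // Returns the absolute URLs of all links that are not blocked.
    static List<String> filteredLinks(Document doc) {
        List<String> kept = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            String url = link.absUrl("href"); // empty when the href cannot be resolved
            if (!url.isEmpty() && !isBlocked(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://www.facebook.com/yahoo\">fb</a>"
                + "<a href=\"https://twitter.com/yahoo\">tw</a>"
                + "<a href=\"/news\">news</a>"
                + "<a href=\"https://mail.google.com/\">mail</a>";
        Document doc = Jsoup.parse(html, "https://www.yahoo.com/");
        System.out.println(filteredLinks(doc));
    }
}
```

Comparing hosts rather than searching the whole URL for "google" avoids dropping unrelated links that merely mention one of those names in their path.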
How can I extract data about a company from Facebook using jsoup?
Can you please give sample code?
Please let me know if jsoup works for a post site.
How can I get content from a post site?
What is “post site”?
I would like to fetch the HTML content of this particular site: http://www.intertraffic.com/amsterdam/exhibitors/
JSoup works surprisingly well for parsing XML content too. We had used it for a project last year.
Thanks for the feedback on Jsoup. It will help others as well.