Jsoup HTML Parser Example

Jsoup is to HTML what XML parsers are to XML: it parses HTML, including messy real-world HTML. Its jQuery-like selector syntax is easy to use and flexible enough to extract almost any part of a page. In this tutorial, we will go through several examples of Jsoup.

Table of Contents

What all you can achieve with Jsoup?
Runtime Dependencies
Main classes you should know
Loading a Document
Get title from HTML
Get Fav icon of HTML page
Get all links in HTML page
Get all images in HTML page
Get meta information of URL
Get form attributes in html page
Update attributes/content of elements
Sanitize untrusted HTML (to prevent XSS)

What all you can achieve with Jsoup?

jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do. With it, you can:

  1. scrape and parse HTML from a URL, file, or string
  2. find and extract data, using DOM traversal or CSS selectors
  3. manipulate the HTML elements, attributes, and text
  4. clean user-submitted content against a safe whitelist, to prevent XSS attacks
  5. output tidy HTML
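
As a rough sketch of how these capabilities fit together, the class below parses an HTML string, finds elements with CSS selectors, changes them, and returns the resulting markup. The class name JsoupOverview, its method names, and the example.com URL are made up for illustration; only the Jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only to illustrate the capabilities listed above.
class JsoupOverview {

    // Parses the HTML, adds a CSS class to every <p>, and returns the body markup.
    static String highlightParagraphs(String html) {
        Document document = Jsoup.parse(html);          // parse
        document.select("p").addClass("highlight");     // find and manipulate
        return document.body().html();                  // output
    }

    // Returns the href of the first link, or an empty string when there is no link.
    static String firstLinkTarget(String html) {
        Element link = Jsoup.parse(html).select("a[href]").first();
        return link != null ? link.attr("href") : "";
    }

    public static void main(String[] args) {
        String html = "<p>Visit <a href='https://example.com'>Example</a></p>";
        System.out.println(highlightParagraphs(html));
        System.out.println(firstLinkTarget(html));
    }
}
```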

Runtime Dependencies

You can include the Jsoup jar in your project using the Maven dependency below.

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.2</version>
</dependency>

Alternatively, download jsoup-1.8.2.jar directly from the jsoup.org website and add it to the project’s lib folder.

Main classes you should know

Although the complete library contains many classes, you will mostly deal with the three classes below. Let’s look at them.

  1. org.jsoup.Jsoup

    The Jsoup class is the entry point for any program and provides methods for loading and parsing HTML documents from a variety of sources.

    Some important methods of Jsoup class are given below:

    Method | Description
    static Connection connect(String url) | Creates and returns a connection to the given URL.
    static Document parse(File in, String charsetName) | Parses the file, read with the specified charset, into a Document.
    static Document parse(String html) | Parses the given HTML code into a Document.
    static String clean(String bodyHtml, Whitelist whitelist) | Returns safe HTML from the input HTML, by parsing the input and filtering it through a whitelist of permitted tags and attributes.

  2. org.jsoup.nodes.Document

    This class represents an HTML document loaded through the Jsoup library. Use it to perform operations that apply to the whole HTML document.

    The important methods of the Document class are listed at https://jsoup.org/apidocs/org/jsoup/nodes/Document.html.

  3. org.jsoup.nodes.Element

    An HTML element consists of a tag name, attributes, and child nodes. Using the Element class, you can extract data, traverse the node graph, and manipulate the HTML.

    The important methods of the Element class are listed at https://jsoup.org/apidocs/org/jsoup/nodes/Element.html.

Now let’s look at some examples of working with HTML documents using the Jsoup APIs.

Loading a Document

Load a document from URL

Use Jsoup.connect() method to load HTML from a URL.

try 
{
	// connect() needs an absolute http or https URL
	Document document = Jsoup.connect("https://howtodoinjava.com").get();
	System.out.println(document.title());
} 
catch (IOException e) 
{
	e.printStackTrace();
}  

Load a document from File

Use Jsoup.parse() method to load HTML from a file.

try 
{
	Document document = Jsoup.parse( new File( "c:/temp/demo.html" ) , "utf-8" );
	System.out.println(document.title());
} 
catch (IOException e) 
{
	e.printStackTrace();
}  

Load a document from String

Use Jsoup.parse() method to load HTML from a string.

String html = "<html><head><title>First parse</title></head>"
		+ "<body><p>Parsed HTML into a doc.</p></body></html>";
// Jsoup.parse(String) does no I/O and throws no checked exception,
// so no try/catch for IOException is needed (it would not even compile here).
Document document = Jsoup.parse(html);
System.out.println(document.title());

Get title from HTML

As shown above, call document.title() method to get the title of HTML page.

try 
{
	Document document = Jsoup.parse( new File("C:/Users/xyz/Desktop/howtodoinjava.html"), "utf-8");
	System.out.println(document.title());
} 
catch (IOException e) 
{
	e.printStackTrace();
}  

Get Fav icon of HTML page

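Assuming that the favicon is the first <link> in the <head> section whose href points to an .ico or .png file, you can use the code below. If no such link exists, the code falls back to a <meta itemprop="image"> tag.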

String favImage = "Not Found";
try {
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
	if (element == null) 
	{
		element = document.head().select("meta[itemprop=image]").first();
		if (element != null) 
		{
			favImage = element.attr("content");
		}
	} 
	else 
	{
		favImage = element.attr("href");
	}
} 
catch (IOException e) 
{
	e.printStackTrace();
}
System.out.println(favImage);

Get all links in HTML page

To get all links present in a webpage, use the code below.

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Elements links = document.select("a[href]");  
	for (Element link : links) 
	{
		System.out.println("link : " + link.attr("href"));
		System.out.println("text : " + link.text());
	}
} 
catch (IOException e) 
{
	e.printStackTrace();
}
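
When a page is parsed from a file or a string, relative links such as /about stay relative, because Jsoup has no base URL to resolve them against. A minimal sketch of one way around this, assuming you know the URL the page came from: pass it as the base URI to Jsoup.parse() and read each link with absUrl(). The class name AbsoluteLinks and the example.com URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only.
class AbsoluteLinks {

    // Returns every link target in the HTML, resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            // absUrl("href") is equivalent to attr("abs:href")
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='/about'>About</a> <a href='https://example.org/x'>X</a>";
        for (String url : absoluteLinks(html, "https://example.com/")) {
            System.out.println(url);
        }
    }
}
```

For files, the three-argument overload Jsoup.parse(File in, String charsetName, String baseUri) plays the same role.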

Get all images in HTML page

To get all images displayed in a webpage, use below code.

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
	for (Element image : images) 
	{
		System.out.println("src : " + image.attr("src"));
		System.out.println("height : " + image.attr("height"));
		System.out.println("width : " + image.attr("width"));
		System.out.println("alt : " + image.attr("alt"));
	}
} 
catch (IOException e) 
{
	e.printStackTrace();
}

Get meta information of URL

Meta information is what search engines, such as Google, use to determine the content of a webpage for indexing purposes. It is present in the form of <meta> tags in the HEAD section of the HTML page. To get meta information about a webpage, use the code below.

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	
	String description = document.select("meta[name=description]").get(0).attr("content");
	System.out.println("Meta description : " + description);

	String keywords = document.select("meta[name=keywords]").first().attr("content");
	System.out.println("Meta keyword : " + keywords);
} 
catch (IOException e) 
{
	e.printStackTrace();
}
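
Note that the code above assumes both meta tags exist. On a page without them, get(0) throws an IndexOutOfBoundsException, and first() returns null, so the following attr() call fails with a NullPointerException. A minimal null-safe sketch is shown below; the class name MetaReader, the method name, and the fallback parameter are hypothetical and not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only.
class MetaReader {

    // Returns the content of <meta name="..."> or the fallback when the tag is absent.
    static String metaContent(String html, String name, String fallback) {
        Document document = Jsoup.parse(html);
        Element meta = document.select("meta[name=" + name + "]").first();
        return meta != null ? meta.attr("content") : fallback;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name='description' content='A demo page'></head></html>";
        System.out.println("Meta description : " + metaContent(html, "description", "Not Found"));
        System.out.println("Meta keyword : " + metaContent(html, "keywords", "Not Found"));
    }
}
```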

Get form attributes in html page

Getting the input elements of a form in a webpage is very simple. Find the FORM element using its unique id, and then find all INPUT elements present in that form.

Document doc = Jsoup.parse(new File("c:/temp/howtodoinjava.com"),"utf-8");  
Element formElement = doc.getElementById("loginForm");  

Elements inputElements = formElement.getElementsByTag("input");  
for (Element inputElement : inputElements) {  
    String key = inputElement.attr("name");  
    String value = inputElement.attr("value");  
    System.out.println("Param name: "+key+" \nParam value: "+value);  
} 
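
If you need the form fields as data rather than console output, they can be collected into a map. This is only a sketch: the class name FormFields and the sample form are made up, inputs without a name attribute are skipped, and doc.getElementById() returns null when no element has the given id, so that case is checked first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only.
class FormFields {

    // Maps each named <input> of the form with the given id to its value, in document order.
    static Map<String, String> inputValues(String html, String formId) {
        Map<String, String> values = new LinkedHashMap<>();
        Element form = Jsoup.parse(html).getElementById(formId);
        if (form == null) {
            return values; // no such form: return an empty map instead of failing
        }
        for (Element input : form.select("input[name]")) {
            values.put(input.attr("name"), input.val());
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<form id='loginForm'>"
                + "<input name='user' value='alice'>"
                + "<input name='token' value='abc'>"
                + "<input type='submit'></form>";
        System.out.println(inputValues(html, "loginForm"));
    }
}
```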

Update attributes/content of elements

Once you have found the desired elements using the approaches above, you can use the Jsoup APIs to update their attributes or inner HTML. For example, suppose I want to add rel="nofollow" to all links present inside the document.

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Elements links = document.select("a[href]");  
	links.attr("rel", "nofollow");
} 
catch (IOException e) 
{
	e.printStackTrace();
}
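
The change above only affects the in-memory document. To see or store the result, serialize it again, for example with html() or outerHtml(). A small sketch, using a hypothetical class name NofollowLinks and an inline HTML string instead of a file:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper, for illustration only.
class NofollowLinks {

    // Adds rel="nofollow" to every <a> that has an href and returns the updated body markup.
    static String addNofollow(String html) {
        Document document = Jsoup.parse(html);
        document.select("a[href]").attr("rel", "nofollow");
        return document.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href='https://example.com'>Example</a> <a name='top'>Anchor</a>";
        System.out.println(addNofollow(html));
    }
}
```

Only the first link is changed here, because the selector a[href] skips anchors that have no href attribute.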

Sanitize untrusted HTML (to prevent XSS)

Suppose your application displays HTML snippets submitted by users, e.g. users may put HTML content in a comment box. This can lead to a very serious problem if you display this HTML directly without cleaning it first. A user can put a malicious script in it and redirect your users to another, harmful website.

To clean this HTML, Jsoup provides the Jsoup.clean() method. This method expects the HTML content as a String and returns the cleaned HTML. To perform this task, Jsoup uses a whitelist sanitizer. The jsoup whitelist sanitizer works by parsing the input HTML (in a safe, sandboxed environment), and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output.

It does not use regular expressions, which are inappropriate for this task.

The cleaner is useful not only for avoiding XSS, but also for limiting the range of elements the user can provide: you may be OK with textual elements such as a and strong, but not with structural elements such as div or table.

String dirtyHTML = "<p><a href='https://howtodoinjava.com/' onclick='sendCookiesToMe()'>Link</a></p>";

String cleanHTML = Jsoup.clean(dirtyHTML, Whitelist.basic());

System.out.println(cleanHTML);

Output:

<p><a href="https://howtodoinjava.com/" rel="nofollow">Link</a></p>
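
To limit users to textual elements only, as described above, you can build your own whitelist instead of using Whitelist.basic(). The sketch below starts from Whitelist.none() and allows just a and strong; the class name CommentSanitizer and the choice of tags and protocols are assumptions for illustration. It targets the jsoup version used in this tutorial; newer jsoup releases rename the Whitelist class to Safelist, so adjust the import if you use one of those.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical helper, for illustration only.
class CommentSanitizer {

    // Allows only <a href> (http/https) and <strong>; every other tag and attribute is removed.
    static final Whitelist INLINE_ONLY = Whitelist.none()
            .addTags("a", "strong")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "http", "https");

    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, INLINE_ONLY);
    }

    public static void main(String[] args) {
        String dirty = "<div><strong>Hi</strong> "
                + "<a href='https://example.com' onclick='steal()'>link</a>"
                + "<script>alert(1)</script></div>";
        System.out.println(sanitize(dirty));
    }
}
```

The text inside disallowed elements such as div is kept; only the tags themselves, and any attributes not on the list, are dropped.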

That’s all for this easy yet powerful and useful library. Drop me your questions in the comments section.

Happy Learning !!


9 thoughts on “Jsoup HTML Parser Example”

  1. How can I read forms or elements from HTML files and then put those HTML into a JSP file on the fly? Let me explain more.

    a1.html

    something in here

    ——————
    a2.html

    something more in here

    ——————
    a3.jsp

    a form

    ——————
    a4.jsp/html

    Put the info here after the login has been processed by a servlet, which will read a1.html and a2.html and set them as the body of this page.

    @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException
        {
            // process form
            Document doc1 = Jsoup.parse(new File("resrcs/a1.html"), "utf-8");
            Element something1 = doc1.getElementById("a1");
    
            Document doc2 = Jsoup.parse(new File("resrcs/a2.html"), "utf-8");
            Element something2 = doc2.getElementById("a2");
            something2.append(something1.html());
    
           // put something2 in a4.html/jsp
        }
    
  2. I want to filter the hyperlinks while parsing a URL. For example, I load the page with Jsoup.connect("https://www.yahoo.com").get();
    and get all hyperlinks with Elements links = doc.select("a[href]"); but I want to ignore URLs related to Facebook, Twitter, and Google. Please help me with the logic in Java with Jsoup.

    • I know it is a late answer but here you go

      Elements links = doc.select("a[href]");
      for (Element link : links) {
          String href = link.attr("href");
          if (!href.contains(".facebook") && !href.contains(".twitter") && !href.contains(".google")) {
              // do your modification with the link
          }
      }
      
