HowToDoInJava

Jsoup HTML Parser Example

Jsoup is to HTML what XML parsers are to XML: it parses HTML, including messy real-world HTML. Its jQuery-like selector syntax is easy to use and flexible enough to get the desired result. In this tutorial, we will go through many examples of Jsoup.

Table of Contents

What all you can achieve with Jsoup?
Runtime Dependencies
Main classes you should know
Loading a Document
Get title from HTML
Get Fav icon of HTML page
Get all links in HTML page
Get all images in HTML page
Get meta information of URL
Get form attributes in html page
Update attributes/content of elements
Sanitize untrusted HTML (to prevent XSS)

What all you can achieve with Jsoup?

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. You can use it to:

  1. scrape and parse HTML from a URL, file, or string
  2. find and extract data, using DOM traversal or CSS selectors
  3. manipulate the HTML elements, attributes, and text
  4. clean user-submitted content against a safe white-list, to prevent XSS attacks
  5. output tidy HTML

Runtime Dependencies

Add Jsoup to your project with the following Maven dependency.

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.2</version>
</dependency>

Alternatively, download jsoup-1.8.2.jar directly from the jsoup.org website and add it to your project’s lib folder.

Main classes you should know

Although the library contains many classes, you will mostly work with the three classes below. Let’s look at them.

  1. org.jsoup.Jsoup

    The Jsoup class is the entry point for any program. It provides methods for loading and parsing HTML documents from a variety of sources.

    Some important methods of Jsoup class are given below:

    Method | Description
    static Connection connect(String url) | Creates and returns a connection to the given URL.
    static Document parse(File in, String charsetName) | Parses the file, read with the specified charset, into a document.
    static Document parse(String html) | Parses the given HTML code into a document.
    static String clean(String bodyHtml, Whitelist whitelist) | Returns safe HTML from the input HTML by parsing it and filtering it through a white-list of permitted tags and attributes.

  2. org.jsoup.nodes.Document

    This class represents an HTML document loaded through the Jsoup library. Use it to perform operations that apply to the whole HTML document.

    The important methods of the Document class are listed at https://jsoup.org/apidocs/org/jsoup/nodes/Document.html.

  3. org.jsoup.nodes.Element

    An HTML element consists of a tag name, attributes, and child nodes. Using the Element class, you can extract data, traverse the node graph, and manipulate the HTML.

    The important methods of the Element class are listed at https://jsoup.org/apidocs/org/jsoup/nodes/Element.html.

Now let’s look at some examples to work with HTML documents using Jsoup APIs.

Loading a Document

Load a document from URL

Use Jsoup.connect() method to load HTML from a URL.

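import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

try 
{
	// connect() needs an absolute http or https URL
	Document document = Jsoup.connect("https://howtodoinjava.com").get();
	System.out.println(document.title());
} 
catch (IOException e) 
{
	e.printStackTrace();
}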
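Real pages often need a few request settings before they can be fetched reliably. The following is a minimal sketch of configuring the connection before calling get(); the class name ConfiguredFetch, the user agent string, and the 10-second timeout are illustrative assumptions, not values from this tutorial.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ConfiguredFetch {

    // Builds a connection without sending any request yet.
    // The user agent string and the timeout are illustrative values.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup tutorial example)")
                .timeout(10_000)          // milliseconds
                .followRedirects(true);
    }

    public static void main(String[] args) {
        try {
            // The request is only sent when get() is called.
            Document document = configure("https://howtodoinjava.com").get();
            System.out.println(document.title());
        } catch (IOException e) {
            System.err.println("Could not fetch page: " + e.getMessage());
        }
    }
}
```

Keeping the configuration in its own method makes it easy to reuse the same settings for every page you fetch.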

Load a document from File

Use Jsoup.parse() method to load HTML from a file.

try 
{
	Document document = Jsoup.parse( new File( "c:/temp/demo.html" ) , "utf-8" );
	System.out.println(document.title());
} 
catch (IOException e) 
{
	e.printStackTrace();
}  

Load a document from String

Use Jsoup.parse() method to load HTML from a string.

// Jsoup.parse(String) performs no I/O, so there is no IOException to catch
String html = "<html><head><title>First parse</title></head>"
		+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document document = Jsoup.parse(html);
System.out.println(document.title());

Get title from HTML

As shown above, call document.title() method to get the title of HTML page.

try 
{
	Document document = Jsoup.parse( new File("C:/Users/xyz/Desktop/howtodoinjava.html"), "utf-8");
	System.out.println(document.title());
} 
catch (IOException e) 
{
	e.printStackTrace();
}  

Get Fav icon of HTML page

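Assuming that the favicon is the first <link> in the <head> section whose href points to an .ico or .png file, you can use the code below. If no such link exists, the code falls back to the <meta itemprop="image"> tag.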

String favImage = "Not Found";
try {
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
	if (element == null) 
	{
		element = document.head().select("meta[itemprop=image]").first();
		if (element != null) 
		{
			favImage = element.attr("content");
		}
	} 
	else 
	{
		favImage = element.attr("href");
	}
} 
catch (IOException e) 
{
	e.printStackTrace();
}
System.out.println(favImage);
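Many pages declare their icon through the rel attribute of the <link> tag rather than through a telltale file extension. As an alternative sketch, the lookup can match on rel instead; the class name FaviconByRel and the sample markup are hypothetical and only serve the demonstration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FaviconByRel {

    // Returns the href of the first <link> whose rel attribute mentions "icon"
    // (for example rel="icon" or rel="shortcut icon"), or null if none exists.
    static String findFavicon(Document document) {
        Element link = document.head().select("link[rel~=(?i)icon][href]").first();
        return link == null ? null : link.attr("href");
    }

    public static void main(String[] args) {
        // Hypothetical markup used only for this demonstration.
        String html = "<html><head><title>Demo</title>"
                + "<link rel=\"stylesheet\" href=\"/style.css\">"
                + "<link rel=\"shortcut icon\" href=\"/favicon.ico\">"
                + "</head><body></body></html>";
        System.out.println(findFavicon(Jsoup.parse(html)));
    }
}
```

The case-insensitive regex also matches values such as apple-touch-icon, so tighten the pattern if you only want the classic favicon.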

Get all links in HTML page

To get all links present in a webpage, use the code below.

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Elements links = document.select("a[href]");  
	for (Element link : links) 
	{
		System.out.println("link : " + link.attr("href"));
		System.out.println("text : " + link.text());
	}
} 
catch (IOException e) 
{
	e.printStackTrace();
}
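Links in a page are often relative (for example /java/), which is not very useful once they leave the page. A possible approach, sketched here, is to parse the HTML together with a base URI and ask each link for its absolute URL. The class name AbsoluteLinks, the sample markup, and the base URI are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinks {

    // Parses the HTML against a base URI and returns every link as an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            // absUrl() resolves the href against the base URI;
            // it returns an empty string if no absolute URL can be built.
            String url = link.absUrl("href");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/java/\">Java</a> <a href=\"https://jsoup.org/\">jsoup</a>";
        System.out.println(absoluteLinks(html, "https://howtodoinjava.com/"));
    }
}
```

When a document is loaded with Jsoup.connect(), the base URI is set from the fetched URL, so absUrl() can be used there without passing a base URI yourself.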

Get all images in HTML page

To get all images displayed in a webpage, use the code below.

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
	for (Element image : images) 
	{
		System.out.println("src : " + image.attr("src"));
		System.out.println("height : " + image.attr("height"));
		System.out.println("width : " + image.attr("width"));
		System.out.println("alt : " + image.attr("alt"));
	}
} 
catch (IOException e) 
{
	e.printStackTrace();
}

Get meta information of URL

Meta information is what search engines, such as Google, use to determine the content of a webpage for indexing. It is present as <meta> tags in the HEAD section of the HTML page. To get meta information about a webpage, use the code below.

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");

	// Elements.attr() reads from the first matching element and returns an
	// empty string when nothing matches, so a missing tag cannot cause an exception
	String description = document.select("meta[name=description]").attr("content");
	System.out.println("Meta description : " + description);

	String keywords = document.select("meta[name=keywords]").attr("content");
	System.out.println("Meta keyword : " + keywords);
} 
catch (IOException e) 
{
	e.printStackTrace();
}
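If you do not know in advance which meta tags a page carries, you can collect all of them in one pass. The sketch below gathers every <meta> tag with a content attribute into a map; the class name MetaTags and the sample markup are hypothetical, and falling back to the property attribute (as used by Open Graph tags) is a design choice of this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MetaTags {

    // Collects every <meta> tag that has a content attribute, keyed by its
    // name attribute or, failing that, by its property attribute.
    static Map<String, String> readMeta(Document document) {
        Map<String, String> meta = new LinkedHashMap<>();
        for (Element tag : document.select("meta[content]")) {
            String key = tag.hasAttr("name") ? tag.attr("name") : tag.attr("property");
            if (!key.isEmpty()) {
                meta.put(key, tag.attr("content"));
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // Hypothetical markup used only for this demonstration.
        String html = "<head>"
                + "<meta name=\"description\" content=\"Jsoup examples\">"
                + "<meta property=\"og:title\" content=\"Jsoup HTML Parser\">"
                + "<meta charset=\"utf-8\">"
                + "</head>";
        System.out.println(readMeta(Jsoup.parse(html)));
    }
}
```

Tags without a content attribute, such as the charset declaration, are skipped by the selector.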

Get form attributes in html page

Getting the input elements of a form in a webpage is very simple. Find the FORM element by its unique id, and then find all INPUT elements inside that form.

Document doc = Jsoup.parse(new File("c:/temp/howtodoinjava.com"),"utf-8");  
Element formElement = doc.getElementById("loginForm");  

Elements inputElements = formElement.getElementsByTag("input");  
for (Element inputElement : inputElements) {  
    String key = inputElement.attr("name");  
    String value = inputElement.attr("value");  
    System.out.println("Param name: "+key+" \nParam value: "+value);  
} 
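getElementById() returns null when no element carries the requested id, so a more defensive variant may be useful in practice. Here is a minimal sketch that returns the inputs as a map and tolerates a missing form; the class name FormInputs and the sample form are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FormInputs {

    // Returns name/value pairs for the named inputs of the form with the given id,
    // or an empty map when no such form exists.
    static Map<String, String> readInputs(Document document, String formId) {
        Map<String, String> params = new LinkedHashMap<>();
        Element form = document.getElementById(formId);
        if (form == null) {
            return params; // no element with that id in the document
        }
        for (Element input : form.select("input[name]")) {
            params.put(input.attr("name"), input.attr("value"));
        }
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical login form used only for this demonstration.
        String html = "<form id=\"loginForm\">"
                + "<input type=\"text\" name=\"username\" value=\"guest\">"
                + "<input type=\"password\" name=\"password\">"
                + "<input type=\"submit\" value=\"Login\">"
                + "</form>";
        System.out.println(readInputs(Jsoup.parse(html), "loginForm"));
    }
}
```

The input[name] selector skips inputs without a name, such as the submit button above, because they would not carry a parameter name.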

Update attributes/content of elements

Once you have found the desired element using the approaches above, you can use Jsoup APIs to update its attributes or innerHTML. For example, to add rel="nofollow" to all links inside the document:

try 
{
	Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/howtodoinjava.html"), "utf-8");
	Elements links = document.select("a[href]");  
	links.attr("rel", "nofollow");
} 
catch (IOException e) 
{
	e.printStackTrace();
}
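The same idea extends to several attributes and to element text. The following sketch chains two attribute updates and replaces the text of the first heading; the class name UpdateElements, the target="_blank" attribute, and the heading text are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class UpdateElements {

    // Marks every link as nofollow, opens it in a new tab,
    // and replaces the text of the first <h1>, if there is one.
    static Document update(String html) {
        Document document = Jsoup.parse(html);

        Elements links = document.select("a[href]");
        links.attr("rel", "nofollow").attr("target", "_blank"); // setters can be chained

        Element heading = document.select("h1").first();
        if (heading != null) {
            heading.text("Updated heading"); // replaces the element's text content
        }
        return document;
    }

    public static void main(String[] args) {
        Document document = update("<h1>Old heading</h1><p><a href=\"https://jsoup.org/\">jsoup</a></p>");
        // The changes only exist in memory; print or save the HTML to keep them.
        System.out.println(document.body().html());
    }
}
```

Note that the updates change the in-memory Document only; the source file or page is untouched until you write the HTML back yourself.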

Sanitize untrusted HTML (to prevent XSS)

Suppose your application displays HTML snippets submitted by users, for example HTML content entered in a comment box. This can lead to a very serious problem if you display this HTML directly without cleaning it first: a user can put a malicious script in it and redirect your users to another, malicious website.

To clean this HTML, Jsoup provides the Jsoup.clean() method. It expects the HTML content as a String and returns clean HTML. To perform this task, Jsoup uses a whitelist sanitizer. The jsoup whitelist sanitizer works by parsing the input HTML (in a safe, sand-boxed environment), and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output.

It does not use regular expressions, which are inappropriate for this task.

The cleaner is useful not only for avoiding XSS, but also in limiting the range of elements the user can provide: you may be OK with textual a, strong elements, but not structural div or table elements.

String dirtyHTML = "<p><a href='https://howtodoinjava.com/' onclick='sendCookiesToMe()'>Link</a></p>";

String cleanHTML = Jsoup.clean(dirtyHTML, Whitelist.basic());

System.out.println(cleanHTML);

Output:

<p><a href="https://howtodoinjava.com/" rel="nofollow">Link</a></p>
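To limit the range of elements users may submit, you can build your own whitelist instead of using a predefined one. The sketch below allows only a few inline formatting tags; the class name CommentSanitizer and the chosen tag list are an example policy, not a recommendation. It uses the Whitelist class to match the jsoup 1.8.2 version used in this tutorial; more recent jsoup releases rename this class to Safelist, so adjust the import if you use a newer version.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class CommentSanitizer {

    // Allows only a handful of inline formatting tags; all other tags
    // and all attributes are stripped from the input.
    static String sanitize(String untrustedHtml) {
        Whitelist allowed = Whitelist.none().addTags("b", "i", "em", "strong");
        return Jsoup.clean(untrustedHtml, allowed);
    }

    public static void main(String[] args) {
        // Hypothetical user comment containing a structural tag and a script.
        String dirty = "<div><strong>Nice post</strong><script>alert('xss')</script></div>";
        System.out.println(sanitize(dirty));
    }
}
```

Starting from Whitelist.none() and adding tags one by one keeps the policy explicit: anything you did not list is removed.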

That’s all for this very easy yet very powerful and useful library. Drop me your questions in comments section.

Happy Learning !!

About Lokesh Gupta

A family guy with fun loving nature. Love computers, programming and solving everyday problems. Find me on Facebook and Twitter.

Feedback, Discussion and Comments

  1. Hamidur Rahman

    May 30, 2019

    How can I read forms or elements from HTML files and then put those HTML into a JSP file on the fly? Let me explain more.

    a1.html

    something in here

    ——————
    a2.html

    something more in here

    ——————
    a3.jsp

    a form

    ——————
    a4.jsp/html

    put the info here after login has been processed from a servlet which will read a1.html and a2.html set the body this HTML as them.

    @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException
        {
            // process form
            Document doc1 = Jsoup.parse(new File("resrcs/a1.html"), "utf-8");
            Element something1 = doc1.getElementById("a1");
    
            Document doc2 = Jsoup.parse(new File("resrcs/a2.html"), "utf-8");
            Element something2 = doc2.getElementById("a2");
            something2.append(something1.html());
    
           // put something2 in a4.html/jsp
        }
    
  2. saurabh

    January 12, 2018

    I want to filter the hyperlinks while parsing a URL. For example, I load the page with Jsoup.connect("https://www.yahoo.com").get();
    and get all hyperlinks with Elements links = doc.select("a[href]"); but I want to ignore URLs related to Facebook, Twitter, and Google. Please help me with the logic in Java with Jsoup.

    • Aisha

      January 9, 2019

      I know it is a late answer but here you go

      Elements links = doc.select("a[href]");
      for (Element link : links)
      {
          if (!link.attr("href").contains(".facebook") && !link.attr("href").contains(".twitter") && !link.attr("href").contains(".google")) {
              // do your modification with the link
          }
      }
      
  3. Sunita Narwade

    October 13, 2016

    How to extract data about any company from facebook using jsoup…..?
    can you please give sample code…..

  4. bindu

    September 1, 2016

    Please let me know if jsoup works for post site??
    How can i get content from a post site?

    • Lokesh Gupta

      September 1, 2016

      What is “post site”?

      • bindu

        September 1, 2016

        I would like to fetch the HTML content of this particular site: http://www.intertraffic.com/amsterdam/exhibitors/

  5. Vignesh

    July 29, 2015

    JSoup works surprisingly well for parsing XML content too. We had used it for a project last year.

    • Lokesh Gupta

      July 29, 2015

      Thanks for the feedback on Jsoup. It will help others as well.

Copyright © 2020 · HowToDoInjava.com · All Rights Reserved. | Sitemap
