Parsing and Extracting HTML with Jsoup

Jsoup is to HTML what XML parsers are to XML: it parses HTML into a document tree. Its jQuery-like selector syntax is easy to use and flexible enough to extract almost any part of a page.

1. Introduction to Jsoup

jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers.

With jsoup, we can:

- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safelist (called a whitelist in older versions) to prevent XSS attacks
- output tidy HTML

2. Maven Dependencies

We can add Jsoup to our project by declaring its dependency from the Maven repository. Check Maven Central for the latest version.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.2</version>
</dependency>

3. Core Classes

Although the Jsoup library has many classes, we will mostly deal with the following three.

3.1. `org.jsoup.Jsoup`

It is the entry point for using Jsoup and provides methods for loading and parsing HTML documents from a variety of sources.

Some important methods of the Jsoup class are given below:

- static Connection connect(String url): creates and returns a connection to the given URL.
- static Document parse(File in, String charsetName): parses the file, read with the specified charset, into a document.
- static Document parse(String html): parses the given HTML code into a document.
- static String clean(String bodyHtml, Safelist safelist): returns safe HTML from the input HTML, by parsing it and filtering it through a safelist of permitted tags and attributes. In jsoup versions before 1.14.1 the second parameter type is named Whitelist.

3.2. `org.jsoup.nodes.Document`

This class represents an HTML document loaded through the Jsoup library. You can use this class to perform operations that should be applicable to the whole HTML document.

3.3. `org.jsoup.nodes.Element`

An HTML element consists of a tag name, attributes, and child nodes. Using the Element class, you can extract data, traverse the node graph, and manipulate the HTML.
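As a rough illustration of those three parts, the sketch below reads the tag name, an attribute, the child elements, and the text of one element. The markup and the class name ElementDemo are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDemo {

    // Hypothetical sample markup, used only for this illustration
    static final String HTML =
        "<div id=\"card\" class=\"box\"><h2>Title</h2><p>Hello <b>world</b></p></div>";

    // Parses the sample markup and returns the element with id "card"
    static Element card() {
        Document document = Jsoup.parse(HTML);
        return document.getElementById("card");
    }

    public static void main(String[] args) {
        Element card = card();
        System.out.println(card.tagName());          // tag name of the element
        System.out.println(card.attr("class"));      // value of the class attribute
        System.out.println(card.children().size());  // number of direct child elements
        System.out.println(card.text());             // combined text of the element and its descendants
    }
}
```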

4. Loading an HTML Document

4.1. From a URL

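Use the Jsoup.connect() method to load HTML from a URL. The URL must include the protocol (http or https), and get() throws an IOException if the request fails.

Document document = Jsoup.connect("https://howtodoinjava.com").get();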
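In practice it is often useful to configure the connection before fetching. The following is a minimal sketch, assuming a 10 second timeout and a made-up user agent string; the class name FetchDemo and the helper prepare() are hypothetical. Building the Connection does not touch the network, only get() does.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchDemo {

    // Builds a configured connection; no request is sent until get() is called
    static Connection prepare(String url) {
        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (compatible; DemoBot/1.0)") // assumed value
            .timeout(10_000)                                    // milliseconds
            .followRedirects(true);
    }

    public static void main(String[] args) throws IOException {
        Document document = prepare("https://howtodoinjava.com").get();
        System.out.println(document.title());
    }
}
```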

4.2. From a File

Pass a File object and the charset name to the Jsoup.parse() method to load HTML from a file.

Document document = Jsoup.parse( new File( "c:/temp/demo.html" ) , "utf-8" );
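To try this without depending on a fixed path, the sketch below writes a small HTML page to a temporary file and parses it back. The class name FileParseDemo, the helper names, and the sample markup are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FileParseDemo {

    // Parses the file at the given path as UTF-8 and returns the page title
    static String titleOf(Path path) throws IOException {
        Document document = Jsoup.parse(path.toFile(), "UTF-8");
        return document.title();
    }

    // Writes a sample page to a temporary file, parses it, then deletes the file
    static String demo() {
        String html = "<html><head><title>First parse</title></head>"
                    + "<body><p>Parsed HTML into a doc.</p></body></html>";
        try {
            Path tmp = Files.createTempFile("demo", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return titleOf(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```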

4.3. From a String

The Jsoup.parse() method can also load HTML from a string.

String html = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";

Document document = Jsoup.parse(html);

5. Extracting Information from HTML

5.1. Get Title from HTML

Use the document.title() method to get the title of an HTML page.

Document document = Jsoup.parse( ... );
System.out.println( document.title() );

5.2. Get Favicon

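Assuming that the favicon is declared by the first <link> in the <head> section whose href points to an .ico or .png file, we can use the code below. It falls back to a <meta itemprop="image"> tag when no such link exists.

Document document = Jsoup.parse(...);

// Holds the favicon URL; stays null if none is found
String favImage = null;

// Look for a <link> in <head> whose href contains .ico or .png
Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
if (element == null)
{
  // Fall back to the image declared in a <meta itemprop="image"> tag
  element = document.head().select("meta[itemprop=image]").first();
  if (element != null)
  {
    favImage = element.attr("content");
  }
}
else
{
  favImage = element.attr("href");
}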

5.3. Get All Links

To get all links present in a webpage, use the below code.

Document document = Jsoup.parse(...);
Elements links = document.select("a[href]");
for (Element link : links)
{
  System.out.println("link : " + link.attr("href"));
  System.out.println("text : " + link.text());
}
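Links on a page are often relative. A minimal sketch, assuming we know the page's base URI, is to pass that base URI to Jsoup.parse() and read each link with absUrl(). The class name AbsoluteLinksDemo, the sample markup, and the example.com base URI are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinksDemo {

    // Hypothetical markup with one relative and one absolute link
    static final String HTML =
        "<a href=\"/java/\">Java</a><a href=\"https://jsoup.org/\">jsoup</a>";

    // Returns every link target resolved against the given base URI
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : absoluteLinks(HTML, "https://example.com/blog/")) {
            System.out.println(url);
        }
    }
}
```

Without a base URI, absUrl() returns an empty string for relative links, so the two-argument parse() matters here.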

5.4. Get All Images

To get all images displayed on a webpage, use the below code.

Document document = Jsoup.parse(...);
// Match <img> tags whose src contains .png, .jpg, .jpeg or .gif (case-insensitive)
Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
for (Element image : images)
{
  System.out.println("src : " + image.attr("src"));
  System.out.println("height : " + image.attr("height"));
  System.out.println("width : " + image.attr("width"));
  System.out.println("alt : " + image.attr("alt"));
}

5.5. Get Meta Information of URL

Meta information is what search engines, such as Google, use to determine the content of a webpage for indexing purposes. It is present in the form of <meta> tags in the HEAD section. To get the meta information of a webpage, use the code below.

Document document = Jsoup.parse(...);

// selectFirst() returns null when no element matches, so check before reading
Element descriptionTag = document.selectFirst("meta[name=description]");
String description = (descriptionTag != null) ? descriptionTag.attr("content") : "";
System.out.println("Meta description : " + description);

Element keywordsTag = document.selectFirst("meta[name=keywords]");
String keywords = (keywordsTag != null) ? keywordsTag.attr("content") : "";
System.out.println("Meta keyword : " + keywords);
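When more than a couple of meta tags are of interest, one possible approach is to collect every name/content pair into a map. This is only a sketch; the class name MetaTagsDemo and the sample markup are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MetaTagsDemo {

    // Hypothetical page head, used only for this illustration
    static final String HTML =
        "<html><head>"
      + "<meta charset=\"utf-8\">"
      + "<meta name=\"description\" content=\"Jsoup tutorial\">"
      + "<meta name=\"keywords\" content=\"java, html\">"
      + "</head><body></body></html>";

    // Collects all <meta> tags that carry both a name and a content attribute
    static Map<String, String> metaTags(String html) {
        Document document = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>();
        for (Element meta : document.select("meta[name][content]")) {
            result.put(meta.attr("name"), meta.attr("content"));
        }
        return result;
    }

    public static void main(String[] args) {
        metaTags(HTML).forEach((name, content) ->
            System.out.println(name + " : " + content));
    }
}
```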

5.6. Get Form Attributes

Getting the input elements of a form in a webpage is simple. Find the FORM element using its unique id, and then find all INPUT elements present in that form.

Document doc = Jsoup.parse(...);
Element formElement = doc.getElementById("loginForm");

// getElementById() returns null if no element has that id
if (formElement != null) {
    Elements inputElements = formElement.getElementsByTag("input");
    for (Element inputElement : inputElements) {
        String key = inputElement.attr("name");
        String value = inputElement.attr("value");
        System.out.println("Param name: " + key + " \nParam value: " + value);
    }
}
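As an alternative to reading the INPUT elements by hand, jsoup's FormElement class offers formData(), which returns the fields a browser would submit. The sketch below assumes a simple login form; the class name FormDataDemo, the markup, and the field values are made up.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

class FormDataDemo {

    // Hypothetical login form, used only for this illustration
    static final String HTML =
        "<form id=\"loginForm\" action=\"/login\" method=\"post\">"
      + "<input type=\"text\" name=\"username\" value=\"admin\">"
      + "<input type=\"hidden\" name=\"token\" value=\"abc123\">"
      + "</form>";

    // Returns the submittable fields of the form with the given id, in document order
    static Map<String, String> formParams(String html, String formId) {
        Document document = Jsoup.parse(html);
        Map<String, String> params = new LinkedHashMap<>();
        List<FormElement> forms = document.select("form#" + formId).forms();
        if (forms.isEmpty()) {
            return params; // no such form on the page
        }
        for (Connection.KeyVal field : forms.get(0).formData()) {
            params.put(field.key(), field.value());
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(formParams(HTML, "loginForm"));
    }
}
```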

5.7. Update HTML Attributes/Content

Once we have found the desired elements using the above approaches, we can use the Jsoup APIs to update their attributes or inner HTML. For example, the code below adds rel="nofollow" to all links in the document.

Document document = Jsoup.parse(...);

Elements links = document.select("a[href]");  
links.attr("rel", "nofollow");
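The same pattern works for content. The following sketch, with made-up markup and the hypothetical class name UpdateDemo, sets an attribute on every link and replaces the inner HTML of a paragraph, then prints the modified body.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UpdateDemo {

    // Hypothetical markup, used only for this illustration
    static final String HTML =
        "<p class=\"note\">old text</p><a href=\"https://example.com/\">Example</a>";

    // Parses the markup, updates attributes and inner HTML, and returns the document
    static Document update(String html) {
        Document document = Jsoup.parse(html);
        document.select("a[href]").attr("rel", "nofollow");     // set attribute on all links
        document.select("p.note").html("<em>updated</em>");     // replace inner HTML
        return document;
    }

    public static void main(String[] args) {
        System.out.println(update(HTML).body().html());
    }
}
```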

5.8. Sanitize Untrusted HTML (to prevent XSS)

Suppose your application displays HTML snippets submitted by users, for example HTML content entered in a comment box. This can lead to a very serious problem if you display this HTML directly without cleaning it first: users can embed a malicious script in it and, for instance, redirect your visitors to another website.

To clean this HTML, Jsoup provides the Jsoup.clean() method. This method expects the HTML content as a String and returns the cleaned HTML. To perform the cleanup, Jsoup uses a safelist sanitizer (the Safelist class, which replaced the older Whitelist class in jsoup 1.14.1; Whitelist is no longer available in 1.15.x). The sanitizer works by parsing the input HTML (in a safe, sandboxed environment), then iterating through the parse tree and allowing only known-safe tags and attributes (and values) through into the cleaned output.

It does not use regular expressions, which are inappropriate for this task.

The cleaner is useful for avoiding XSS and limiting the range of elements the user can provide: you may be OK with textual anchor and strong elements, but not structural div or table elements.

String dirtyHTML = "<p><a href='https://howtodoinjava.com/' onclick='sendCookiesToMe()'>Link</a></p>";

// Safelist replaces the Whitelist class of older jsoup versions
String cleanHTML = Jsoup.clean(dirtyHTML, Safelist.basic());

System.out.println(cleanHTML);

Output:

<p><a href="https://howtodoinjava.com/" rel="nofollow">Link</a></p>
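How strict the cleanup is depends on the safelist that is passed in. As a small sketch, assuming jsoup 1.14.1 or later (where the class is named Safelist), the code below strips all markup with Safelist.none() and uses Jsoup.isValid() to check whether input already conforms to Safelist.basic(). The class name SanitizeDemo and the sample input are made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SanitizeDemo {

    // Hypothetical user input containing a script, used only for this illustration
    static final String DIRTY = "<p>Hi <b>there</b><script>alert('x')</script></p>";

    // Removes every tag and keeps only the text content
    static String textOnly(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Returns true only if the input contains nothing outside the basic safelist
    static boolean isSafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(textOnly(DIRTY));
        System.out.println(isSafe(DIRTY));
        System.out.println(isSafe("<p>Hi <b>there</b></p>"));
    }
}
```

Checking with isValid() is useful for reporting a validation error to the user, but the output of clean() is what should be stored or displayed.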

6. Conclusion

In this Java tutorial, we learned the basics of the Jsoup library, which is used as an HTML parser. We saw how to load HTML documents, extract specific information from them, update elements, and sanitize untrusted HTML.

Happy Learning !!
