How to Unescape HTML in Java

Learn to unescape HTML entities in a Java String. The examples convert an escaped HTML string into a string containing the actual Unicode characters that the escapes represent.

1. Using StringEscapeUtils.unescapeHtml4()

The StringEscapeUtils class is part of the Apache Commons Text library. Import its latest version from the Maven repository.

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.10.0</version>
</dependency>

The unescapeHtml4() method:

  • takes the escaped string as a parameter and returns null if the argument is null.
  • supports all known HTML 4.0 entities.
  • leaves an unrecognized entity unchanged.

We can use the StringEscapeUtils.unescapeHtml4() method as follows:

import org.apache.commons.text.StringEscapeUtils;

String escapedString = "&lt;java&gt;public static void main(String[] args) { ... }&lt;/java&gt;";

String unEscapedHTML = StringEscapeUtils.unescapeHtml4(escapedString);

System.out.println(unEscapedHTML);

The program output:

<java>public static void main(String[] args) { ... }</java>

2. Using Plain Java

We can create a custom method to support additional HTML entities or custom HTML entities that the libraries do not support.

The following method takes an input string, searches it for HTML entities, and replaces each one it recognizes. We can add or remove entities in the map as needed.

import java.util.HashMap;
import java.util.Map;

private static final Map<String, String> htmlEntities = new HashMap<>();

  static {
    htmlEntities.put("&lt;", "<");
    htmlEntities.put("&gt;", ">");
    htmlEntities.put("&amp;", "&");
    htmlEntities.put("&quot;", "\"");
    htmlEntities.put("&nbsp;", "\u00a0"); // non-breaking space
    htmlEntities.put("&copy;", "\u00a9"); // copyright sign
    htmlEntities.put("&reg;", "\u00ae");  // registered sign
    htmlEntities.put("&euro;", "\u20ac"); // euro sign
  }

  public static final String unescapeHTML(String source) {
    if (source == null) {
      return null;
    }
    int i, j;

    boolean continueLoop;
    int skip = 0;
    do {
      continueLoop = false;
      // Find the next candidate entity: '&' followed later by ';'
      i = source.indexOf("&", skip);
      if (i > -1) {
        j = source.indexOf(";", i);
        if (j > i) {
          String entityToLookFor = source.substring(i, j + 1);
          String value = htmlEntities.get(entityToLookFor);
          if (value != null) {
            source = source.substring(0, i)
                + value + source.substring(j + 1);
            // Resume after the replacement so its text is not unescaped again
            skip = i + value.length();
          } else {
            // Not a known entity: leave it alone and move past this '&'
            skip = i + 1;
          }
          continueLoop = true;
        }
      }
    } while (continueLoop);
    return source;
  }

We can use the above method to unescape the HTML:

String input = "&lt;java&gt;public static void main(String[] args) { ... }&lt;/java&gt;";

String output = unescapeHTML(input);

System.out.println(output);

The program output:

<java>public static void main(String[] args) { ... }</java>
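
A custom method can also be extended to cover numeric character references such as &#60; or &#x3E;, which a fixed lookup table cannot list exhaustively. The following is a minimal sketch of that idea, not a complete HTML parser. The class name EntityUnescaper, the method unescape() and the convention of passing bare entity names (for example "lt" instead of "&lt;") are assumptions made for this illustration. It scans the input once, so text produced by a replacement is never unescaped a second time.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; names are not from any library.
class EntityUnescaper {

    // Matches hex references (&#x3E;), decimal references (&#60;) and named entities (&lt;).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Unescapes numeric references plus the named entities found in the supplied table.
    // Keys of namedEntities are bare names, e.g. "lt" rather than "&lt;".
    static String unescape(String source, Map<String, String> namedEntities) {
        if (source == null) {
            return null;
        }
        Matcher m = ENTITY.matcher(source);
        StringBuilder out = new StringBuilder(source.length());
        int last = 0;
        while (m.find()) {
            out.append(source, last, m.start());               // text before the entity
            out.append(resolve(m.group(), m.group(1), namedEntities));
            last = m.end();
        }
        out.append(source, last, source.length());             // remaining tail
        return out.toString();
    }

    // Returns the replacement text, or the original entity if it cannot be resolved.
    private static String resolve(String whole, String body, Map<String, String> named) {
        if (body.charAt(0) != '#') {
            return named.getOrDefault(body, whole);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1), 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : whole;
        } catch (NumberFormatException e) {
            return whole; // too many digits to fit in an int
        }
    }

    public static void main(String[] args) {
        // Map.of requires Java 9 or later.
        Map<String, String> table =
                Map.of("lt", "<", "gt", ">", "amp", "&", "hellip", "\u2026");
        System.out.println(unescape("&#60;b&#x3E; 5 &amp;lt; 6 &hellip; &unknown;", table));
    }
}
```

Running main prints <b> 5 &lt; 6 … &unknown;. The &amp;lt; sequence becomes &lt; and stays that way because the scan never revisits replaced text, and &unknown; is left alone because it is not in the table.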

Happy Learning !!
