Java remove non-printable non-ascii characters using regex

Unwanted non-ASCII characters can end up in a file or string in a variety of ways, e.g. by copying and pasting text from an MS Word document or web browser, or through PDF-to-text or HTML-to-text conversion. We may want to remove non-printable characters before using the file in the application, because they cause problems once we start processing the file's content.

In this Java regex example, I am using regular expressions to find and remove non-ASCII characters as well as non-printable characters.

1. Java remove non-printable characters

Java method to strip unwanted and non-printable characters from a string.

private static String cleanTextContent(String text) 
{
	// strips off all non-ASCII characters
	text = text.replaceAll("[^\\x00-\\x7F]", "");

	// erases all the ASCII control characters
	text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
	
	// removes Unicode "Other" characters (category C); this also matches \r, \n and \t
	text = text.replaceAll("\\p{C}", "");

	return text.trim();
}
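
To see what each of the three replacements removes, the steps can be run one at a time on a small string. The following is only a minimal sketch: the class name CleanStepsDemo, the helper method names and the sample string are made up for illustration and are not part of the original program.

```java
public class CleanStepsDemo {

    // Step 1: drop everything outside the 7-bit ASCII range.
    static String stripNonAscii(String text) {
        return text.replaceAll("[^\\x00-\\x7F]", "");
    }

    // Step 2: drop ASCII control characters except CR, LF and TAB.
    static String stripAsciiControls(String text) {
        return text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    }

    // Step 3: drop every character in Unicode category C ("Other").
    // This category includes CR, LF and TAB as well.
    static String stripUnicodeOther(String text) {
        return text.replaceAll("\\p{C}", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample: an accented e, a BEL control character (U+0007),
        // a tab and a zero-width space (U+200B).
        String sample = "caf\u00E9\u0007 a\tb\u200Bc";

        String step1 = stripNonAscii(sample);     // accented e and zero-width space are gone
        String step2 = stripAsciiControls(step1); // BEL is gone, the tab is still there
        String step3 = stripUnicodeOther(step2);  // the tab is gone too

        System.out.println(step3); // prints "caf abc"
    }
}
```

Because the last step also removes tabs and line breaks, it may make sense to skip it when the line structure of the text has to be preserved.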

2. Remove non-printable characters example

2.1. File with non-ASCII content

I will read a file with the following content and remove all non-ASCII characters, including non-printable ones.

öäü how to do in java . com A função, Ãugent

2.2. Java program to clean ASCII text

package com.howtodoinjava.demo;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CleanTextExample 
{
	public static void main(String[] args) 
	{
		
		File file = new File("c:/temp/data.txt");
		
		String uncleanContent = readFileIntoString(file);
		
		System.out.println(uncleanContent);
		
		String cleanContent = cleanTextContent(uncleanContent);
		
		System.out.println(cleanContent);
	}
	
	private static String readFileIntoString(File file)
	{
		StringBuilder contentBuilder = new StringBuilder();
		try (Stream<String> stream = Files.lines(Paths.get(file.toURI()))) 
		{
			stream.forEach(s -> contentBuilder.append(s).append("\n"));
		} 
		catch (IOException e) 
		{
			System.out.println("Error reading " + file.getAbsolutePath());
		}
		return contentBuilder.toString();
	}

	private static String cleanTextContent(String text) 
	{
		// strips off all non-ASCII characters
		text = text.replaceAll("[^\\x00-\\x7F]", "");

		// erases all the ASCII control characters
		text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
		
		// removes Unicode "Other" characters (category C); this also matches \r, \n and \t
		text = text.replaceAll("\\p{C}", "");

		return text.trim();
	}
}

Program Output.

öäü how to do in java . com A função, Ãugent

how to do in java . com A funo, ugent

Feel free to modify the cleanTextContent() method to suit your needs, adding or removing regexes as required.
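
As one possible modification, accented letters can be reduced to their base letter instead of being dropped, so that "função" becomes "funcao" rather than "funo". The sketch below assumes that this folding is acceptable for your data; the class AccentFoldingCleaner and the method cleanKeepingBaseLetters() are hypothetical names, and the approach relies on java.text.Normalizer from the standard library.

```java
import java.text.Normalizer;

public class AccentFoldingCleaner {

    // Hypothetical variant of cleanTextContent(): decomposes accented letters
    // first so that the base letter survives the ASCII filter.
    static String cleanKeepingBaseLetters(String text) {
        // NFD splits e.g. "c with cedilla" into "c" plus a combining cedilla (U+0327)
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);

        // remove the combining marks left behind by the decomposition
        String withoutMarks = decomposed.replaceAll("\\p{M}", "");

        // drop whatever is still outside the ASCII range
        String ascii = withoutMarks.replaceAll("[^\\x00-\\x7F]", "");

        // drop ASCII control characters but keep CR, LF and TAB
        return ascii.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "").trim();
    }

    public static void main(String[] args) {
        // "A fun\u00E7\u00E3o" is "A funcao" written with c-cedilla and a-tilde
        System.out.println(cleanKeepingBaseLetters("A fun\u00E7\u00E3o")); // prints "A funcao"
    }
}
```

Characters that have no ASCII base letter in their decomposition are still removed by the ASCII filter.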

Happy Learning !!


3 thoughts on “Java remove non-printable non-ascii characters using regex”

  1. Hi Lokesh,
    I did not know how to work with control characters.
    I did not know that these characters are not displayed in editors like Eclipse, Notepad etc.
    I had a requirement to split words separated by control characters in a binary file.
    I used your code:
    text.replaceAll("\\p{C}+", "");

    It helped me. Thanks a lot.

    Regards,
    Subhashish
