Unwanted non-ASCII characters can end up in file content or strings in a variety of ways, for example by copying and pasting text from an MS Word document or a web browser, or through PDF-to-text or HTML-to-text conversion. We may want to remove these characters, along with non-printable characters, before the application uses the file, because they tend to cause problems once we start processing the file's content.
In this Java regex example, I use regular expressions to find and remove non-ASCII characters as well as non-printable characters.
1. Java remove non-printable characters
The following Java method strips unwanted non-ASCII characters and non-printable characters from a string.
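private static String cleanTextContent(String text)
{
    // Step 1: remove every character outside the 7-bit ASCII range (0x00-0x7F)
    text = text.replaceAll("[^\\x00-\\x7F]", "");
    // Step 2: remove ASCII control characters, except CR, LF and TAB
    text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    // Step 3: remove any remaining character in Unicode category C (control, format, etc.).
    // Note: this category includes CR, LF and TAB, so those are removed here as well.
    text = text.replaceAll("\\p{C}", "");
    // Finally, drop leading and trailing whitespace
    return text.trim();
}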
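To see what each of the three replacements contributes, the steps can be applied one at a time to a small sample string. This is only a sketch: the class name CleanStepsSketch, the helper method names and the sample string are made up for illustration and are not part of the original program.

```java
public class CleanStepsSketch
{
    // Step 1: drop everything outside the 7-bit ASCII range.
    static String stripNonAscii(String text)
    {
        return text.replaceAll("[^\\x00-\\x7F]", "");
    }

    // Step 2: drop ASCII control characters, but keep CR, LF and TAB.
    static String stripControlExceptWhitespace(String text)
    {
        return text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    }

    // Step 3: drop every character in Unicode category C.
    // CR, LF and TAB belong to this category, so they are removed too.
    static String stripUnicodeOther(String text)
    {
        return text.replaceAll("\\p{C}", "");
    }

    public static void main(String[] args)
    {
        // Hypothetical sample: "café", a BEL control character, a TAB and a line feed.
        String sample = "caf\u00E9\u0007 line1\tline2\n";

        String afterStep1 = stripNonAscii(sample);                    // the accented e is gone
        String afterStep2 = stripControlExceptWhitespace(afterStep1); // BEL is gone, TAB and LF remain
        String afterStep3 = stripUnicodeOther(afterStep2);            // TAB and LF are gone as well

        System.out.println(afterStep3.trim()); // prints "caf line1line2"
    }
}
```

Note that the second step keeps line breaks and tabs only until the third step runs. If line structure should survive, the third replacement would have to be dropped or narrowed.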
2. Remove non-printable characters example
2.1. File content with non-ASCII characters
I will read a file with the following content and remove all non-ASCII characters, including non-printable characters.
öäü how to do in java . com A função, Ãugent
2.2. Java program to clean ASCII text
package com.howtodoinjava.demo;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CleanTextExample
{
    public static void main(String[] args)
    {
        File file = new File("c:/temp/data.txt");

        // Print the raw content first, then the cleaned content
        String uncleanContent = readFileIntoString(file);
        System.out.println(uncleanContent);

        String cleanContent = cleanTextContent(uncleanContent);
        System.out.println(cleanContent);
    }

    // Reads the whole file (decoded as UTF-8 by Files.lines) into a single string
    private static String readFileIntoString(File file)
    {
        StringBuilder contentBuilder = new StringBuilder();
        try (Stream<String> stream = Files.lines(Paths.get(file.toURI())))
        {
            stream.forEach(s -> contentBuilder.append(s).append("\n"));
        }
        catch (IOException e)
        {
            System.out.println("Error reading " + file.getAbsolutePath());
        }
        return contentBuilder.toString();
    }

    private static String cleanTextContent(String text)
    {
        // Step 1: remove every character outside the 7-bit ASCII range (0x00-0x7F)
        text = text.replaceAll("[^\\x00-\\x7F]", "");
        // Step 2: remove ASCII control characters, except CR, LF and TAB
        text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
        // Step 3: remove any remaining character in Unicode category C;
        // this category includes CR, LF and TAB, so those are removed here as well
        text = text.replaceAll("\\p{C}", "");
        return text.trim();
    }
}
Program Output.
öäü how to do in java . com A função, Ãugent
how to do in java . com A funo, ugent
Feel free to modify the cleanTextContent() method to suit your needs, adding or removing regular expressions as required.
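As an illustration of such modifications, here are two possible variants. Both are hypothetical sketches rather than part of the original program: the class and method names are invented, and the second variant relies on java.text.Normalizer, which the article itself does not use. The first keeps line breaks and tabs by omitting the final \p{C} replacement. The second decomposes accented letters and removes only the combining marks, so that "função" becomes "funcao" instead of "funo".

```java
import java.text.Normalizer;

public class CleanTextVariants
{
    // Hypothetical variant 1: keep CR, LF and TAB so the line structure survives.
    static String cleanKeepingLineBreaks(String text)
    {
        // remove non-ASCII characters
        text = text.replaceAll("[^\\x00-\\x7F]", "");
        // remove ASCII control characters except CR, LF and TAB
        text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
        return text.trim();
    }

    // Hypothetical variant 2: keep the base letter of accented characters.
    static String cleanKeepingBaseLetters(String text)
    {
        // split each accented letter into base letter + combining mark (NFD form)
        text = Normalizer.normalize(text, Normalizer.Form.NFD);
        // remove the combining marks (Unicode category M)
        text = text.replaceAll("\\p{M}", "");
        // remove whatever non-ASCII characters are still left
        text = text.replaceAll("[^\\x00-\\x7F]", "");
        // remove control and other non-printable characters
        text = text.replaceAll("\\p{C}", "");
        return text.trim();
    }

    public static void main(String[] args)
    {
        System.out.println(cleanKeepingBaseLetters("A fun\u00E7\u00E3o")); // prints "A funcao"
    }
}
```

Which variant fits depends on the data: transliterating accents keeps words readable, while plain removal is simpler when only strict ASCII output matters.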
Happy Learning !!