How to remove specific html tags from string in java

My input is a plain-text string, and the requirement is to remove all HTML tags except a few specific ones, such as:

<p>
<li>
<u>
<li>

If these specific tags have attributes like class or id, I want to remove these attributes.

A few examples:

<a href = "#">Link</a>            ->   Link

<p>paragraph</p>                  ->   <p>paragraph</p>

<p class="class1">paragraph</p>   ->   <p>paragraph</p>

I have gone through this Remove HTML tags from a String but it does not answer my question completely.

Can it be handled by a set of regexes, or could I make use of some library?

asked Aug 11, 2011 at 10:00

RandomQuestion

I tried jsoup, and it seems to be able to handle all such cases. Here is some example code.

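public String clean(String unsafe) {
    // Start from an empty whitelist and allow only the tags to keep.
    // No attributes are whitelisted, so attributes are dropped.
    // (Newer jsoup releases name this class Safelist.)
    Whitelist whitelist = Whitelist.none();
    whitelist.addTags("p", "br", "ul");

    String safe = Jsoup.clean(unsafe, whitelist);
    // Jsoup.clean returns escaped HTML; unescape it with Apache Commons Lang
    return StringEscapeUtils.unescapeXml(safe);
}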

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I require.

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>

answered Aug 11, 2011 at 19:32

RandomQuestion

For simple HTML, this may be sufficient:

// remove any <script> elements, including their content (even across line breaks)
html = html.replaceAll("(?is)<script.*?</script>", "");
// remove any attributes
html = html.replaceAll("(?i)<([a-zA-Z0-9_-]+)(\\s[^>]*)>", "<$1>");
// remove any tags other than li and p (\\b keeps e.g. <pre> from counting as <p>)
html = html.replaceAll("(?i)<(?!(/?(li|p)\\b))[^>]*>", "");
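As a rough, self-contained sketch of the three steps above applied to the examples from the question: the class name `TagFilter`, the method `keepOnly`, and the allow-list `p|li|u` are made up for illustration and are not part of the original answer. The tag removal runs before the attribute stripping here so that only the kept tags are rewritten.

```java
import java.util.regex.Pattern;

public class TagFilter {
    // Hypothetical allow-list for illustration; adjust to the tags you want to keep.
    private static final String ALLOWED = "p|li|u";

    // 1. drop <script> elements including their content
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b.*?</script\\s*>");
    // 2. drop every tag (opening or closing) whose name is not on the allow-list
    private static final Pattern OTHER_TAGS =
            Pattern.compile("(?i)<(?!/?(?:" + ALLOWED + ")\\b)[^>]*>");
    // 3. strip attributes from the opening tags that remain
    private static final Pattern ATTRIBUTES =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\s[^>]*>");

    public static String keepOnly(String html) {
        String result = SCRIPT.matcher(html).replaceAll("");
        result = OTHER_TAGS.matcher(result).replaceAll("");
        return ATTRIBUTES.matcher(result).replaceAll("<$1>");
    }

    public static void main(String[] args) {
        System.out.println(keepOnly("<a href = \"#\">Link</a>"));          // Link
        System.out.println(keepOnly("<p>paragraph</p>"));                  // <p>paragraph</p>
        System.out.println(keepOnly("<p class=\"class1\">paragraph</p>")); // <p>paragraph</p>
    }
}
```

Like any regex approach, this treats everything between "<" and ">" as a tag, so text such as "< this is not html >" is removed as well; a parser is the safer choice for untrusted or messy input.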

Hope that helps.

answered Aug 11, 2011 at 10:55


This example shows how to remove HTML tags from a String in Java using a regular expression and the Jsoup library.

You can remove simple HTML tags from a string using a regular expression. Usually, HTML tags are enclosed in “<” and “>” brackets, so we are going to use the "<[^>]*>" pattern to match anything between these brackets and replace them with the empty string to remove them.

< - start bracket

[^>] - followed by any character which is not closing bracket ">"

* - zero or more times

> - followed by closing bracket

Example

package com.javacodeexamples.stringexamples;

public class RemoveHTMLTagsFromStringExample {

    public static void main(String[] args) {

        String[] strHTMLTexts = {
                "<a href=\"#\">HTML Link</a>",
                "<table><tr><td>column1</td></tr></table>",
                "<script>alert('javascript');</script>",
                "<br /><  BR  >line break<bR/><br>",
                "<!-- html comment --><b>bold text</b>",
                "&nbsp;&nbsp;Jack &amp; Jones",
                "&lt;script&gt;"
        };

        // match HTML tags
        String strRegEx = "<[^>]*>";

        // replace them with empty string to remove them
        for (String str : strHTMLTexts) {
            System.out.println(str.replaceAll(strRegEx, ""));
        }
    }
}

Output

HTML Link

column1

alert('javascript');

line break

bold text

&nbsp;&nbsp;Jack &amp; Jones

&lt;script&gt;

The above regular expression worked fine except it did not handle the HTML entities like “&nbsp;” and “&amp;”. Depending on the requirement, you can either replace them with the equivalent characters one by one or remove them using "&.*?;" pattern.

& - & character

.*? - followed by any characters, as few as possible (reluctant match)

; - followed by semicolon

String[] strHTMLTexts = {
        "<a href=\"#\">HTML Link</a>",
        "<table><tr><td>column1</td></tr></table>",
        "<script>alert('javascript');</script>",
        "<br /><  BR  >line break<bR/><br>",
        "<!-- html comment --><b>bold text</b>",
        "&nbsp;&nbsp;Jack &amp; Jones",
        "&lt;script&gt;"
};

String strRegEx = "<[^>]*>";

for (String str : strHTMLTexts) {

    str = str.replaceAll(strRegEx, "");

    // replace &nbsp; with space
    str = str.replace("&nbsp;", " ");

    // replace &amp; with &
    str = str.replace("&amp;", "&");

    // OR remove all HTML entities
    str = str.replaceAll("&.*?;", "");

    System.out.println(str);
}

Output

HTML Link

column1

alert('javascript');

line break

bold text

Jack & Jones

script
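If you would rather decode entities than delete them, a small lookup table applied in a single pass is one option. The sketch below is only an illustration: the class name `EntityDecoder`, the method `decode`, and the five entities in the table are assumptions, and the real list of HTML named entities is far longer. Unlike the "&.*?;" pattern, it leaves ordinary text such as "Tom & Jerry; done" untouched, and decoding in one pass avoids turning "&amp;lt;" into "<".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Only a handful of named entities, for illustration.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("nbsp", " ");   // mapped to an ordinary space, as in the example above
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches named references (&name;) and decimal references (&#123;).
    private static final Pattern ENTITY = Pattern.compile("&(#\\d{1,7}|[a-zA-Z]+);");

    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#")) {
                int codePoint = Integer.parseInt(body.substring(1));
                // Leave out-of-range numeric references unchanged
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } else {
                // Unknown names are left unchanged
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&nbsp;&nbsp;Jack &amp; Jones"));
        System.out.println(decode("&lt;script&gt;"));
        System.out.println(decode("Tom & Jerry; done"));
    }
}
```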

How to remove specific HTML tags from the String?

What if you want to remove only a specific HTML tag from a String? You can do that using a regular expression too. Suppose you want to remove the “a” tag from the String "<a href='#'>HTML<b>Bold</b>link</a>". You can use the "<[/]?a[^>]*>" pattern to do that.

< - opening bracket

[/]? -  followed by zero or one “/” to match closing tag

a - followed by “a” character

[^>] - followed by any character which is not closing bracket ">"

* - zero or more times

> - followed by closing bracket ">"

String strHtml = "<a href='#'>HTML<b>Bold</b>link</a>";
String strRegEx = "<[/]?a[^>]*>";

System.out.println(strHtml.replaceAll(strRegEx, ""));

Output

HTML<b>Bold</b>link

Let’s run some more tests to make sure that the pattern works.

String[] strHtmlLinks = {
    "<a href='#'>HTML<b>Bold</b>link</a>",
    "<A href=''></A>",
    "< a href='#'></ a >",
    "< a href='#'>< / a >"
};

String strRegEx = "<[/]?a[^>]*>";

for (String html : strHtmlLinks)
    System.out.println(html.replaceAll(strRegEx, ""));

Output

HTML<b>Bold</b>link

<A href=''></A>

< a href='#'></ a >

< a href='#'>< / a >

HTML is not a strict language. As you can see from the output, our pattern failed when the tag name was in upper case or when the tag contained extra spaces. Let’s modify the pattern to "(?i)<[\\s]*[/]?[\\s]*a[^>]*>" to cover these scenarios.

(?i) - case insensitive comparison

< - opening bracket "<"

[\\s]* - followed by zero or more spaces

[/]? - followed by zero or one "/"

[\\s]* - followed by zero or more spaces

a -  followed by "a"

[^>] - followed by any character which is not closing bracket ">"

* - zero or more times

> - followed by closing bracket ">"

Example

String[] strHtmlLinks = {
    "<a href='#'>HTML<b>Bold</b>link</a>",
    "<A href=''>Link</A>",
    "< a href='#'>Link</ a >",
    "< a href='#'>Link< / a >"
};

String strRegEx = "(?i)<[\\s]*[/]?[\\s]*a[^>]*>";

for (String html : strHtmlLinks)
    System.out.println(html.replaceAll(strRegEx, ""));

Output

HTML<b>Bold</b>link

Link

Link

Link
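One case the pattern above does not guard against: "a[^>]*" also matches any tag whose name merely starts with "a", such as <abbr> or <article>. A possible fix, sketched below, is to add a word boundary (\b) after the tag name. The class name `RemoveAnchorTag` and its members are hypothetical names for this illustration.

```java
public class RemoveAnchorTag {
    // Pattern from the text above.
    public static final String LOOSE = "(?i)<[\\s]*[/]?[\\s]*a[^>]*>";
    // Same pattern with a word boundary after the tag name.
    public static final String STRICT = "(?i)<\\s*/?\\s*a\\b[^>]*>";

    public static String removeAnchors(String html) {
        return html.replaceAll(STRICT, "");
    }

    public static void main(String[] args) {
        String html = "<abbr title='x'>HTML</abbr> <A href='#'>Link</A>";
        // The loose pattern strips the <abbr> tags too
        System.out.println(html.replaceAll(LOOSE, ""));  // HTML Link
        // The stricter pattern removes only the <a> tags
        System.out.println(removeAnchors(html));         // <abbr title='x'>HTML</abbr> Link
    }
}
```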

Can a regular expression handle every HTML string? The short answer is NO. So far we have only seen happy-path scenarios. Consider the example HTML string given below.

String strHtml = "<b Very important text</b>Gone!";
String strRegEx = "<[^>]*>";

System.out.println(strHtml.replaceAll(strRegEx, ""));

Output

Gone!

Our important text was removed by regular expression because HTML was not well-formed. It is very common to encounter such malformed HTML which cannot be taken care of by a regular expression. Consider another example.

String strHtml = "<strong>Maths: a < b & b > c</strong>";
String strRegEx = "<[^>]*>";

System.out.println(strHtml.replaceAll(strRegEx, ""));

Output

Maths: a  c

What should I use to remove the HTML tags?

If you are removing a tag or two from the string and you are absolutely certain that the input HTML is well-formed, using a regular expression is OK. In all other scenarios, using an HTML parser is the way to go.

One such parser is Jsoup. Here is how you can remove the HTML elements from the string using Jsoup.

String strHtml = "<strong>Maths: a < b & b > c</strong>";

// parse the string as HTML and keep only the text content
String text = Jsoup.parse(strHtml).text();

System.out.println(text);

Output

Maths: a < b & b > c

The Jsoup library even allows you to whitelist elements in case you want to retain some tags while clearing all others.


About the author

My name is RahimV and I have over 16 years of experience in designing and developing Java applications. Over the years I have worked with many fortune 500 companies as an eCommerce Architect. My goal is to provide high quality but simple to understand Java tutorials and examples for free.

How do I remove text tags in HTML?

Removing HTML Tags from Text.
Press Ctrl+H. ...
Click the More button, if it is available. ...
Make sure the Use Wildcards check box is selected.
In the Find What box, enter the following: \<i\>([!<]@)\.
In the Replace With box, enter the following: \1.
With the insertion point still in the Replace With box, press Ctrl+I once.

Is it possible to remove the HTML tags from data?

strip_tags() is a function that strips all HTML and PHP tags from a given string (parameter one). You can also use parameter two to specify a list of HTML tags you want to keep.

How do I remove a specific HTML tag from a string in PHP?

PHP provides an inbuilt function to remove the HTML tags from the data. The strip_tags() function is an inbuilt function in PHP that strips HTML, XML, and PHP tags from a string. It accepts two parameters. This function returns a string with all NULL bytes, HTML, and PHP tags stripped from a given $str.

Which function is used to remove all HTML tags from a string passed to a form?

The strip_tags() function strips a string from HTML, XML, and PHP tags. Note: HTML comments are always stripped. This cannot be changed with the allow parameter.