Convert HTML to string in Java

Friends, I have to parse a description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?

Majid

asked Aug 31, 2010 at 10:03

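Yes, Jsoup is the better option. The following converts the whole HTML text to plain text. Note that text() normalizes whitespace, so the result comes back on a single line.

// requires: import org.jsoup.Jsoup;
String plainText = Jsoup.parse(your_html_text).text();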

answered Mar 15, 2019 at 9:01

Ranjit

Just getting rid of HTML tags is simple:

// Replace each run of one or more HTML tags (with optional whitespace
// in between) with a single space character
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

Unfortunately, the requirements are never that simple:

Usually, <p> and <div> elements need separate handling (typically they should become line breaks rather than spaces), and there may be CDATA blocks or inline JavaScript containing > characters that trip up the regex.
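
As a rough illustration of those caveats, here is a minimal JDK-only sketch that removes script, style, comment and CDATA blocks before touching the tags, and turns a handful of block-level tags into line breaks. The class name HtmlStripper, the method strip, and the list of block-level tags are my own assumptions for the example, not part of any library. Regexes like these will still fail on malformed or unusual markup, so a real parser remains the safer choice.

```java
import java.util.regex.Pattern;

class HtmlStripper {
    // <script>/<style> elements with their contents, plus comments and CDATA sections
    private static final Pattern NON_TEXT = Pattern.compile(
            "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>|<!--.*?-->|<!\\[CDATA\\[.*?\\]\\]>");
    // Tags that usually start a new line when rendered (an illustrative, incomplete list)
    private static final Pattern BLOCK = Pattern.compile(
            "(?i)</?(p|div|br|li|tr|h[1-6])\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        String s = NON_TEXT.matcher(html).replaceAll(" "); // drop non-text blocks first
        s = BLOCK.matcher(s).replaceAll("\n");             // block tags become line breaks
        s = ANY_TAG.matcher(s).replaceAll(" ");            // every other tag becomes a space
        s = s.replaceAll("[ \\t]+", " ");                  // collapse runs of spaces and tabs
        s = s.replaceAll(" ?\\n[ \\n]*", "\n");            // collapse blank lines
        return s.trim();
    }
}
```

For the input <div><p>Hello <b>world</b></p><script>if (a > b) {}</script><p>Bye</p></div>, strip returns two lines: "Hello world" and "Bye". The script body is removed before the generic tag pattern ever sees its > character.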

answered Aug 31, 2010 at 10:58

You can use this single line to remove the HTML tags and keep only the plain text.

// [^>]* also matches tags that span several lines, which .*? does not do by default
htmlString = htmlString.replaceAll("<[^>]*>", "");
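
One thing a tag-stripping regex does not handle is character entities: text such as &amp; or &#65; is left in the output as-is. A small sketch of decoding them afterwards with only the JDK (Java 9 or later) might look like this. The class EntityDecoder and its deliberately tiny entity table are assumptions made for illustration; a full decoder needs the complete HTML entity list, which libraries such as Jsoup already include.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecoder {
    // Deliberately tiny table (an assumption for this sketch); HTML defines many more names
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'",
            "nbsp", " "); // non-breaking space mapped to a plain space here

    // Matches &name;, &#123; and &#x7B; style references
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[a-zA-Z][a-zA-Z0-9]*);");

    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the decoded text literal
            m.appendReplacement(out, Matcher.quoteReplacement(resolve(m.group(1), m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or the original reference if it is unknown or invalid
    private static String resolve(String body, String original) {
        if (body.charAt(0) != '#') {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }
}
```

Decode only after the tags have been removed; otherwise an escaped &lt;b&gt; in the text would turn into something that looks like a tag and get stripped as well.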

demongolem

answered Sep 3, 2010 at 10:16

Kandha

Use Jsoup.

Add the Maven dependency:

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

Then, in your Java code:

import org.jsoup.Jsoup;

public static String html2text(String html) {
    // wholeText() keeps the whitespace and line breaks of the original text nodes
    return Jsoup.parse(html).wholeText();
}

Call html2text with the HTML text and it returns the plain text.

answered Jan 5, 2021 at 5:45

xxxxxx

I'd recommend running the raw HTML through JTidy, which should give you well-formed output that you can write XPath expressions against. This is the most robust way I've found of scraping HTML.
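
To illustrate the second half of this approach, here is a minimal sketch that evaluates an XPath expression with the JDK's built-in javax.xml APIs. It assumes the tidy step has already produced well-formed markup (that step is not shown), and the class XPathText, the method textAt, and the sample class="description" attribute are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathText {
    // Assumes `xhtml` is already well-formed, e.g. the output of a tidy step (not shown)
    public static String textAt(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // With a String result, evaluate() returns the text content of the first match
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }
}
```

For example, textAt("<html><body><div class=\"description\">Hello <b>world</b></div></body></html>", "//div[@class='description']") returns "Hello world". Real tidy output often carries a DOCTYPE or a namespace declaration, which would need extra parser configuration that this sketch leaves out.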

answered Aug 31, 2010 at 10:07

Jon Freedman

If you want text laid out roughly the way a browser would display it, use Jericho's renderer:

import java.net.URL;
import net.htmlparser.jericho.Source;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        String sourceUrlString = "data/test.html";
        if (args.length == 0) {
            System.err.println("Using default argument of \"" + sourceUrlString + '"');
        } else {
            sourceUrlString = args[0];
        }
        // Treat an argument without a scheme as a local file path
        if (sourceUrlString.indexOf(':') == -1) {
            sourceUrlString = "file:" + sourceUrlString;
        }
        Source source = new Source(new URL(sourceUrlString));
        // The renderer produces a simple text layout of the document
        String renderedText = source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
    }
}

I hope this also helps with rendering tables in a browser-like layout.

Thanks, Ganesh

mtb

answered Nov 14, 2016 at 12:34

I needed a plain-text representation of some HTML that included FreeMarker tags. The problem was handed to me with a Jsoup solution, but Jsoup was escaping the FreeMarker tags, which broke the functionality. I also tried HtmlCleaner (SourceForge), but that left the HTML header and style content in the output (with only the tags removed). Reference: http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726

My code:

return new net.htmlparser.jericho.Source(html)
        .getRenderer()
        .setMaxLineLength(Integer.MAX_VALUE)
        .setNewLine(null)
        .toString();

Setting the maximum line length to Integer.MAX_VALUE ensures lines are not artificially wrapped at the renderer's default width. setNewLine(null) makes the renderer use the same newline character(s) as the source.

Stephen Rauch

answered Oct 4, 2018 at 1:04

I use HTMLUtil.textFromHTML(value) from the org.clapper javautil library:

<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>

answered May 20, 2020 at 10:04

With Jsoup, I got all the text on a single line.

So I used the following block of code to strip the tags and keep line breaks:

private String parseHTMLContent(String html) {
    // Replace every tag with a newline...
    String result = html.replaceAll("\\<.*?\\>", "\n");
    // ...then collapse each run of newlines into a single one
    return result.replaceAll("\n+", "\n");
}

Not the best solution, but it solved my problem :)

answered Jan 12, 2021 at 21:25

How do I convert HTML to normal text?

Convert an HTML file to a text file (preserving HTML code and text):
Click the File tab again, then click the Save as option.
In the Save as type drop-down list, select the Plain Text (*.txt) option. ...
Click the Save button to save as a text document.

Can Java read HTML file?

In Java, we can read the content of an HTML file and parse the HTML document.

Can you convert Java to HTML?

Find and select the JAVA files on your computer and click Open to bring them into Doxillion to convert them to the HTML file format. You can also drag and drop your JAVA files directly into the program to convert them as well.