Manipulating HTML with Java ----- jsoup

JSOUP

Official documentation (Chinese)

jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

Main functions:

  1. Parse HTML from a URL, file, or string;

  2. Find and extract data using DOM traversal or CSS selectors;

  3. Manipulate HTML elements, attributes, and text;

  4. jsoup is released under the MIT license and can be safely used in commercial projects.

Dependencies:

  • jsoup is completely self-contained and has no dependencies.

  • jsoup runs on Java 7 and later, Scala, Kotlin, Android, OSGi, Lambda, and Google App
    Engine.

Maven and jar download location

https://jsoup.org/download

------Parsing HTML documents-------

// HTML source
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
// Parse it into a DOM
Document doc = Jsoup.parse(html);

The parser makes every effort to produce a clean parse from the HTML you provide, regardless of whether the HTML is well-formed. It handles:

  • Unclosed tags (for example, <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
  • Implicit tags (for example, a naked <td>Table data</td> is wrapped in a <table><tr><td>...)
  • Reliable document structure (the html element contains a head and a body, and only appropriate elements appear within the head)
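
The unclosed-tag case and the guaranteed head/body structure can be checked with a short program. This is only a sketch: the class name LenientParseDemo and the helper paragraphCount are made up for illustration, and the exact pretty-printed output may differ between jsoup versions, so none is shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, not part of jsoup.
public class LenientParseDemo {

    // Parses the given HTML and returns how many <p> elements the parser produced.
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    public static void main(String[] args) {
        // Two <p> tags, neither of them closed.
        String messy = "<p>Lorem<p>Ipsum";
        Document doc = Jsoup.parse(messy);

        // The parser closes each paragraph, so two separate <p> elements exist.
        System.out.println("paragraphs: " + paragraphCount(messy));

        // head and body are created even though the input contained neither.
        System.out.println("has head: " + (doc.head() != null));
        System.out.println("has body: " + (doc.body() != null));

        // Print the normalized markup to see what the parser built.
        System.out.println(doc.body().html());
    }
}
```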

Extract attributes, text and HTML from elements

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

Other APIs:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)
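
As a rough illustration of these accessors, the sketch below reads the tag name, id, and class attribute from a single element. The class ElementInfoDemo, the helper describe, and the sample markup are assumptions made for this example only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup.
public class ElementInfoDemo {

    // Builds a one-line summary from the element's tag name, id, and class attribute.
    static String describe(Element el) {
        return el.tagName() + "#" + el.id() + " [" + el.className() + "]";
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div id=\"main\" class=\"note wide\">Hi</div>");
        Element div = doc.select("div").first();

        System.out.println(describe(div));          // div#main [note wide]
        System.out.println(div.hasClass("wide"));   // true
        System.out.println(div.hasClass("narrow")); // false
    }
}
```

Note that className() returns the whole class attribute as one string, while hasClass(String className) tests for a single class name.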

Fetch and parse an HTML document from the web, then find data in it (screen scraping)

Jsoup.connect(String url)

Fetching the example domain:

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

The connect(String url) method creates a new Connection, and get() fetches and parses the HTML document. If an error occurs while fetching the URL, an IOException is thrown, and you should handle it appropriately.
- Example -

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupString {
  public static void main(String[] args) {
    try {
      // Fetch the page and parse it into a Document
      Document doc = Jsoup.connect("https://www.chrisyoung777.com").get();
      String title = doc.title();
      System.out.println(title);
    } catch (IOException e) {
      // Fetching or reading the page failed
      e.printStackTrace();
    }
  }
}
// Console output
Hi.man
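
Once a page has been fetched, the "find data" step might look like the following sketch, which collects the absolute URL of every link. The class LinkLister, the helper linkUrls, and the 10-second timeout are illustrative choices, not something the original example prescribes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup.
public class LinkLister {

    // Collects the absolute URL of every <a> element that has an href attribute.
    static List<String> linkUrls(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl resolves relative hrefs against the document's base URI.
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://example.com/")
                    .timeout(10_000) // milliseconds; an arbitrary value for this sketch
                    .get();
            for (String url : linkUrls(doc)) {
                System.out.println(url);
            }
        } catch (IOException e) {
            System.err.println("Could not fetch the page: " + e.getMessage());
        }
    }
}
```

The same helper also works on a document parsed from a string, provided a base URI is passed as the second argument to Jsoup.parse so that relative links can be resolved.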

Data modification

  • Set attribute value

Elements provides methods for setting attributes and classes on a whole batch of elements at once. For example, to add rel="nofollow" to every a element inside a div, select those elements and call attr("rel", "nofollow") on the result.
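
A minimal sketch of that batch update, assuming a small sample document; the class NoFollowDemo and the helper addNoFollow are hypothetical names used only here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, not part of jsoup.
public class NoFollowDemo {

    // Parses the HTML and sets rel="nofollow" on every <a> that sits inside a <div>.
    static Document addNoFollow(String html) {
        Document doc = Jsoup.parse(html);
        // attr(key, value) on Elements applies the attribute to each matched element.
        doc.select("div a").attr("rel", "nofollow");
        return doc;
    }

    public static void main(String[] args) {
        String html = "<div><a href='/a'>A</a><a href='/b'>B</a></div>"
                + "<a href='/c'>C</a>";
        Document doc = addNoFollow(html);

        // Only the two links inside the div were changed.
        System.out.println(doc.select("a[rel=nofollow]").size()); // 2
        System.out.println(doc.select("a").size());               // 3
    }
}
```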

  • Set the html content of the element
Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>"); // Insert HTML at the start of the div's content
div.append("<p>Last</p>"); // Insert HTML at the end of the div's content
// Result: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first(); // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// Result: <li><a href="http://example.com/"><span>One</span></a></li>

  • Set the text content of the element
Element div = doc.select("div").first(); // <div></div>
div.text("five > four"); // <div>five &gt; four</div>
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>
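
The difference between text(...) and html(...) is easiest to see when the same string is passed to both. The sketch below counts the child elements that result in each case; TextVsHtmlDemo and its helpers are hypothetical names for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup.
public class TextVsHtmlDemo {

    // Returns a fresh, empty <div> from a newly parsed document.
    static Element emptyDiv() {
        return Jsoup.parse("<div></div>").select("div").first();
    }

    // Number of child elements after the string is set as plain text.
    static int childrenAfterText(String s) {
        Element div = emptyDiv();
        div.text(s); // markup characters are treated as literal text
        return div.children().size();
    }

    // Number of child elements after the string is set as HTML.
    static int childrenAfterHtml(String s) {
        Element div = emptyDiv();
        div.html(s); // the string is parsed into child nodes
        return div.children().size();
    }

    public static void main(String[] args) {
        String s = "<b>bold</b>";
        System.out.println(childrenAfterText(s)); // 0
        System.out.println(childrenAfterHtml(s)); // 1
    }
}
```

Using text(...) is therefore the safer choice when the content is untrusted input, since nothing in it is interpreted as markup.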

Tags: Java html

Posted by beerman on Mon, 16 May 2022 03:00:36 +0300