JSOUP
jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
Main features:
- Parse HTML from a URL, file, or string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text;
- jsoup is released under the MIT license and can be safely used in commercial projects.
Dependencies:
- jsoup is completely self-contained and has no dependencies.
- jsoup runs on Java 7 and later, Scala, Kotlin, Android, OSGi, Lambda, and Google App Engine.
Maven and jar download location
------Parsing HTML documents-------
// HTML to parse
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
// Parse it into a DOM
Document doc = Jsoup.parse(html);
The parser makes every attempt to create a clean parse from the HTML you provide, whether or not that HTML is well-formed. It handles:
- Unclosed tags (for example, <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
- Implicit tags (for example, a naked <td>Table data</td> is wrapped in a <table><tr><td>...)
- Reliably creating the document structure (html containing a head and body, with only appropriate elements inside the head)
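To make the first and third points concrete, here is a minimal sketch. The class name LenientParseExample and the helper countParagraphs are hypothetical names chosen for illustration; the only jsoup calls used are Jsoup.parse, select, head, and body.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseExample {
    // Hypothetical helper: parses the markup and counts the <p> elements found
    static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and there is no <html>, <head>, or <body>
        String messy = "<p>Lorem<p>Ipsum";
        Document doc = Jsoup.parse(messy);

        // The parser closes the first <p> when it meets the second one
        System.out.println(countParagraphs(messy));          // 2
        System.out.println(doc.select("p").first().text());  // Lorem

        // The document skeleton is created even though the input omitted it
        System.out.println(doc.head() != null);              // true
        System.out.println(doc.body() != null);              // true
    }
}
```

The sketch deliberately sticks to the unclosed-tag case; how a stray table cell is handled may depend on the jsoup version and parse context, so it is not asserted here.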
Extract attributes, text and HTML from elements
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
Other APIs:
Element.id(), Element.tagName(), Element.className(), and Element.hasClass(String className)
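As a rough illustration of these accessors, the sketch below reads them from a single element. The markup, the class name ElementInfoExample, and the helper firstDiv are assumptions made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementInfoExample {
    // Hypothetical helper: returns the first <div> of the given markup
    static Element firstDiv(String html) {
        return Jsoup.parse(html).select("div").first();
    }

    public static void main(String[] args) {
        Element div = firstDiv("<div id='main' class='box wide'>Hello</div>");

        System.out.println(div.id());              // main
        System.out.println(div.tagName());         // div
        System.out.println(div.className());       // box wide
        System.out.println(div.hasClass("wide"));  // true
        System.out.println(div.hasClass("tall"));  // false
    }
}
```

Note that className() returns the whole class attribute as one string, while hasClass checks for a single class name.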
Fetch and parse HTML documents from the web, then find data in them (screen scraping)
Jsoup.connect(String url)
Fetch the example domain:
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
The connect(String url) method creates a new Connection, and get() fetches and parses the HTML document. If an error occurs while fetching the URL, an IOException is thrown, which you should handle appropriately.
- Example -
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupString {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.chrisyoung777.com").get();
            String title = doc.title();
            System.out.println(title);
        } catch (IOException e) {
            // The fetch failed; print the stack trace
            e.printStackTrace();
        }
    }
}
// Console output:
// Hi.man
Data modification
- Set attribute value
Elements provides methods for operating on the attributes and classes of many elements at once. For example, to add rel="nofollow" to every a element inside a div, select those links and call attr(String key, String value) on the resulting Elements.
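The batch update described above can be sketched as follows. The sample markup, the class name NofollowExample, and the helper addNofollow are hypothetical; the selector "div a" is one way to match links nested in a div, and a narrower selector may suit a real page better.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NofollowExample {
    // Hypothetical helper: adds rel="nofollow" to every <a> nested inside a <div>
    static Document addNofollow(String html) {
        Document doc = Jsoup.parse(html);
        // attr(key, value) on Elements sets the attribute on every matched element
        doc.select("div a").attr("rel", "nofollow");
        return doc;
    }

    public static void main(String[] args) {
        String html = "<div><a href='http://example.com/'>One</a>"
                + "<a href='http://example.org/'>Two</a></div>"
                + "<p><a href='http://example.net/'>Outside</a></p>";
        Document doc = addNofollow(html);

        // Links inside the div now carry the attribute; the one in <p> does not
        for (Element a : doc.select("a")) {
            System.out.println(a.text() + " -> rel present: " + a.hasAttr("rel"));
        }
    }
}
```

Because the call is made on the Elements collection, no explicit loop is needed to update each link.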
- Set the html content of the element
Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>"); // Insert HTML at the start of the div's content
div.append("<p>Last</p>"); // Insert HTML at the end of the div's content
// Result: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first(); // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// Result: <li><a href="http://example.com/"><span>One</span></a></li>
- Set the text content of the element
Element div = doc.select("div").first(); // <div></div>
div.text("five > four"); // <div>five &gt; four</div>
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>