Web Scraping using Jsoup in Java

Hi techies, in this post I want to share some information about the term “Web Scraping” and discuss a good library available in Java for it. Let’s first define web scraping in simple terms: it is the technique of extracting data from a website. Why would we need data from a website? Maybe to analyze it and come up with some helpful, meaningful outcome! Maybe to build up a dataset, or to publish a book or report using that data!

Let’s discuss the “how to” part of web scraping, keeping Java in mind as the implementation language (Python is another great option nowadays!). In Java, web scraping can be implemented using the Jsoup library, an open-source Java library used mainly for extracting data from HTML. It also allows you to manipulate and output HTML. It has a steady development line, great documentation, and a fluent, flexible API. Jsoup can also be used to parse and build XML. Below is how to pull the latest version of the Jsoup library into your project.

Maven:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

Gradle:

compile group: 'org.jsoup', name: 'jsoup', version: '1.13.1'

To stay updated with the latest versions of Jsoup, check the link below:

https://mvnrepository.com/artifact/org.jsoup/jsoup

So, let’s see a basic example of how to connect to a website and load its HTML DOM tree into our code:

Document doc = Jsoup.connect("http://example.com").get();
doc.select("p").forEach(System.out::println);

This way we can simply fetch the whole HTML DOM tree into Jsoup’s Document object and print every <p> element. Note that get() throws an IOException, which you will need to handle or declare. You can also define a timeout in milliseconds using the timeout() method:

Document doc = Jsoup.connect("http://example.com").timeout(5000).get();

Necessary headers can be provided in two ways: either with the header() method, passing a single key-value pair, or with the headers() method, passing a Map of headers. Cookies can be set the same way using cookie() or cookies(), as shown in the sketch below.
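Here is a minimal sketch of the idea, assuming the usual java.util imports; the header names, user-agent string, and cookie value are placeholders for illustration:

Map<String, String> headers = new HashMap<>();
headers.put("Accept-Language", "en-US");
headers.put("Accept-Encoding", "gzip");

Document doc = Jsoup.connect("http://example.com")
        .header("User-Agent", "Mozilla/5.0") // single key-value pair
        .headers(headers)                    // a whole Map at once
        .cookie("sessionid", "abc123")       // placeholder cookie value
        .get();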

While extracting the HTML, we might wonder: why grab all <p> tags when we could filter them by ID or class? Here’s how:

doc.select("p#pid").forEach(System.out::println); // p#pid means <p> tags with id="pid"

doc.select("p.pclass").forEach(System.out::println); // p.pclass means <p> tags with class="pclass"

And to work with nested tags, you can simply use a single space as a delimiter between the tags:

doc.select("p div").forEach(System.out::println); // selects <div> tags nested under a <p> tag

Jsoup also provides many helpful, self-explanatory APIs to traverse the DOM tree of an HTML document, for example:

Elements sections = doc.select("section"); // any select() result works here
Element firstSection = sections.first();
Element lastSection = sections.last();
Element secondSection = sections.get(1);   // get() is zero-indexed
Elements allParents = firstSection.parents();
Element parent = firstSection.parent();
Elements children = firstSection.children();
Elements siblings = firstSection.siblingElements();
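Putting it all together, here is a minimal, self-contained sketch of the whole flow; the URL and the selector are placeholders, and the checked IOException from get() is declared on main():

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo {
    public static void main(String[] args) throws IOException {
        // Connect with a timeout and parse the page into a Document
        Document doc = Jsoup.connect("http://example.com")
                .timeout(5000)
                .get();

        // Select all paragraphs and print just their text content
        Elements paragraphs = doc.select("p");
        for (Element p : paragraphs) {
            System.out.println(p.text()); // text() strips the HTML tags
        }
    }
}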

I hope the readers of this blog “Web Scraping using Jsoup in Java” find it helpful. Thank you, and I will be back with a second blog on using the same Jsoup library to extract and parse XML documents.


Henil Mamaniya
Associate Software Developer
December 28, 2020