
Web Scraping: Get the Inner Div Using Beautiful Soup


Data is all around us, from the spreadsheets we examine on a regular basis to the web pages we read and the weather forecast we rely on every morning. The information we use is often presented to us without any effort on our part, and a quick review is all that’s needed to make a judgement. For example, you take an umbrella with you because you learned there is a 75% probability of rain. However, there are many cases where the available data is so extensive that we have to dig in and conduct exploratory research.

It’s possible we won’t always have the information we need in a form that’s easy to work with right away. Sometimes the information can be retrieved via an application programming interface (API). Another option is to establish a direct connection to a database and retrieve the necessary data that way.

You may have already gained some helpful information from the internet, which is a rich data source. For example, by visiting your preferred Wikipedia page, you can see how many medals each country won at the most recent Olympics in Tokyo. Web pages also contain abundant textual material, which can be copied and pasted into a text editor. However, web scraping is another option for exploring data. To make collecting data easier, we’ll be using a Python package called Beautiful Soup in this post.

Web scraping

We can write a program to scrape the relevant web pages and extract the data we need. Web scraping is the term for what we’re doing, and the code we build to do it requires the HTML of the relevant websites. To get the information, we have to “parse” the HTML that makes up the page. Simply put, we need to do the following:

  • Identify the website that contains the data we are looking for. 
  • Save a copy of the source code. 
  • Figure out which parts of the page contain the data we require. 
  • Retrieve and sort the data for analysis. 
  • Save the information in a usable format.

In addition, some sites do not make it clear whether you are allowed to scrape their content. We therefore advise you to read and abide by the disclaimers posted on the sites you visit. In many cases, an API can be used to retrieve the data instead of scraping the site directly.
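One quick check you can automate is the site’s robots.txt file, which states which paths crawlers may fetch. The snippet below is a minimal sketch using Python’s standard library; the example.com URLs are purely illustrative:

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (the URLs here are illustrative)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether a generic user agent may fetch a given page
if rp.can_fetch("*", "https://example.com/some-page"):
    print("Scraping this page appears to be allowed.")
else:
    print("robots.txt disallows this page; consider an API instead.")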

 

HTML Primer

The previous discussion emphasised the importance of knowing how an HTML file is organised in order to navigate it successfully. To ensure that a browser displays a website’s content correctly, the page’s format, style, and structure are described in detail using HTML (or HyperText Markup Language). In addition, tags are used in HTML to indicate important parts of a document’s structure. 

Tags are written with angle brackets (< and >). We must also mark the beginning and end of the tagged content. For example, to specify the start and end of text marked with a tag named “mytag”, we use the notation <mytag> and </mytag>, respectively.

The <html> tag is the most elementary HTML tag; it indicates to the browser that the content between the tags is HTML. Accordingly, we can define a minimal HTML web page as:

<html></html> 

The document above is blank, so let us look at a more practical example:

<html>
<head>
  <title>My HTML page</title>
</head>
<body>
  <p>
    This is one paragraph.
  </p>
  <p>
    This is another paragraph. <b>HTML</b> is cool!
  </p>
  <div>
    <a href="https://blog.dominodatalab.com/"
    id="dominodatalab">Domino Datalab Blog</a>
  </div>
</body>
</html>

Here we see the same <html> tag as before, but this time with more tags nested inside it. Tags that are contained within another tag are referred to as “children,” and as you might expect, tags can have “parents.” In the example above, the <head> and <body> elements are both children of the <html> tag, making them siblings. A pleasant family! You’ll find the following tags there:

  • The <head> tag stores metadata about the page, such as the page title. 
  • The <title> tag specifies the page’s title, shown in the browser tab. 
  • The <body> tag defines the body of the page.
  • The <p> tag marks a paragraph. 
  • The <div> tag marks a division or area of the page. 
  • The <b> tag renders its text in bold. 
  • <a> is a hypertext link. In the example above it has two attributes: href, which specifies the destination the link takes the reader to, and id, which is an identifier.

Okay, let’s try parsing this page.

Getting the inner div using Beautiful Soup

If we save the above HTML document and open it in a browser, we can see its rendered content. To use that information later, however, we need to extract it programmatically. Of course, we could copy and paste the data by hand, but fortunately for us, we don’t have to, because Beautiful Soup is here to help us out.

The Beautiful Soup package for Python is capable of parsing the tags contained in XML and HTML documents.

Let’s begin by creating a string that contains the entirety of our HTML. Later in this post, we will read text directly from a live website.

my_html = """
<html>
<head>
<title>My HTML page</title>
</head>
<body>
<p>
This is one paragraph.
</p>
<p>
This is another paragraph. <b>HTML</b> is cool!
</p>
<div>
<a href="https://blog.dominodatalab.com/"
id="dominodatalab">Domino Datalab Blog</a>
</div>
</body>
</html>"""

Now that we have Beautiful Soup available, we can import it and parse the string as follows:

from bs4 import BeautifulSoup

html_soup = BeautifulSoup(my_html, 'html.parser')

Print html_soup, and you’ll see that it looks just like any other ordinary document:

print(html_soup)

<html>
<head>
<title>My HTML page</title>
</head>
<body>
  <p>
    This is one paragraph.
  </p>
  <p>
    This is another paragraph. <b>HTML</b> is cool!
  </p>
  <div>
    <a href="https://blog.dominodatalab.com/" id="dominodatalab">Domino Datalab Blog</a>
  </div>
</body>
</html>

However, there’s more to it than meets the eye. Look at the data type of the html_soup variable, and you’ll notice that it is no longer a string. Instead, it is an object of type BeautifulSoup, specifically:

type(html_soup)

bs4.BeautifulSoup

Beautiful Soup helps us understand the tags present in the HTML file we are working with. It analyses the document to find the relevant inner tags. For example, we can ask directly for the title of the webpage:

print(html_soup.title)

<title>My HTML page</title>

Or for the text inside the title tag:

print(html_soup.title.text)

My HTML page

In a similar manner, we can examine the child tags of the <body>, which are as follows:

list(html_soup.body.children)

['\n',
<p>
This is one paragraph.
</p>,
'\n',
<p>
This is another paragraph. <b>HTML</b> is cool!
</p>,
'\n',
<div>
<a href="https://blog.dominodatalab.com/" id="dominodatalab">Domino Datalab Blog</a>
</div>,
'\n']

From this viewpoint, we can select the first paragraph’s content. In the list above, it is the second item. Keep in mind that Python counts from zero, so the element at index 1 is the one of interest to us:


print(list(html_soup.body.children)[1])

<p>
This is one paragraph.
</p>

That works just fine, but Beautiful Soup can be of considerably greater assistance. For example, we can locate the first paragraph by referring to the <p> tag directly:


print(html_soup.find('p').text.strip())

This is one paragraph.

We also have the option of searching for every occurrence of a paragraph:

for paragraph in html_soup.find_all('p'):
    print(paragraph.text.strip())

This is one paragraph.
This is another paragraph. HTML is cool!


Next, let us retrieve the hyperlink mentioned in our HTML example. We can accomplish this by requesting all of the <a> tags that include an href attribute:

links = html_soup.find_all('a', href=True)

print(links)

[<a href="https://blog.dominodatalab.com/" id="dominodatalab">Domino Datalab Blog</a>]

In this case, the contents of the list links are the tags themselves. Our list consists of just one item, and we can check its type:

print(type(links[0]))

<class 'bs4.element.Tag'>

We can therefore request the attributes href and id as follows:

print(links[0]['href'], links[0]['id'])

https://blog.dominodatalab.com/ dominodatalab
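When an attribute might be absent, a Tag also supports a dictionary-style get() that returns None instead of raising a KeyError. A small sketch; the target attribute here is just an illustration:

# Safe attribute access: returns None if the attribute is missing
print(links[0].get('href'))
print(links[0].get('target'))  # not set in our example, so this prints None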

Reading the source code of a webpage

Getting data from a real website is the next step, so let’s investigate that now. The requests library will help us accomplish this.


import requests

url = "https://blog.dominodatalab.com/data-exploration-with-pandas-profiler-and-d-tale"

my_page = requests.get(url)

A successful request for the page returns a status code of 200:

my_page.status_code

200
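If something goes wrong we would see a different code (404 for a missing page, for instance). A minimal sketch of how one might guard against this using the requests library’s built-in helpers:

import requests

try:
    my_page = requests.get(url, timeout=10)
    my_page.raise_for_status()  # raises requests.HTTPError on 4xx/5xx codes
except requests.RequestException as err:
    print(f"Could not fetch the page: {err}")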

The attribute my_page.content holds the page’s actual content. We won’t print it here to avoid cluttering up this post, but feel free to inspect it in your own environment. Finally, we are going to use Beautiful Soup to make sense of the document’s tags:

blog_soup = BeautifulSoup(my_page.content, 'html.parser')

Let’s take a look at the h1 heading tag, which holds the page’s actual heading:

blog_soup.h1

<h1 class="title">
  <span class="hs_cos_wrapper hs_cos_wrapper_meta_field hs_cos_wrapper_type_text" data-hs-cos-general-type="meta_field" data-hs-cos-type="text" id="hs_cos_wrapper_name" style="">
    Data Exploration with Pandas Profiler and D-Tale
  </span>
</h1>

It has a few attributes, but we’re only interested in the text enclosed by the tag.

heading = blog_soup.h1.text

print(heading)

Data Exploration with Pandas Profiler and D-Tale

Take a look at the div with the class "author-link" to see who wrote the post:

blog_author = blog_soup.find_all('div', class_="author-link")

print(blog_author)

[<div class="author-link"> by: <a href="//blog.dominodatalab.com/author/jrogel">Dr J Rogel-Salazar </a></div>]

To avoid a clash with the Python reserved keyword class, we must use the notation class_ (with an underscore). From the output above, we can see that the div contains a hyperlink and that the author’s name is the text of the anchor tag:


blog_author[0].find('a').text

'Dr J Rogel-Salazar '

It is clear that we need to become familiar with the page’s source code. Your preferred browser’s developer tools are a convenient way to examine the specifics of a website.

Suppose we are interested in extracting the set of data-exploration goals listed in the post. The data lives in an unordered list (<ul>), while each individual entry is a list item (<li>). Unlike the other lists on the page, this one carries no class or role attributes:

blog_soup.find('ul', class_=None, role=None)

<ul>
  <li>Detecting erroneous data.</li>
  <li>Determining how much missing data there is.</li>
  <li>Understanding the structure of the data.</li>
  <li>Identifying important variables in the data.</li>
  <li>Sense-checking the validity of the data.</li>
</ul>

Now that we have the HTML list’s entries, we can import them into a Python list.


my_ul = blog_soup.find('ul', class_=None, role=None)

li_goals = my_ul.find_all('li')

goals = []

for li_goal in li_goals:
    goals.append(li_goal.string)

print(goals)

['Detecting erroneous data.',
'Determining how much missing data there is.',
'Understanding the structure of the data.',
'Identifying important variables in the data.',
'Sense-checking the validity of the data.']
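The same extraction can be written more compactly as a list comprehension; this is simply an equivalent sketch of the loop above:

goals = [li_goal.string for li_goal in my_ul.find_all('li')]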

We might also be interested in extracting the blog post’s text to perform some natural language processing. We can accomplish this with the get_text() method in a single pass.

blog_text = blog_soup.get_text()
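Note that get_text() also accepts optional arguments that can help keep text fragments from running together. A sketch, left as an option here so the output below still matches the plain call above:

# Insert a space between fragments and trim surrounding whitespace
clean_text = blog_soup.get_text(separator=' ', strip=True)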

We can now apply some natural language processing with spaCy. Below, we display each token together with its part of speech (POS), the explanation of that POS, and whether the token is a stop word. For the sake of clarity, we show only the first ten tokens.

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(blog_text)

for entry in doc[:10]:
    print(entry.text, entry.pos_,
          spacy.explain(entry.pos_),
          entry.is_stop)

 SPACE space False
Data PROPN proper noun False
Exploration PROPN proper noun False
with ADP adposition True
Pandas PROPN proper noun False
Profiler PROPN proper noun False
and CCONJ coordinating conjunction True
D PROPN proper noun False
- PUNCT punctuation False
Tale PROPN proper noun False
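From here one might, for instance, keep only alphabetic tokens that are not stop words before any further analysis. A minimal sketch using the same doc object:

# Keep lowercased alphabetic tokens, dropping stop words and punctuation
content_words = [token.text.lower() for token in doc
                 if token.is_alpha and not token.is_stop]
print(content_words[:10])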

Reading table data

To conclude, let’s apply everything we’ve learned so far to compile information presented in tabular form. In the introduction, we mentioned that comparing the number of gold medals won by each country at the Tokyo Olympics could be interesting. The relevant Wikipedia article will tell us everything we need to know.


url = 'https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table'

wiki_page = requests.get(url)

medal_soup = BeautifulSoup(wiki_page.content, 'html.parser')
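From here we can locate the medal table itself. Wikipedia data tables typically carry the class wikitable; the sketch below makes that assumption about this page’s markup (which may change over time) and prints the text of the first few rows:

# Assumes the medal table uses Wikipedia's usual "wikitable" class
medal_table = medal_soup.find('table', class_='wikitable')

for row in medal_table.find_all('tr')[:5]:
    # Header rows use <th> cells, data rows use <td> cells
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)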

Summary

With the aid of Beautiful Soup, we have learned how to parse an HTML page and decode the tags contained within it. Take advantage of what you’ve seen here to get information that might otherwise be locked inside a website. Do not forget to consider the rights to the content you are acquiring: if you’re not sure, it’s best to play it safe and read the disclaimers on the pages you plan to scrape. Finally, web scraping depends on the pre-existing structure of the target webpages. If the pages change, your code will probably break. In that situation, you’ll need to get your hands dirty, check the HTML tags again, and adjust the code accordingly.
