Hi all, I'm back with some other new area, #Web Scraping. This is actually not a new technology, but new for my blogging scope, and to me too at the time.. :-)
The expected audience is beginners for web scraping, and also if you are a beginner for Python, the free ticket (to read this blog... ha ha...) would be precious. I would explain how to scrape a web site using a typical example and don't worry if you are not familiar with Python, believe me I would teach the most basics of Python here..yes, it is simple. Anyway I am also a beginner for Python and so comments and suggestions are highly appreciated.
[Web scraping is extracting useful information from a web site]
Following is the url of the web site (of www.imdb.com) I am going to demonstrate here.
It is taken by getting the "Most Popular Feature Films Released 2005 to 2015 With User Rating Between 1.0 And 10" via advanced search option in imdb. (To go to the Advanced search option click on the small drop down arrow at the left side of search button on the top of the home page of www.imdb.com. Then click on the "Advanced Title Search" link at the right side of the page under the heading "Advanced Search". Now u have come into the advanced title search page and so give the search criterias Title Type= Feature Film, Release Date= From 2005 to 2015, User Rating = 1-10 and hit on search. This will bring u to the latter url where we are going to execute our grand mission :-) )
Above image shows the web page we are going to scrape.
And the source code of that web page can be accessed by viewing the page source. (If u are a chrome user just Right click and select "View Page Source")
Ok. Now let's see how the scraping could be implemented using BeautifulSoup python library, in python.
The task is to get the
details of all the movies including their title, genries, year, run time and rating.
I am not describing here how to configure python and BeautifulSoup and I hope you have done up to that point successfully.
Now let's dig into the code, which does the task.
ScrapeImdb.py
#from bs4 import BeautifulSoup import bs4 from urllib.request import urlopen #pass the URL url = urlopen("http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10") #read the source from the URL readHtml = url.read() #close the url url.close() #passing HTML to scrape it soup = bs4.BeautifulSoup(readHtml, 'html.parser') tableClassResults = soup.find("table", { "class" : "results" }) for row in tableClassResults.find_all('tr'): print("\n") title=row.find_all('td',{"class":"title"}) for titletd in title: print("Title:"+titletd.a.string) print("Year:"+titletd.find(class_="year_type").string) genreClass=titletd.find(class_="genre") print("Genries:") for eachGenre in genreClass.find_all('a'): print("\t"+eachGenre.string) print("Run Time:"+titletd.find(class_="runtime").string) rating_rating=titletd.find(class_="rating-rating") ratingValue=rating_rating.find(class_="value") print("Rating:"+ratingValue.string)
I will describe the code line by line. Note that in Python blocks and statements are delimited by just the indentation, which is an unusual method among popular programming languages. So you won't see semicolons (;) or curly braces({,{) as most other languages. And note that line comments are started with # in python.
from bs4 import BeautifulSoup
This Imports the name 'BeautifulSoup' from the BeautifulSoup Module bs4. A module in python is actually a file where definitions (functions and variables) are stored. Another definition for a module is "A module is a file containing Python definitions and statements". BeautifulSoup is a class and it is definitely defined in the bs4 module. Anyway now from here u can use BeautifulSoup constructor directly.from urllib.request import urlopen
Here the 'request' is a class and request.py is a python file in the 'urllib' module. 'urlopen'So two way of python importing are, 1- from moduleName import ClassName 2- from moduleName.className import functionName
is a function and it is dirrectly accessible now as it is imported. urlOpen() function open
the URL, which can be either a string or a Request object.
soup = BeautifulSoup(readHtml, 'html.parser')
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. Anyway we use here the html.parser. Here the BeautiFulSoup class constructor is called and it is given the arguments as the html file read from the given url previously and the name of the parser as string. And this return the BeautifulSoup object to the variable itsoup. Python is a Dinamically typed language and so we do not have to define a type for variables. Python automatically decide the type of a varibale at runtime.
Now we have a some kind of object which hold the html file we are scraping.
tableClassResults = soup.find("table", { "class" : "results" })
So now the task is to find the elements we want and extract information from them
sequentially.In the source of the web page you can notice that there is only one <table>
element which has the class as "results". Search results in imdb are actually in this table.
So the function find with above parameters, finds for a tag <table> having the attribute
class with the value "results" and returns that <table> element.
for row in tableClassResults.find_all('tr'):
In the 'results' table each movie is described in separate <tr> elements. So now we have to
extract these <tr> elements.
The find_all() method looks through a tag’s descendants and retrieves all descendants that matches the given filter. So calling find_all function on the tableClassResults object (which is now carrying a <table> element), with the argument 'tr' returns a list of all the <tr> elements in the <table> element.
So we are using for loop to iterate through all the <tr> elements returned and you can see the syntax for the 'for loop' is pretty simple. It is like the enhanced for loop (for each loop) in java. 'row' is the variable now that catches the <tr> element in each iteration.
So now we can grab the <td> element of the movie (that is in the row in current iteration) which has the class attribute as 'title' in which all the required details of the movie are containing.
title=row.find_all('td',{"class":"title"})
Above is how it is done using find_all method. Though we use find_all method, the variable title is assigned a list containing only one <td> element, because it has only one such with class "title". But we used find_all method for the convenience for the next iteration.
for titletd in title:
This is how we iterate through the list title, even though it has only one item, for the convenience of accessing children tags in it.
print("Title:"+titletd.a.string)
Now we can directly retrieve the title(name) of the movie. A child tag of a tag can be directly accessed by using a period(.). The contents of that element is given by the string attribute. So here, the content text of the <a> element is returned by titletd.a.string
<a href="/title/tt0478970/">Ant-Man</a>
Note that writing on the standard output(console) is done by print function in python.
print("Year:"+titletd.find(class_="year_type").string)
titletd.find(class_="year_type")
This statement searches for an element with the class value "year_type" and its content text is retrieved by getting the string attribute(python object's attribute). Note the underscore after class (class_) and please don't forget it.
All the next statements are similar to the ones I have described before. Anyway you should peek into the source code of the web page to understand this well. But for convenience I have added screenshots of relevant code parts. The first lines of the output too are displayed below.
So that is it...!!! You have scraped a web site. Now try to implement this in an advanced application.
Output:
Title:Mission: Impossible - Rogue Nation
Year:(2015)
Genries:
Action
Adventure
Thriller
Run Time:131 mins.
Rating:8.0
Title:Southpaw
Year:(2015)
Genries:
Action
Drama
Sport
Thriller
Run Time:124 mins.
Rating:7.9
Title:Ant-Man
Year:(2015)
Genries:
Action
Adventure
Sci-Fi
Run Time:117 mins.
Rating:7.8