SamRanga's Thoughts: #Web Scraping #Beginner #Python

Hi all, I'm back with another new area: #Web Scraping. It is not actually a new technology, but it is new to my blog, and it was new to me too at the time. :-)

This post is aimed at beginners in web scraping, and if you are also a beginner in Python, the free ticket (to read this blog... ha ha...) should be all the more valuable. I will explain how to scrape a web site using a typical example. Don't worry if you are not familiar with Python; believe me, I will cover the basics of Python here, and yes, it is simple. I am a Python beginner myself, so comments and suggestions are highly appreciated.

[Web scraping is extracting useful information from a web site]

Following is the URL of the web page (on www.imdb.com) that I am going to use for the demonstration.

http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10

It was obtained by searching for "Most Popular Feature Films Released 2005 to 2015 With User Rating Between 1.0 And 10" via the advanced search option in IMDb. (To get to the advanced search option, click on the small drop-down arrow at the left side of the search button at the top of the www.imdb.com home page. Then click on the "Advanced Title Search" link at the right side of the page under the heading "Advanced Search". You are now on the advanced title search page, so enter the search criteria Title Type = Feature Film, Release Date = From 2005 to 2015, User Rating = 1-10 and hit search. This will bring you to the URL above, where we are going to execute our grand mission :-) )

Above image shows the web page we are going to scrape.

The source code of that web page can be accessed by viewing the page source. (If you are a Chrome user, just right-click and select "View Page Source".)

view-source:http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10

OK. Now let's see how the scraping can be implemented in Python using the BeautifulSoup library.

The task is to get the details of all the movies, including their title, genres, year, run time and rating.

I am not describing here how to set up Python and BeautifulSoup; I assume you have already done that successfully.

Now let's dig into the code, which does the task.

ScrapeImdb.py

#from bs4 import BeautifulSoup
import bs4
from urllib.request import urlopen
#pass the URL
url = urlopen("http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10")
#read the source from the URL
readHtml = url.read()
#close the url
url.close()
#passing HTML to scrape it
soup = bs4.BeautifulSoup(readHtml, 'html.parser')

tableClassResults = soup.find("table", { "class" : "results" })
for row in tableClassResults.find_all('tr'):
    print("\n")
    title=row.find_all('td',{"class":"title"})
    for titletd in title:
        print("Title:"+titletd.a.string)
        print("Year:"+titletd.find(class_="year_type").string)
        genreClass=titletd.find(class_="genre")
        print("Genries:")
        for eachGenre in genreClass.find_all('a'):
            print("\t"+eachGenre.string)
        print("Run Time:"+titletd.find(class_="runtime").string)
        rating_rating=titletd.find(class_="rating-rating")
        ratingValue=rating_rating.find(class_="value")
        print("Rating:"+ratingValue.string)

I will describe the code line by line. Note that in Python, blocks are delimited by indentation alone, which is unusual among popular programming languages. So you won't see semicolons (;) or curly braces ({, }) as in most other languages. Also note that line comments start with # in Python.
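To make the indentation rule concrete, here is a minimal sketch. The ratings are the three values from the output shown later in this post; the function name describe and the 7.9 threshold are made up purely for illustration.

```python
# Ratings taken from the sample output of this post; everything else is illustrative.
ratings = {
    "Mission: Impossible - Rogue Nation": 8.0,
    "Southpaw": 7.9,
    "Ant-Man": 7.8,
}


def describe(rating):
    # The indented lines below form the body of the function.
    if rating >= 7.9:
        # One level deeper: the body of the "if" branch.
        return "high"
    else:
        return "good"


labels = {}
for title, rating in ratings.items():
    # Everything indented under "for" runs once per movie.
    labels[title] = describe(rating)

# Back at the left margin: the loop has ended.
print(labels)
```

If you remove or change the indentation of any of these lines, Python either reports an IndentationError or silently treats the line as belonging to a different block, so be careful when copying code.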

from bs4 import BeautifulSoup

This imports the name 'BeautifulSoup' from bs4, the Beautiful Soup package. (The script above uses the equivalent 'import bs4' and then refers to the class as bs4.BeautifulSoup; its commented-out first line shows this alternative form.) A module in Python is a file in which definitions (functions, classes and variables) are stored. The official definition is "A module is a file containing Python definitions and statements". BeautifulSoup is a class defined in bs4. With this form of import you can call the BeautifulSoup constructor directly.

from urllib.request import urlopen

Here 'urllib' is a package (a collection of modules) and 'request' is a module inside it, stored in the file request.py. 'urlopen' is a function defined in that module, and it is directly accessible now that it has been imported. The urlopen() function opens the URL, which can be given either as a string or as a Request object.
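Since urlopen() accepts a Request object as well as a string, here is a small sketch of building one. The User-Agent value and the helper name fetch_html are my own assumptions for illustration; the script in this post simply passes the URL string. Nothing is downloaded until fetch_html is actually called.

```python
from urllib.request import Request, urlopen

search_url = (
    "http://www.imdb.com/search/title"
    "?release_date=2005,2015&title_type=feature&user_rating=1.0,10"
)

# A Request object bundles the URL with optional headers.
# The User-Agent string below is a made-up example value.
request = Request(search_url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})

print(request.full_url)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))


def fetch_html(target):
    """Open a URL string or a Request object and return the raw bytes."""
    # The "with" block closes the connection automatically.
    with urlopen(target) as response:
        return response.read()
```

Calling fetch_html(request) or fetch_html(search_url) would then do the same job as the urlopen, read and close lines in the script.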

So the two forms of import used here are:
    1- from moduleName import ClassName
    2- from packageName.moduleName import functionName
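As a quick check of how these import forms relate, the sketch below uses only the standard library and shows that different import styles give you the very same function object, just under different names.

```python
# Form A: import the module, then use the full dotted name.
import urllib.request

# Form B: import one name from the module and use it directly.
from urllib.request import urlopen

# Form C: import the module itself from its package.
from urllib import request

# All three spellings refer to the same function object.
print(urllib.request.urlopen is urlopen)
print(request.urlopen is urlopen)
```

Which form to use is mostly a matter of taste: the longer names make it obvious where a function comes from, while the short form keeps the code compact.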

soup = BeautifulSoup(readHtml, 'html.parser')

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. Here we use html.parser. The BeautifulSoup class constructor is called with two arguments: the HTML that was read from the given URL previously, and the name of the parser as a string. It returns a BeautifulSoup object, which is assigned to the variable soup. Python is a dynamically typed language, so we do not have to declare a type for variables; the type belongs to the value, and Python determines it at runtime.
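You can try the constructor without touching the network by parsing a small HTML string. The markup below is invented for this sketch, and the second half only illustrates dynamic typing with a throwaway variable.

```python
import bs4

# A tiny hand-written HTML document, used instead of a downloaded page.
sample_html = "<html><body><p class='note'>Hello</p></body></html>"

soup = bs4.BeautifulSoup(sample_html, "html.parser")
print(type(soup).__name__)   # the class of the object we got back
print(soup.p.string)         # text of the first <p> element

# Dynamic typing: the same variable name can hold values of different types.
value = 10
first_type = type(value).__name__
value = "ten"
second_type = type(value).__name__
print(first_type, second_type)
```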

Now we have an object that holds the HTML document we are scraping.
tableClassResults = soup.find("table", { "class" : "results" })
So now the task is to find the elements we want and extract information from them one by one. In the source of the web page you can see that there is only one <table> element with the class "results". The IMDb search results are in this table.

So the find function with the above parameters searches for a <table> tag whose class attribute has the value "results" and returns that <table> element. If no such element exists, find returns None.
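Here is a minimal sketch of that find call on a hand-written fragment. The markup is a simplified stand-in I made up for the real page (only the "results" class, the "title" class and the Ant-Man link come from this post), and it also shows what happens when nothing matches.

```python
import bs4

# Simplified, hypothetical markup with two tables; only one has class "results".
sample_html = """
<table class="other"><tr><td>ignored</td></tr></table>
<table class="results">
  <tr><td class="title"><a href="/title/tt0478970/">Ant-Man</a></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Same call as in the script: tag name plus a dictionary of attributes.
results_table = soup.find("table", {"class": "results"})
print(results_table["class"])
print(results_table.a.string)

# When nothing matches, find returns None instead of raising an error.
missing = soup.find("table", {"class": "no-such-class"})
print(missing)
```

Because of that None, it is worth checking the result before calling find_all on it; otherwise a change in the page layout shows up as an AttributeError on 'NoneType'.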
for row in tableClassResults.find_all('tr'):

In the 'results' table each movie is described in its own <tr> element. So now we have to extract these <tr> elements.

The find_all() method looks through a tag’s descendants and retrieves all descendants that match the given filter. So calling find_all on the tableClassResults object (which now holds a <table> element) with the argument 'tr' returns a list of all the <tr> elements in that <table> element.

We then use a for loop to iterate through all the <tr> elements returned, and you can see that the syntax of the for loop is pretty simple. It is like the enhanced for loop (for-each loop) in Java. 'row' is the variable that holds the current <tr> element in each iteration.
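The loop can be tried on a small fragment as well. The three titles come from the output of this post; the table markup around them is a simplified assumption, not the real page source.

```python
import bs4

# Hypothetical, simplified version of the results table.
sample_html = """
<table class="results">
  <tr><td class="title"><a>Mission: Impossible - Rogue Nation</a></td></tr>
  <tr><td class="title"><a>Southpaw</a></td></tr>
  <tr><td class="title"><a>Ant-Man</a></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
results_table = soup.find("table", {"class": "results"})

# find_all returns a list-like collection of every matching descendant.
rows = results_table.find_all("tr")
print(len(rows))

titles = []
for row in rows:
    # "row" is one <tr> element per iteration.
    titles.append(str(row.a.string))

print(titles)
```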

So now we can grab the <td> element of the movie (the one in the row of the current iteration) whose class attribute is 'title'; it contains all the required details of the movie.

title=row.find_all('td',{"class":"title"})
Above is how it is done using the find_all method. Although we use find_all, the variable title is assigned a list containing only one <td> element, because each row has only one <td> with the class "title". We used find_all so that the next loop works in every case: if a row has no such cell, the list is empty and the loop body is simply skipped.
for titletd in title:
This is how we iterate through the list title, even though it has only one item, so that we can conveniently access the child tags inside it.
print("Title:"+titletd.a.string)
Now we can directly retrieve the title (name) of the movie. The first tag with a given name inside a tag can be accessed directly by using a period (.). The content of that element is given by the string attribute. So here, the text content of the <a> element is returned by titletd.a.string
<a href="/title/tt0478970/">Ant-Man</a>
Note that writing to the standard output (console) is done with the print function in Python.

print("Year:"+titletd.find(class_="year_type").string)
titletd.find(class_="year_type")
This statement searches for an element with the class value "year_type", and its text content is retrieved through the string attribute (an attribute of the Python object). Note the underscore after class (class_) and please don't forget it: class is a reserved word in Python, so Beautiful Soup uses class_ as the keyword argument instead.
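To see class_ in action, here is a small sketch on a hand-written <td>. The markup is a simplified assumption built from the class names used in this post; it also shows that the keyword form and the dictionary form find the same element.

```python
import bs4

# Hypothetical, simplified title cell.
sample_html = """
<td class="title">
  <a href="/title/tt0478970/">Ant-Man</a>
  <span class="year_type">(2015)</span>
</td>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
title_td = soup.find("td", class_="title")

# Keyword form: note the trailing underscore, since "class" is reserved.
by_keyword = title_td.find(class_="year_type")

# Dictionary form: the attribute name is an ordinary string here.
by_dict = title_td.find(attrs={"class": "year_type"})

print(by_keyword.string)
print(by_dict.string)
```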




All the remaining statements are similar to the ones I have described before. You should still peek into the source code of the web page to understand them well, but for convenience I have added screenshots of the relevant code parts. The first lines of the output are displayed below too.
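The remaining lookups (genre, runtime, rating) all follow the same find pattern, so one thing worth sketching is what to do when an element is missing. The helper name text_or_default and the "N/A" fallback are my own assumptions, not part of the script, and the markup is again a simplified stand-in for the real page.

```python
import bs4


def text_or_default(parent, css_class, default="N/A"):
    """Return the text of the first descendant with css_class, or default."""
    found = parent.find(class_=css_class)
    if found is None:
        # find returns None when nothing matches, so fall back gracefully.
        return default
    return found.get_text(strip=True)


# Hypothetical title cell that has a year and genres but no runtime.
sample_html = """
<td class="title">
  <a href="/title/tt0478970/">Ant-Man</a>
  <span class="year_type">(2015)</span>
  <span class="genre"><a>Action</a> | <a>Adventure</a> | <a>Sci-Fi</a></span>
</td>
"""

title_td = bs4.BeautifulSoup(sample_html, "html.parser").find("td", class_="title")

year = text_or_default(title_td, "year_type")
runtime = text_or_default(title_td, "runtime")

# Collect every <a> inside the genre element, as the script does in its inner loop.
genre_tag = title_td.find(class_="genre")
genres = [a.get_text(strip=True) for a in genre_tag.find_all("a")]

print(year, runtime, genres)
```

With a helper like this, a movie that lacks a runtime or rating prints a placeholder instead of stopping the whole script with an error.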

So that is it...!!! You have scraped a web site. Now try to apply this in a more advanced application.


Output:
Title:Mission: Impossible - Rogue Nation
Year:(2015)
Genries:
 Action
 Adventure
 Thriller
Run Time:131 mins.
Rating:8.0


Title:Southpaw
Year:(2015)
Genries:
 Action
 Drama
 Sport
 Thriller
Run Time:124 mins.
Rating:7.9


Title:Ant-Man
Year:(2015)
Genries:
 Action
 Adventure
 Sci-Fi
Run Time:117 mins.
Rating:7.8

26 comments:

Unknown, October 16, 2015 at 4:22 AM
Hello. I ran your script on Python 2.7.10, but to make it work correctly I had to change some lines:

import bs4
import urllib2
#pass the URL
url = urllib2.urlopen("http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10")
#read the source from the URL
readHtml = url.read()
#close the url
url.close()
#passing HTML to scrape it
soup = bs4.BeautifulSoup(readHtml, 'html.parser')

tableClassResults = soup.find("table", { "class" : "results" })
for row in tableClassResults.find_all('tr'):
    print("\n")
    title=row.find_all('td',{"class":"title"})
    for titletd in title:
        print("Title:"+titletd.a.string)
        print("Year:"+titletd.find(class_="year_type").string)
        genreClass=titletd.find(class_="genre")
        print("Genries:")
        for eachGenre in genreClass.find_all('a'):
            print("\t"+eachGenre.string)
        print("Run Time:"+titletd.find(class_="runtime").string)
        rating_rating=titletd.find(class_="rating-rating")
        ratingValue=rating_rating.find(class_="value")
        print("Rating:"+ratingValue.string)
Unknown, October 16, 2015 at 7:22 AM
Traceback (most recent call last):
File "C:/Python27/133.py", line 2, in
from urllib.request import urlopen
ImportError: No module named request

Maybe this error appears because I use urllib2... But now I can't find how to install urllib.request
Anonymous, October 25, 2015 at 9:52 PM
Hello, can I use PHP cURL to grab data from an appspot site?

I have hosting and I want to grab data from an appspot site, but I get an error.
mjp, November 12, 2015 at 2:05 AM
Do you have an email address, boss? ... I want to send you something.
Anonymous, November 19, 2015 at 6:39 AM
Hi, thanks for your article. I get this error in PyCharm: NameError: name 'titletd' is not defined :(
Unknown, February 23, 2016 at 1:56 PM
Hi Samitha, I keep getting this when I run your script, any ideas?

>>> import bs4
Traceback (most recent call last):
File "", line 1, in
import bs4
File "C:\Users\Samantha\AppData\Local\Programs\Python\Python35-32\lib\bs4\__init__.py", line 328
print soup.prettify()
^
SyntaxError: invalid syntax
Dan, October 17, 2016 at 11:23 PM
Hi Samitha, when I copy/paste your script into Python 3.4, I keep getting this error:

Traceback (most recent call last):
File "C:\TEMP\IMDB.py", line 14, in
for row in tableClassResults.find_all('tr'):
AttributeError: 'NoneType' object has no attribute 'find_all'

Any suggestions on how to resolve this?
--Dan
Allea John, July 2, 2018 at 1:37 AM
The post is very good, and the way you explain each thing is very nice. Does anyone know how good Python is for web scraping?
Unknown, March 7, 2020 at 10:26 AM
Hello, how should we modify this code if there is more than one URL, for example 50 URLs? Can the program read the URLs from a txt file, then read the data from the web site, and then export all the imported data to an Excel file? I need such a program to import the prices of hundreds of products from a sales site. Thanks.

Comments are highly appreciated... :-)

Saturday, August 1, 2015
