Saturday, August 1, 2015

#Web Scraping #Beginner #Python

Hi all, I'm back with another new area: #Web Scraping. It is not actually a new technology, but it is new to my blog, and it was new to me too at the time. :-)

This post is aimed at beginners in web scraping, and if you are also a beginner in Python, this free ticket (to read this blog... ha ha...) should be valuable. I will explain how to scrape a web site using a typical example. Don't worry if you are not familiar with Python; believe me, I will cover the basics of Python here, and yes, it is simple. I am a beginner in Python too, so comments and suggestions are highly appreciated.

[Web scraping is extracting useful information from a web site]
The following is the URL of the web page (on www.imdb.com) that I am going to use for the demonstration.
http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10
It was obtained by searching for "Most Popular Feature Films Released 2005 to 2015 With User Rating Between 1.0 And 10" through the advanced search option on IMDb. (To get to the advanced search, click the small drop-down arrow at the left side of the search button at the top of the www.imdb.com home page. Then click the "Advanced Title Search" link on the right side of the page, under the heading "Advanced Search". You are now on the advanced title search page, so enter the search criteria Title Type = Feature Film, Release Date = From 2005 to 2015, User Rating = 1-10, and hit search. This brings you to the URL above, where we are going to carry out our grand mission :-) )


The image above shows the web page we are going to scrape.
The source code of that web page can be seen by viewing the page source. (If you are a Chrome user, just right-click and select "View Page Source".)
OK. Now let's see how the scraping can be implemented in Python using the BeautifulSoup library.
The task is to get the details of all the movies, including their title, genres, year, run time and rating.
I am not describing how to set up Python and BeautifulSoup here; I assume you have already done that successfully.

Now let's dig into the code, which does the task.

ScrapeImdb.py
from bs4 import BeautifulSoup
from urllib.request import urlopen
# pass the URL
url = urlopen("http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10")
# read the source from the URL
readHtml = url.read()
# close the url
url.close()
# pass the HTML to BeautifulSoup to parse it
soup = BeautifulSoup(readHtml, 'html.parser')

tableClassResults = soup.find("table", { "class" : "results" })
for row in tableClassResults.find_all('tr'):
    print("\n")
    title=row.find_all('td',{"class":"title"})
    for titletd in title:
        print("Title:"+titletd.a.string)
        print("Year:"+titletd.find(class_="year_type").string)
        genreClass=titletd.find(class_="genre")
        print("Genries:")
        for eachGenre in genreClass.find_all('a'):
            print("\t"+eachGenre.string)
        print("Run Time:"+titletd.find(class_="runtime").string)
        rating_rating=titletd.find(class_="rating-rating")
        ratingValue=rating_rating.find(class_="value")
        print("Rating:"+ratingValue.string)

I will describe the code line by line. Note that in Python, blocks of statements are delimited by indentation alone, which is unusual among popular programming languages. So you won't see semicolons (;) or curly braces ({, }) as in most other languages. Also note that line comments start with # in Python.
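To make the point about indentation concrete, here is a minimal sketch. The function name describe and the sample numbers are made up for this illustration; only the indentation tells Python which statements belong to the function, the loop and the if/else.

```python
# Indentation alone marks where each block starts and ends.
def describe(numbers):
    labels = []
    for n in numbers:              # the loop body is everything indented below
        if n % 2 == 0:             # the "if" body is indented one level deeper
            labels.append("even")
        else:
            labels.append("odd")
    return labels                  # back at function level: runs after the loop


print(describe([1, 2, 3]))  # prints ['odd', 'even', 'odd']
```

Moving the return line one level to the right would put it inside the loop, and the function would then return after the first number.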
 
from bs4 import BeautifulSoup
This imports the name 'BeautifulSoup' from bs4, the Beautiful Soup package. A module in Python is a file in which definitions (functions, classes and variables) are stored. Another way to put it is "A module is a file containing Python definitions and statements". BeautifulSoup is a class made available by bs4. From here on you can call the BeautifulSoup constructor directly.
from urllib.request import urlopen
Here 'urllib' is a package and 'request' is a module inside it (the file request.py). 'urlopen' is a function in that module, and it is directly accessible once it has been imported. The urlopen() function opens the URL, which can be given either as a string or as a Request object.
So the two import forms used here are: 1- from moduleName import ClassName 2- from packageName.moduleName import functionName
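The import forms can be tried out with the standard library alone, without any network access. This is only a small sketch: it imports urlopen in two different ways and checks that both spellings lead to the same function.

```python
# Style 1: import a single name from a module.
from urllib.request import urlopen

# Style 2: import the module itself and reach the name through it.
import urllib.request

# Both spellings refer to the very same function object.
same_function = urlopen is urllib.request.urlopen
print(same_function)                   # prints True

# urllib.request is a module inside the urllib package.
print(type(urllib.request).__name__)   # prints module
```

With the first style you write urlopen(...) directly; with the second you write urllib.request.urlopen(...), which is longer but shows where the function comes from.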
 
soup = BeautifulSoup(readHtml, 'html.parser')
Beautiful Soup supports the HTML parser included in Python’s standard library, and it also supports a number of third-party Python parsers. Here we use html.parser. The BeautifulSoup class constructor is called with two arguments: the HTML read from the URL earlier, and the name of the parser as a string. It returns a BeautifulSoup object, which is assigned to the variable soup. Python is a dynamically typed language, so we do not have to declare a type for variables; the type of a value is determined at run time.
Now we have an object that holds the HTML document we are scraping.
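If you want to try the constructor without touching the network, a tiny hand-written document is enough. In the sketch below, sample_html is a hypothetical stand-in for the bytes that url.read() would return; it is not taken from IMDb.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by url.read().
sample_html = b"<html><body><p>Hello, soup!</p></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")

# No type was declared for soup; Python worked it out at run time.
print(type(soup).__name__)   # prints BeautifulSoup
print(soup.p.string)         # prints Hello, soup!
```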
tableClassResults = soup.find("table", { "class" : "results" })
So now the task is to find the elements we want and extract information from them one by one. In the source of the web page you can see that there is only one <table> element with the class "results". The IMDb search results are in this table. So the find function, called with the parameters above, searches for a <table> tag whose class attribute has the value "results" and returns that <table> element.
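Here is a small sketch of find() on made-up markup (two tables, only one of them with the class "results"). It also shows what happens when nothing matches: find() returns None, and calling a method such as find_all on None raises an AttributeError, so it is worth checking the result first. The class name "no-such-class" is just a placeholder for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only one has class="results".
sample_html = """
<table class="nav"><tr><td>menu</td></tr></table>
<table class="results"><tr><td>first result</td></tr></table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

results_table = soup.find("table", {"class": "results"})
print(results_table.td.string)   # prints first result

# find() returns None when nothing matches, so check before using the result.
missing_table = soup.find("table", {"class": "no-such-class"})
if missing_table is None:
    print("No matching table found")
```

A check like this is useful in a real scraper, because a site can change its page layout at any time.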
for row in tableClassResults.find_all('tr'):
In the 'results' table each movie is described in its own <tr> element, so now we have to extract these <tr> elements.
The find_all() method looks through a tag’s descendants and retrieves all descendants that match the given filter. So calling find_all on the tableClassResults object (which now holds a <table> element) with the argument 'tr' returns a list of all the <tr> elements in that <table> element.
We use a for loop to iterate through all the <tr> elements returned, and you can see that the syntax of the for loop is pretty simple. It is like the enhanced for loop (for-each loop) in Java. 'row' is the variable that holds the <tr> element in each iteration.
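As a small illustration of find_all() and the for loop, the sketch below runs them on a hand-written table. The markup and the movie names are placeholders invented for this example, not the real IMDb page.

```python
from bs4 import BeautifulSoup

# Hypothetical table with one <tr> per movie.
sample_html = """
<table class="results">
  <tr><td class="title">Movie A</td></tr>
  <tr><td class="title">Movie B</td></tr>
  <tr><td class="title">Movie C</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"class": "results"})

rows = table.find_all("tr")   # a list-like collection of <tr> tags
print(len(rows))              # prints 3

for row in rows:              # row holds one <tr> element per iteration
    print(row.td.string)
```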
Now we can grab the <td> element of the movie (the one in the row of the current iteration) whose class attribute is 'title'; it contains all the details of the movie that we need.
title=row.find_all('td',{"class":"title"})
This is how it is done using the find_all method. Although we use find_all, the variable title is assigned a list containing only one <td> element, because the row has only one <td> with the class "title". We used find_all for the convenience of the iteration that follows.
for titletd in title:
This is how we iterate through the list title, even though it has only one item, for the convenience of accessing the child tags inside it.
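One practical effect of looping over the result of find_all, sketched below on made-up markup: if a row has no <td> with the class "title" (here a hypothetical header row), find_all returns an empty list and the loop body is simply skipped, whereas find would return None and would need an explicit check.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: a header row without a "title" cell, then a movie row.
sample_html = """
<table>
  <tr><th>Header</th></tr>
  <tr><td class="title"><a href="#">Some Movie</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
header_row, movie_row = soup.find_all("tr")

# find_all() gives an empty list when nothing matches, so a loop over it does nothing.
print(len(header_row.find_all("td", {"class": "title"})))   # prints 0
# find() gives None instead, which would have to be checked before use.
print(header_row.find("td", {"class": "title"}))            # prints None

for titletd in movie_row.find_all("td", {"class": "title"}):
    print(titletd.a.string)                                  # prints Some Movie
```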
print("Title:"+titletd.a.string)
Now we can directly retrieve the title (name) of the movie. A tag nested inside another tag can be accessed directly by name using a period (.); this gives the first such tag. The text content of that element is given by its string attribute. So here, titletd.a.string returns the text inside the <a> element:
<a href="/title/tt0478970/">Ant-Man</a>
Note that writing to the standard output (console) is done with the print function in Python.
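The dot access and the string attribute can be tried on the <a> element shown above. In this sketch the surrounding <td> and the second snippet (the variable mixed) are assumptions added for illustration. One thing to keep in mind: string is None when a tag has more than one child, and get_text() can be used in that case.

```python
from bs4 import BeautifulSoup

sample_html = '<td class="title"><a href="/title/tt0478970/">Ant-Man</a></td>'
titletd = BeautifulSoup(sample_html, "html.parser").td

print(titletd.a.string)    # prints Ant-Man
print(titletd.a["href"])   # attributes are read like dictionary keys

# A tag with several children has no single string.
mixed = BeautifulSoup("<p>one <b>two</b></p>", "html.parser").p
print(mixed.string)        # prints None
print(mixed.get_text())    # prints one two
```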
print("Year:"+titletd.find(class_="year_type").string)
titletd.find(class_="year_type")
This expression searches for an element with the class value "year_type", and its text content is retrieved through the string attribute (an attribute of the Python object). Note the underscore after class (class_) and please don't forget it: class is a reserved word in Python, so it cannot be used as a keyword argument name.
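The sketch below compares the class_ keyword with the dictionary form used earlier for the table. The <span> tag is an assumption made for this example; only the class name "year_type" and the text "(2015)" come from the post.

```python
from bs4 import BeautifulSoup

# Hypothetical cell; the real page may use a different tag for the year.
sample_html = '<td class="title"><span class="year_type">(2015)</span></td>'
titletd = BeautifulSoup(sample_html, "html.parser").td

by_keyword = titletd.find(class_="year_type")          # keyword form
by_dict = titletd.find(attrs={"class": "year_type"})   # dictionary form

print(by_keyword.string)       # prints (2015)
print(by_keyword == by_dict)   # both forms find the same element
```

Writing titletd.find(class="year_type") without the underscore is a syntax error, which is why Beautiful Soup offers class_.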
All the remaining statements are similar to the ones I have described above. You should still look at the source code of the web page to understand this well, but for convenience I have added screenshots of the relevant parts of the source. The first lines of the output are also shown below.
So that is it...!!! You have scraped a web site. Now try to use this in a more advanced application.
Output:
Title:Mission: Impossible - Rogue Nation
Year:(2015)
Genries:
 Action
 Adventure
 Thriller
Run Time:131 mins.
Rating:8.0


Title:Southpaw
Year:(2015)
Genries:
 Action
 Drama
 Sport
 Thriller
Run Time:124 mins.
Rating:7.9


Title:Ant-Man
Year:(2015)
Genries:
 Action
 Adventure
 Sci-Fi
Run Time:117 mins.
Rating:7.8
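As one possible next step towards a larger application, the same steps can be wrapped in a function that returns the movies as dictionaries instead of printing them. This is only a sketch under assumptions: the function name parse_movies is made up, and sample_html is a hand-written fragment modelled on the class names used in the script, not a copy of the real IMDb page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled only on the class names used in the post.
sample_html = """
<table class="results">
  <tr><th>header</th></tr>
  <tr>
    <td class="title">
      <a href="/title/tt0478970/">Ant-Man</a>
      <span class="year_type">(2015)</span>
      <span class="genre"><a>Action</a> | <a>Adventure</a> | <a>Sci-Fi</a></span>
      <span class="runtime">117 mins.</span>
      <div class="rating-rating"><span class="value">7.8</span></div>
    </td>
  </tr>
</table>
"""


def parse_movies(html):
    """Return one dict per movie found in the 'results' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "results"})
    if table is None:          # no such table in this document
        return []
    movies = []
    for titletd in table.find_all("td", {"class": "title"}):
        rating = titletd.find(class_="rating-rating")
        movies.append({
            "title": titletd.a.string,
            "year": titletd.find(class_="year_type").string,
            "genres": [a.string for a in titletd.find(class_="genre").find_all("a")],
            "runtime": titletd.find(class_="runtime").string,
            "rating": rating.find(class_="value").string,
        })
    return movies


for movie in parse_movies(sample_html):
    print(movie["title"], movie["year"], movie["genres"], movie["rating"])
```

Returning a list of dictionaries keeps the parsing separate from the printing, so the same data could later be written to a file or a database.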


20 comments:

  1. This comment has been removed by a blog administrator.

  2. Hello. I ran your script on Python 2.7.10, but to make it work correctly I needed to fix some lines:

    import bs4
    import urllib2
    #pass the URL
    url = urllib2.urlopen("http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10")
    #read the source from the URL
    readHtml = url.read()
    #close the url
    url.close()
    #passing HTML to scrape it
    soup = bs4.BeautifulSoup(readHtml, 'html.parser')

    tableClassResults = soup.find("table", { "class" : "results" })
    for row in tableClassResults.find_all('tr'):
        print("\n")
        title=row.find_all('td',{"class":"title"})
        for titletd in title:
            print("Title:"+titletd.a.string)
            print("Year:"+titletd.find(class_="year_type").string)
            genreClass=titletd.find(class_="genre")
            print("Genries:")
            for eachGenre in genreClass.find_all('a'):
                print("\t"+eachGenre.string)
            print("Run Time:"+titletd.find(class_="runtime").string)
            rating_rating=titletd.find(class_="rating-rating")
            ratingValue=rating_rating.find(class_="value")
            print("Rating:"+ratingValue.string)

    Replies
    1. What is the error you get? (And as far as I can see, the program you have shown in your comment is just the same as the program I posted.)

  3. Traceback (most recent call last):
    File "C:/Python27/133.py", line 2, in <module>
    from urllib.request import urlopen
    ImportError: No module named request

    Maybe this error appears because I use urllib2... But now I can't find how to install urllib.request

    Replies
    1. It is probably because your Python version is 2.7. With Python 3.x I guess that error won't occur. So try using from urllib2 import urlopen, and then call the urlopen function directly.

    2. You do not need to install urllib.request anyway. Besides the solution above, the following might work too: import urllib
      url = urllib.urlopen("http://www.imdb.com/....")

  4. Hello, can PHP cURL be used to grab data from an appspot site?

    I have hosting and I want to grab data from an appspot site, but I get an error.

  5. Do you have an email address, boss .... I want to send it to you.

  6. Nice post, and the line by line explanation is very good. Web scraping is awesome and I love it. I have been doing web scraping for the last 6 years. Here is my website to look at: http://prowebscraping.com

  7. Hi, thanks for your article. I get this error on PyCharm : NameError: name 'titletd' is not defined :(

  8. Hi Samitha, I keep getting this when I run your script, any ideas?

    >>> import bs4
    Traceback (most recent call last):
    File "", line 1, in
    import bs4
    File "C:\Users\Samantha\AppData\Local\Programs\Python\Python35-32\lib\bs4\__init__.py", line 328
    print soup.prettify()
    ^
    SyntaxError: invalid syntax

    Replies
    1. In Python 3 print is a function, so you should use print(soup.prettify())
      And have you installed bs4 correctly?

    2. I think I have installed bs4 correctly, but I haven't been able to get it to work, and as this is my first time using that package it's highly possible that I haven't. I'll try to amend it and see what happens. :) Thanks for replying. S*

    3. Installed a different Python compiler and I am away Samitha. Thank you.

  9. Hi Samitha, when I copy/paste your script in Python 3.4, I keep getting this error:

    Traceback (most recent call last):
    File "C:\TEMP\IMDB.py", line 14, in <module>
    for row in tableClassResults.find_all('tr'):
    AttributeError: 'NoneType' object has no attribute 'find_all'

    Any suggestions on how to resolve this?
    --Dan


Comments are highly appreciated... :-)