Saturday, August 1, 2015

#Web Scraping #Beginner #Python

Hi all, I'm back with a new area: #Web Scraping. This is actually not a new technology, but it was new to my blogging scope, and to me too at the time.. :-)

The expected audience is beginners to web scraping, and if you are a beginner to Python too, the free ticket (to read this blog... ha ha...) will be all the more precious. I will explain how to scrape a web site using a typical example, and don't worry if you are not familiar with Python; believe me, I will teach the most basic Python needed here.. yes, it is simple. I am a beginner to Python myself, so comments and suggestions are highly appreciated.

[Web scraping is extracting useful information from a web site]
Following is the URL of the web page (on www.imdb.com) I am going to demonstrate with:
http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10
It was obtained by searching for the "Most Popular Feature Films Released 2005 to 2015 With User Rating Between 1.0 And 10" via the advanced search option in IMDb. (To reach the advanced search option, click on the small drop-down arrow at the left side of the search button at the top of the home page of www.imdb.com. Then click on the "Advanced Title Search" link at the right side of the page, under the heading "Advanced Search". Now you are on the advanced title search page, so give the search criteria Title Type = Feature Film, Release Date = From 2005 to 2015, User Rating = 1-10, and hit Search. This will bring you to the above URL, where we are going to execute our grand mission :-) )


The above image shows the web page we are going to scrape.
The source code of that web page can be accessed by viewing the page source. (If you are a Chrome user, just right-click and select "View Page Source".)
OK. Now let's see how the scraping can be implemented in Python using the BeautifulSoup library.
The task is to get the details of all the movies, including their title, genres, year, runtime and rating.
I am not describing how to install and configure Python and BeautifulSoup here, and I hope you have gotten up to that point successfully.

Now let's dig into the code, which does the task.

ScrapeImdb.py
from bs4 import BeautifulSoup
from urllib.request import urlopen

#open the URL
url = urlopen("http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10")
#read the HTML source from the URL
readHtml = url.read()
#close the connection
url.close()
#parse the HTML so it can be scraped
soup = BeautifulSoup(readHtml, 'html.parser')

#the search results are in the only <table> with class "results"
tableClassResults = soup.find("table", { "class" : "results" })
#each movie is described in a separate <tr> row
for row in tableClassResults.find_all('tr'):
    print("\n")
    #each movie row holds one <td> with class "title" containing all the details
    title = row.find_all('td', {"class": "title"})
    for titletd in title:
        print("Title:" + titletd.a.string)
        print("Year:" + titletd.find(class_="year_type").string)
        genreClass = titletd.find(class_="genre")
        print("Genres:")
        #each genre is a separate <a> link inside the genre element
        for eachGenre in genreClass.find_all('a'):
            print("\t" + eachGenre.string)
        print("Run Time:" + titletd.find(class_="runtime").string)
        rating_rating = titletd.find(class_="rating-rating")
        ratingValue = rating_rating.find(class_="value")
        print("Rating:" + ratingValue.string)

I will describe the code line by line. Note that in Python, blocks and statements are delimited just by indentation, which is unusual among popular programming languages. So you won't see semicolons (;) or curly braces ({ }) as in most other languages. Also note that line comments start with # in Python.
 
from bs4 import BeautifulSoup
This imports the name 'BeautifulSoup' from the module bs4. A module in Python is actually a file where definitions (functions, classes and variables) are stored; another definition is "a module is a file containing Python definitions and statements". BeautifulSoup is a class, and it is defined in the bs4 module. From here on you can use the BeautifulSoup constructor directly.
from urllib.request import urlopen
Here 'urllib' is a package and 'request' (the file request.py) is a module inside it. 'urlopen' is a function in that module, and it is directly accessible now that it has been imported. The urlopen() function opens the URL, which can be either a string or a Request object.
So two ways of importing in Python are: 1 - from moduleName import ClassName, 2 - from packageName.moduleName import functionName.
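To make the difference concrete, here is a minimal sketch of the two styles (both work; only the way you refer to the name afterwards changes):

# style 1: import the module, qualify names with the module name
import bs4
soup = bs4.BeautifulSoup("<html></html>", 'html.parser')

# style 2: import a specific name from the module, use it directly
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html></html>", 'html.parser')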
 
soup = BeautifulSoup(readHtml, 'html.parser')
Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. Here we use html.parser. The BeautifulSoup class constructor is called with two arguments: the HTML read from the given URL previously, and the name of the parser as a string. This returns a BeautifulSoup object, assigned to the variable soup. Python is a dynamically typed language, so we do not have to declare a type for variables; Python decides the type of a variable at runtime.
Now we have an object which holds the parsed HTML of the page we are scraping.
tableClassResults = soup.find("table", { "class" : "results" })
So now the task is to find the elements we want and extract information from them sequentially. In the source of the web page you can notice that there is only one <table> element which has the class "results"; the IMDb search results are actually in this table. So the find function, with the above parameters, searches for a <table> tag having the attribute class with the value "results" and returns that <table> element.
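As a side note, BeautifulSoup offers a keyword shortcut for matching by class, so the following sketch is equivalent to the call above:

tableClassResults = soup.find("table", class_="results")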
for row in tableClassResults.find_all('tr'):
In the 'results' table, each movie is described in a separate <tr> element, so now we have to extract these <tr> elements.
The find_all() method looks through a tag's descendants and retrieves all descendants that match the given filter. So calling find_all on the tableClassResults object (which is now holding a <table> element) with the argument 'tr' returns a list of all the <tr> elements in the <table> element.
We use a for loop to iterate through all the returned <tr> elements, and you can see the syntax of the for loop is pretty simple; it is like the enhanced for loop (for-each loop) in Java. 'row' is the variable that catches the <tr> element in each iteration.
So now we can grab the <td> element of the movie (in the row of the current iteration) which has the class attribute 'title' and which contains all the required details of the movie.
title=row.find_all('td',{"class":"title"})
Above is how it is done using the find_all method. Although we use find_all, the variable title is assigned a list containing only one <td> element, because each row has only one <td> with class "title". We used find_all for the convenience of the next iteration.
for titletd in title:
This is how we iterate through the list title, even though it has only one item, for the convenience of accessing the child tags in it.
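As a side sketch, the same could be done with find(), which returns a single tag or None; the None check then does the job that the empty-list iteration does above (header rows, for example, have no such <td>):

titletd = row.find('td', {"class": "title"})
if titletd is not None:
    print("Title:" + titletd.a.string)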
print("Title:"+titletd.a.string)
Now we can directly retrieve the title (name) of the movie. A child tag of a tag can be accessed directly using a period (.), and the text content of an element is given by its string attribute. So here, the text content of the <a> element below is returned by titletd.a.string:
<a href="/title/tt0478970/">Ant-Man</a>
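As a quick self-contained sketch of this navigation (parsing just the snippet above):

from bs4 import BeautifulSoup
td = BeautifulSoup('<td class="title"><a href="/title/tt0478970/">Ant-Man</a></td>', 'html.parser').td
print(td.a.string)   # prints: Ant-Man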
Note that writing on the standard output(console) is done by print function in python.
print("Year:"+titletd.find(class_="year_type").string)
titletd.find(class_="year_type")
This statement searches for an element with the class value "year_type", and its text content is retrieved via the string attribute (a Python object attribute). Note the underscore after class (class_) and please don't forget it; it is needed because class is a reserved keyword in Python.
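If you prefer to avoid the trailing underscore, a sketch of an equivalent call passes the class through the attrs dictionary instead:

titletd.find(attrs={"class": "year_type"})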
All the remaining statements are similar to the ones described above. You should still peek into the source code of the web page to understand this well, but for convenience I have added screenshots of the relevant code parts. The first lines of the output are displayed below.
So that is it...!!! You have scraped a web site. Now try to implement this in an advanced application; a small sketch of one possible extension follows after the output below.
Output:
Title:Mission: Impossible - Rogue Nation
Year:(2015)
Genres:
 Action
 Adventure
 Thriller
Run Time:131 mins.
Rating:8.0


Title:Southpaw
Year:(2015)
Genres:
 Action
 Drama
 Sport
 Thriller
Run Time:124 mins.
Rating:7.9


Title:Ant-Man
Year:(2015)
Genres:
 Action
 Adventure
 Sci-Fi
Run Time:117 mins.
Rating:7.8
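As one possible extension, here is a minimal sketch that collects the same fields into a CSV file instead of printing them. The file name movies.csv and the guards for missing fields are my own additions, not part of the original script:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = urlopen("http://www.imdb.com/search/title?release_date=2005,2015&title_type=feature&user_rating=1.0,10")
soup = BeautifulSoup(url.read(), 'html.parser')
url.close()

with open("movies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "year", "genres", "runtime", "rating"])
    #every movie's details live in a <td class="title">, so search the whole page for them
    for titletd in soup.find_all('td', {"class": "title"}):
        #find() returns None when a field is missing, so guard each lookup
        year = titletd.find(class_="year_type")
        genre = titletd.find(class_="genre")
        runtime = titletd.find(class_="runtime")
        rating = titletd.find(class_="rating-rating")
        writer.writerow([
            titletd.a.string,
            year.string if year else "",
            ", ".join(a.string for a in genre.find_all('a')) if genre else "",
            runtime.string if runtime else "",
            rating.find(class_="value").string if rating else "",
        ])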


Wednesday, July 8, 2015

More on semantic wiki...


In my previous post, titled "Semantics into wiki", I discussed the major concepts behind the "semantic wiki". Today I am going to dive a bit deeper into it.
A semantic wiki has the following basic features. First, it is still a wiki, with regular wiki features such as categories/tags, namespaces, titles, versioning, etc. Its articles have typed content (built-in types plus user-created ones, e.g. categories), and the types can be Page/Card, Date, Number, URL/Email, String, etc. The articles are connected with typed links (e.g. properties) such as "capital_of", "contains", "born_in". Some semantic wikis have querying interface support too.

Annotations are used in a semantic wiki to make information more explicit, and this is actually the most important feature of semantic wikis. These annotations have a specific markup syntax, which is used to edit or add articles in the wiki. The syntax may differ between semantic wikis, and in this article I am focusing on Semantic MediaWiki. Categories, typed links and attributes are some of these annotations; "category" is one type that already exists in normal Wikipedia too.

Typed links are used instead of regular hyperlinks; here a hyperlink has a type. Links are arguably the most basic and also most relevant markup within a wiki, and their syntactic representation is ubiquitous in the source of any Wikipedia article. Semantic MediaWiki allows users to create new typed links freely, as they prefer. Existing link types should be used wherever applicable, but a new type can also be created simply by using it in a link. A typed link can be a property of the current article, and the syntax for inserting a property is,
[[Property::Value | Display]]

For example, [[is capital of::England]]. Here the property is "is capital of", and it is linked to the article named "England", which is the value. The display part is optional; there we can specify text to show in the article other than the value itself.
 
Data values play a crucial role within an encyclopaedia, and machine access to this data yields numerous additional applications. These are called attributes, and they have the common syntax
[[ attribute_name := value]]
in Semantic MediaWiki. E.g.: [[ population := 7,421,328 ]]

There can be a unit for an attribute value, e.g. [[area:=609 square miles]]. When several units exist for the same kind of value, the system provides automatic conversion of a value into the other units. To allow users to declare the data type of an attribute, a new namespace "Attribute:" is introduced, containing articles on attributes. Within these articles, one can provide human-readable descriptions as in the case of relations and categories, but one can also add semantic information that specifies the data type. Using a relation with built-in semantics, we can simply write
[[hasType::Type:integer]]
to denote that an attribute has this type.

Beyond these, there are many more semantic wiki syntax types, and it is important to learn them if we want to add a new article to a semantic wiki or edit an existing one.

Advanced querying and searching are among the most important features of a semantic wiki. There is a feature to search for a property by its value, called "Page property search": if we insert the property "Located in" and the value "England", all the cities/regions located in England are listed via this advanced searching option. For advanced querying too there is a nice interface. For example, if we give [[Category:City]][[located in::Germany]] in the Query field and ?Population in the Additional data to display field, a list of all the cities in Germany will be displayed with their population values. Following is the interface available for querying.

The results of the query are displayed as follows.
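The same query can also be embedded inline in a wiki page using Semantic MediaWiki's #ask parser function; a minimal sketch (the exact result formatting depends on the wiki's configuration):

{{#ask: [[Category:City]] [[Located in::Germany]]
 | ?Population
}}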
 

Following is the basic architecture of Semantic MediaWiki.

There are more applications of semantic wikis, such as:
  • Desktop applications
      o   the AmaroK media player
      o   a movie reviewer
      o   portals that aggregate data from various data sources (newsfeeds, blogs, online services)
  • Enhanced folksonomies
  • Creating domain ontologies
  • Creation of multilingual dictionaries
  • New research opportunities

In this way, it is obvious that the semantic wiki concept is going to be a very interesting and valuable concept for the whole world, even though it is currently not very developed or popular.


Wednesday, July 1, 2015

Create a git BitBucket/GitHub repository from an already existing local project



  • This post is originally targeted at BitBucket repositories, but the basic steps are common to GitHub too.
  • This is a simple issue, but it could take hours if you follow the official Bitbucket instructions. :P So I am posting this.
  • The case is that we have a project on our PC and it is almost complete (or partially done), and now we want to put it in a git repository and push it to a Bitbucket repository. So note that we do not yet have a BitBucket/GitHub repository for our project, nor a local git repo on our PC. (But of course we have a Bitbucket/GitHub account :) )
  • So, believe me.. follow these simple steps.
Pre-requisite: You should have git installed on your PC (PC = Personal Computer, simply your computer).

1) Create a repository in Bitbucket.org / Github.com to contain our project.

  • For this, just click the Repositories tab > Create New Repository (or simply follow this link: https://bitbucket.org/repo/create)
  • Fill in the details as you want; suppose we set the name to "TestingGitRepo"
  • Click "Create Repository"

Now you are done creating the Bitbucket repo.
(If you are dealing with GitHub instead of Bitbucket, create the repository in the default way and remember not to tick "Initialize this repository with a README", because that will make the initial commit a bit harder later.)

Just after creating the repo you will be redirected to a page as below.

Click "I have an existing project" and copy the command displayed below.
(It will be easier if you copy it now.)

The command we copied is,

git remote add origin https://Samitha@bitbucket.org/Samitha/testinggitrepo.git

(If you are using GitHub, a similar command starting with "git remote add origin" will be displayed on the page following the creation of the repo; just copy it.)
2) Open a command prompt on your PC and navigate into the directory that you want to become the repository.
   For example, if you go into the directory "G:\AndroidStudioWorkspace", the contents of that directory will be sent to the Bitbucket/GitHub repository you created.

3) Enter,
       git init

    This will initialize this directory as a git repo.
4) Now paste the command we copied and press enter.
  git remote add origin https://Samitha@bitbucket.org/Samitha/testinggitrepo.git

5) Enter,  
git add --all

This stages all the files and folders in this directory to be committed to the git repository.

6)Now you have to make the initial commit. So enter,  
git commit -m "Initial Commit"

At this point, git will sometimes give an error message as follows, if you are using git on your PC for the first time and so have not yet configured your identity (name and email) with git.



If you get this error, just do what it asks.
Enter,
git config --global user.email "rmschathuranga@gmail.com"
git config --global user.name "Samitha"

Note that you have to use your own email address and Bitbucket user name instead of rmschathuranga@gmail.com and Samitha (which are MINE)..!!!
7) Now enter,  
git push -u origin master
Enter the password of your Bitbucket/GitHub account when prompted. Then all the files and folders in your local repo will be pushed (uploaded) to the Bitbucket repo, creating a new branch with the name "master". You will see messages as below.



     And that's all. You have done it.
Go check the Bitbucket/GitHub repository you created; your project has been successfully uploaded, and the repository is ready.

Important Notes:


  • Whenever you make changes in your local project files and want to push the changes to the remote Bitbucket/GitHub repository, just follow steps 5, 6 and 7 above.
  • Note that if a directory is empty, it will not be added to git (nor to the remote repo).
  • For a deeper clarification: git doesn't really ignore empty directories; it typically ignores all directories. In git, directories exist only implicitly, through their contents. Empty directories have no contents, and therefore they don't exist in git repositories. (A common workaround, if you do want an empty directory in the repo, is to place a placeholder file, conventionally named .gitkeep, inside it.)


--------------------------------------------------------------------------------------------------------------

For extra knowledge
---------------------------------------------------------------------------------------------------------------
git add command

For extra knowledge I would like to go deep on git add command.

git add has a number of options for various requirements. The following tables (extracted from http://certificationquestions.com/version-control-system/git/difference-git-add-git-add-git-add-u/ ) clearly show the differences between them. Note that,
git add -A  = git add --all

You can find your git version with the command git version.

For git versions 1.x:
  • git add .     stages new and modified files, but not deleted files
  • git add -u    stages modified and deleted files, but not new files
  • git add -A    stages all changes (new, modified and deleted files)

For git version 2.x, the behaviour is the same except that git add . now also stages deleted files within the current path, making it equivalent to git add -A for that path.

So my recommendation is to use git add --all (which is the same as git add -A), as it covers the most common and general requirement.

And here I am highlighting the difference between
           git add . and git add --all
in git version 1.x, which most of us use now. It is that git add --all stages all the changes to the repository, while git add . does not stage deleted files. This means that if you had deleted a file in your local repository and you want it to be deleted from the remote repository too, you should do git add --all; git add . would not remove that file from the remote repository.

Comments and suggestions are highly appreciated, if you found this post useful..!!! :-)

Tuesday, June 23, 2015

Semantic Wiki - Semantics into Wiki


This is my first blog post, and I am geared up to give you folks an abstract idea of, and a motivation for, the semantic wiki, which is a much needed concept in the world of information and knowledge, but still in its infancy.


This concept is rather young, with a history of not more than 15 years: first proposed in the early 2000s and implemented seriously around 2005. It is essentially the concept of the semantic web (simply put, a broader concept and knowledge area related to storing the meaning of data with ontologies) injected into normal wikis, so that the textual information in the wiki carries meaning, more meaningful linkages between pages are constructed, and the information in the wiki can be queried as in a database through semantic queries.

What is wiki?

Before jumping into the ocean directly, let us clarify what a wiki is. Wikipedia itself, which is a wiki, defines "wiki" as "an application, typically a web application, which allows collaborative modification, extension, or deletion of its content and structure". The word "collaborative" is the most important fact here: a wiki article has no single defined owner or editor.

It is nourished by the knowledge of millions of people all over the world. The word "wiki" is actually a Hawaiian word meaning "quick". The following wiki principles clearly define a wiki as it is.

1.  Wikis allow anyone to edit
2.  Easy to use and do not require additional software
3.  Content is easy to link
4.  Support versioning of all changes
5.  Support all media

MediaWiki is the world's top wiki engine, amongst others such as MoinMoin, PhpWiki, XWiki and OddMuseWiki. These wiki engines all do the same implied task of acting as an ENGINE for a wiki, in the sense that wikis are built on them.

What is Wikipedia?

Wikipedia is the world’s most popular wiki and it is based on the wiki engine MediaWiki. Term “pedia” carries the meaning of “encyclopedia” and  wiki+edia sums up as a ‘quick encyclopedia’. MediaWiki was developed by the Wikipedia community. Wikipedia is availableon the web under a free licence. This Wikipedia was created by Jimbo Wales and Larry Sanger in January 2001. Wikipedia is having articles in 287 languages/editions and English is the largest. It has over 4.7 million articles in English in the Wikipedia.

Further Requirements Emerge

A wiki is undoubtedly a valuable source of universal data, information and knowledge, though some limitations become obvious when comparing it to other modern-world data stores: databases, big data, etc. The biggest issue is how to query the information as you like. How can you extract specific information, filtered by certain parameters, from Wikipedia?

For example, we know Wikipedia has articles about all cities, their populations, their mayors, their skyscrapers, etc. So can you ask Wikipedia for a list of the world's 5 largest cities with a female mayor? Or the skyscrapers in Shanghai with 50+ floors, built after 2000? Certainly not. Such queries are not supported by Wikipedia beyond a simple text search. So, to address this issue, the semantic web developers implemented their concept in wikis.

What is Semantic Wiki?

Semantic wikis are said to combine the strengths of both the semantic web and the wiki. The semantic web is machine processable, consists of integrated data, and supports complex queries. A wiki, in a nutshell, is easy to use, contribute to and collaborate on, and is strongly interconnected. So the cream of both is in the semantic wiki. Examples of semantic wikis are Acetic, ArtificialMemory, Wagn, Knoodl, KiWi, OntoWiki and Semantic MediaWiki. Semantic MediaWiki is the best-known semantic wiki software, and the only one with significant usage on public websites; it is an extension to MediaWiki that turns it into a semantic wiki.

To clarify further, suppose that in a normal wiki there is an article about London (refer to the following figure). It has hyperlinks to several other articles, such as England, United Kingdom and New York City. But if we apply semantics to it, these links gain meaning: i.e. London and England are connected as London "is capital of" England. "is capital of" is the relation (link type) which links the "London" page with the "England" page; it is actually a property of London. So even very simple search algorithms would then suffice to provide a precise answer to the question "What is the capital of England?"



Two Perspectives

There are two significant perspectives on the semantic wiki: wikis for metadata, and metadata for wikis.
  • Wikis for metadata
              If a semantic wiki is successfully and completely created, it can be used to create metadata and ontologies for the use of the semantic web.
  • Metadata for wikis
              Even though there are huge amounts of digital content (e.g. Wikipedia) with strong connection of content via hyperlinks, creating metadata from them is extremely time consuming and therefore difficult.

Strengths of Semantic Wikis



According to the above diagram, we can see that semantic wikis have all the strengths of the semantic web, the metaweb, the web and social software, and that they have the highest degree of both social connectivity and information connectivity. So doesn't this itself give us a hint to learn, use, develop and support this edgy technology?

...and mmm... that's it for now guys.. My next post will make you dive a bit deeper into the basic concepts of semantic wikis. Be hopeful.. ;-)

Hope you learned something. Comments and suggestions are highly appreciated and please share this with your friends, if you found this useful. Cheers...!

Read my next post More on Semantic Wiki...