Data Analysis of IMDB Data
[youtube https://www.youtube.com/watch?v=mS3dzczv1ZQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&start=1&wmode=transparent]
We all are surrounded by data and it reveals lot of things to us to make our decisions and recommends the next steps. Data is collected from different sources such as Web, Database, log files etc. and then it is thoroughly cleaned and reshaped, and further used for analysis and explored to determine the hidden patterns and trends which is really essential for any business decision making, Extracting data from web is always easy with the help of API’s but what if website doesn’t provide any API’s, In such case, Web Scraping is an excellent way to extract the unstructured data from web and put that in structured format like excel,csv, database etc..
Web Scraping with Selenium Webdriver
- if there is any content on the page rendered by javascript then Selenium webdriver wait for the entire page to load before crwaling whereas other libs like BeautifulSoup,Scrapy and Requests works only on static pages.
- Any browser actions can be done with the help of Selenium webdriver, if there is any content on the page displayed by on button click or Scrolling or Page Navigation
Using IPython Notebook
Some Imports
Open Page in Chrome Browser
Python Function to Scrap Data using Selenium Webdriver
Just provide the address(Xpath or any other locator) of the data to be extracted and Selenium webdriver extracts all the data from the page just with the help of one api(find_element_by_xpath), See how easy it is. Isn’t it?
Data is extracted from the page with the help of webdriver and is stored in a list, So individual list for all the following data is created:
- Movie Name
- Release Year
- IMDB Rating
- Votes
- Director
- Genre
How does Extracted data looks like?
Sample Data set for Movie Name, Votes and Director is displayed here, Rest of the data is also stored in individual python list
**_[‘Bajirao Mastani’, ‘Queen’, ‘Bhaag Milkha Bhaag’, ‘Barfi!’, ‘Zindagi Na Milegi Dobara’]_**
**_[‘17,362’, ‘39,518’, ‘39,731’, ‘52,308’, ‘41,731’]_**
**_[‘Director: Sanjay Leela Bhansali’, ‘Director: Vikas Bahl’, ‘Director: Anurag Basu’]_**
Don’t see any co-relation between these data, if someone have to pick the release year and director for a movie then it’s difficult to get it from these lists, So let’s put the data in a Structured and more meaningful format which will make sense for someone looking at this data. So lets bind the data in Python dictionary
Python Dictionary
Structured Data
{
“Director”: “Director: Sanjay Leela Bhansali”,
“Votes”: “17,362”,
“RunTime”: “A historical … (158 mins.)”,
“Year”: 2015,
“Genre”: “Drama”,
“Movie Name”: “Bajirao Mastani”,
“Rating”: “7.2”
}
This Data in python dictionary(Key:value pair) looks good and make more sense now, However if you look carefully the data is not in correct format for data manipulation, Votes value contains comma, Director contains unwanted text “Director:” and Ratings and Runtime are not in correct data type. Lets Clean this data to bring it in shape for performing analysis
Data Cleansing
{
“Director”: “Sanjay Leela Bhansali”,
“Votes”: 17362,
“RunTime”: 158,
“Year”: 2015,
“Genre”: “Drama”,
“Movie Name”: “Bajirao Mastani”,
“Rating”: 7.2
}
Data in Pandas Dataframe
The entire movie data is stored in python dictionary but for doing further analysis this data needs to be consumed by Pandas Dataframe so that by using Pandas rich data structures and built-in function we can do some analysis on this data. Import data in Dataframe.
Records with Missing Values
There are some missing values in this data, But Pandas provides excellent feature to handle missing and null values. So for these 3 movies RunTime data is not available on the page. so for further analysis we will replace this missing data with the mean value of the available data for RunTimeColumn
Replace Missing Values with Mean
Movies with Highest Ratings
Top five movies since 1955
Ratings Trending Graph
Movies with Lowest Ratings
Movies with Maximum Run Time
Top 10 movies
RunTime Trending Graph
Average Movie RunTime
Movies IMDB Ratings
Movies with rating Greater than 7
Ratings Visualization using Bar Graph
Percentage distribution
Best Movies By Genre
Directors won more than Once
Conclusion:
Movies most likely to be selcted for Best Picture
- Rating greater than 7
- Run time more than 2hrs
- Category Drama and Musical