A step-by-step presentation of web scraping and EDA
In the analysis below, I am going to analyze a dataset of laptops available on Flipkart.com. The analysis will help users understand the various features of a laptop and choose the best product within their budget.
First, we are going to scrape the data from Flipkart.com, then parse it and store it in a usable format using BeautifulSoup. In the next step, the data will be analyzed and cleaned up using Pandas. In the third and final stage, the sample data will be explored with visualization tools such as Matplotlib and Seaborn. The resulting graphs will help a user buy a laptop with particular features within a specified budget.
About the Dataset:
The dataset consists of eight columns: five columns describing features of the laptops, one column for the product name, one for the price, and one for the product's rating. The dataset consists of 120 rows. This is a fair sample, randomly drawn from the first seven pages of search results on Flipkart.com.
What is EDA?
Exploratory Data Analysis, or EDA, is the process of analyzing data with various statistical techniques to uncover hidden patterns and insights in a dataset. In the analysis below, I shall use various visualization methods to present a meaningful comparison and evaluation of the different features of laptops available in the market.
About the process:
I have segregated my work into three major processes. In the first process, I shall scrape and store the data; in the second, I shall clean the data; and in the third and final process, I shall create comparisons and visualizations using different graphs.
First of all, we are going to install the fake_useragent library, which supplies randomized, browser-like User-Agent strings so our requests look like they come from an ordinary browser. This makes the server less likely to block us and lets us render more pages and content.
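The original does not show the install command; assuming pip, the libraries used throughout this walkthrough can be installed in one go (package names as published on PyPI):

```shell
pip install fake-useragent requests beautifulsoup4 pandas matplotlib seaborn
```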
Importing all the necessary libraries in Python:
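The original import cell is not reproduced here; a minimal version for this workflow might look like the following. The fake_useragent import is guarded, since the package occasionally fails to refresh its User-Agent database, and the fallback header string is an assumption of mine:

```python
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup

try:
    # Supplies randomized, realistic User-Agent strings for our requests.
    from fake_useragent import UserAgent
    USER_AGENT = UserAgent().random
except Exception:
    # Fall back to a fixed browser-like header if the package is unavailable.
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
```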
Creating empty lists to store the data:
I have created an empty list for each of the features required from the scraped data for further analysis.
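The article does not name its columns exactly; assuming the five feature columns are processor, RAM, operating system, storage capacity, and display, the empty lists might be set up like this:

```python
# One list per column of the final dataset; every scraped product appends
# exactly one value to each list, keeping the lists row-aligned.
products = []   # product name
prices   = []   # listed price (string, cleaned later)
ratings  = []   # average user rating
cpus     = []   # processor type
rams     = []   # RAM size/type
os_list  = []   # operating system
hd_caps  = []   # hard-disk / SSD capacity
displays = []   # display size
```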
Scrape and parse the data:
I have put the script in a nested loop that corresponds to the page structure of Flipkart.com. The first section of the script renders the first seven pages of Flipkart.com and parses each rendered page into a BeautifulSoup object. The next section searches the parsed page for the required data and stores it in different variables, and the last section fetches the required values from a list of items and stores them in different variables.
Next, we append the data stored in these variables to the empty lists. When the loop ends, execution moves on to building the data frame, and the lists are arranged in the column order specified for it. The print statement gives us a count of the number of rows fetched and stored in our lists.
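The structure of that nested loop can be sketched as follows. The CSS class names below are illustrative placeholders, not Flipkart's real ones (which change frequently and must be read off the live page), and the network call is shown commented out:

```python
from bs4 import BeautifulSoup

def parse_page(html):
    """Extract (name, price, rating, specs) tuples from one result page.

    The class names used here are hypothetical stand-ins for the ones
    read from the live Flipkart page at scraping time.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="product-card"):
        name   = card.find("div", class_="product-name").get_text(strip=True)
        price  = card.find("div", class_="product-price").get_text(strip=True)
        rating = card.find("div", class_="product-rating").get_text(strip=True)
        # Feature bullets (CPU, RAM, OS, storage, display) arrive as a list.
        specs  = [li.get_text(strip=True) for li in card.find_all("li")]
        rows.append((name, price, rating, specs))
    return rows

# Outer loop over the first seven result pages:
# import requests
# for page in range(1, 8):
#     url = f"https://www.flipkart.com/search?q=laptop&page={page}"
#     html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
#     for name, price, rating, specs in parse_page(html):
#         products.append(name); prices.append(price); ratings.append(rating)
```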
Here is a first glimpse of the scraped data:
Here is a first glimpse of our ordered data. We shall process it further in the next few steps to fine-tune it according to our requirements.
In this stage, we are going to clean the data with the help of the Pandas library. I shall inspect every aspect of the data against the requirements. Let's first check whether any data has landed in a different column than the one specified.
Check wrongly positioned data:
We can see that the data in rows 64 and 95 is wrongly positioned, and we need to remove those rows, which would otherwise be misleading in our representation.
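One simple way to find such rows (a sketch on toy data, since the scraped frame itself is not reproduced here): a valid entry in the Price column should start with the Rupee symbol, so any row that does not is misaligned.

```python
import pandas as pd

# Toy frame standing in for the scraped data; in the article the bad
# rows sat at indices 64 and 95.
df = pd.DataFrame({
    "Product": ["Laptop A", "Laptop B", "Laptop C"],
    "Price":   ["₹45,990", "4.2", "₹61,990"],   # row 1 holds a rating
    "Ratings": ["4.3", "₹38,990", "4.5"],       # ...and a price
})

misplaced = df[~df["Price"].str.startswith("₹")]
print(misplaced.index.tolist())  # → [1]
```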
Drop rows with wrongly positioned data:
Here we are dropping all those wrongly positioned rows of data.
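The drop itself is a one-liner. Again on toy data (in the article the offending indices were 64 and 95), and reindexing afterwards so the row labels stay contiguous:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop A", "Laptop B", "Laptop C", "Laptop D"],
    "Price":   ["₹45,990", "4.2", "₹61,990", "3.9"],
})

bad_rows = df.index[~df["Price"].str.startswith("₹")]   # indices 1 and 3 here
df = df.drop(index=bad_rows).reset_index(drop=True)
print(df["Product"].tolist())  # → ['Laptop A', 'Laptop C']
```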
Cleaning unnecessary characters using regex:
The ‘Price’ column contains the Rupee symbol and comma separators. I have removed both using a regular expression.
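A sketch of that clean-up: strip every non-digit character (the Rupee sign and the commas) from the Price strings with a single vectorized regex replace:

```python
import pandas as pd

prices = pd.Series(["₹61,990", "₹45,490", "₹1,09,990"])
cleaned = prices.str.replace(r"[^0-9]", "", regex=True)  # keep digits only
print(cleaned.tolist())  # → ['61990', '45490', '109990']
```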
Changing the data types of the Price and Ratings columns to float:
During collection, all the data was stored as strings in our lists. Now I change the data types of the Price and Ratings columns so that we can perform analysis on numerical data.
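On toy data, the conversion is a plain `astype` call once the strings contain only digits and decimal points:

```python
import pandas as pd

df = pd.DataFrame({"Price": ["61990", "45490"], "Ratings": ["4.3", "4.5"]})

# Convert the string columns to float so numeric operations work.
df["Price"] = df["Price"].astype(float)
df["Ratings"] = df["Ratings"].astype(float)
print(df["Price"].mean())  # → 53740.0
```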
Checking null values and NaN values:
The images above show that there are null values in our dataset. In the next cell, we remove all rows containing null and NaN values to make the dataset more meaningful.
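A minimal sketch of the check-and-drop, on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price":   [61990.0, np.nan, 45490.0],
                   "Ratings": [4.3, 4.5, np.nan]})

print(df.isnull().sum())                  # missing values per column
df = df.dropna().reset_index(drop=True)   # keep only fully populated rows
print(len(df))  # → 1
```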
Saving cleaned data in CSV format:
Here we save the cleaned and processed data in CSV format.
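The save is a single `to_csv` call; the filename below is an assumption, and `index=False` keeps the row index out of the file:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Laptop A"], "Price": [61990.0], "Ratings": [4.3]})
df.to_csv("laptops_cleaned.csv", index=False)

# Round-trip check: the file reads back with the same shape.
restored = pd.read_csv("laptops_cleaned.csv")
```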
Final check on the Dataset:
Now, the dataset looks all good and ready for further analysis.
Information about the dataset:
The images above describe our dataset. Now, I shall proceed to make different visualizations based on the cleaned and processed data.
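The summary shown in those images likely comes from the usual Pandas calls, sketched here on toy data:

```python
import pandas as pd

df = pd.DataFrame({"Price": [61990.0, 45490.0], "Ratings": [4.3, 4.5]})

df.info()              # column dtypes and non-null counts
stats = df.describe()  # count, mean, std, min, quartiles, max
print(stats.loc["mean", "Price"])  # → 53740.0
```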
Histogram showing price and ratings:
The histogram for price shows that most laptops available lie in the 40k–70k price range, in line with market demand. The mean price of all available laptops is INR 61,707.
Similarly, the histogram for ratings shows that most user ratings fall between 4.2 and 4.5. The minimum rating received is 3.6 and the maximum is 5.
KDE plots showing price and ratings:
The KDE plots above convey similar observations: the density is highest around 50k in the price plot and around 4.3 in the ratings plot.
Joint plot showing the distribution of ratings over price category of laptops:
The joint plot above shows the spread of ratings over the price range of laptops. A close look at the trend shows that most ratings are for laptops in the 30k–80k price range. This implies that most purchased laptops fall in that range and offer good value for money, and suggests that a user should not go beyond the 80k price range unless they have a particular reason.
Comparison of various features of laptops vs. Price:
The boxplots above show which features of laptops are available in each price range. Keeping the same price range on the Y-axis in all four plots lets a user compare different price segments against different features.
The CPU vs. Price plot clearly shows that a range of processor options is available between 45k and 70k. Similarly, the RAM vs. Price plot suggests that most RAM configurations are available between 45k and 80k. The HD Capacity vs. Price and Display vs. Price plots show similar trends.
On the other hand, a user with a lower or higher budget than the average can choose from the features shown in the plots above.
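The four-panel layout with a shared price axis can be sketched like this; the feature column names and toy values are my assumptions, not the article's exact data:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Price":       [45990.0, 52990.0, 61990.0, 69990.0, 79990.0, 49990.0],
    "CPU":         ["i3", "i5", "i5", "i7", "i7", "Ryzen 5"],
    "RAM":         ["8 GB", "8 GB", "16 GB", "16 GB", "16 GB", "8 GB"],
    "HD Capacity": ["512 GB", "512 GB", "512 GB", "1 TB", "1 TB", "256 GB"],
    "Display":     ['14"', '15.6"', '15.6"', '15.6"', '16"', '14"'],
})

# sharey=True keeps the same price axis across all four feature plots.
fig, axes = plt.subplots(2, 2, figsize=(14, 10), sharey=True)
for ax, col in zip(axes.flat, ["CPU", "RAM", "HD Capacity", "Display"]):
    sns.boxplot(data=df, x=col, y="Price", ax=ax)
fig.savefig("feature_boxplots.png")
```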
A brief idea about Operating System vs. Price:
The bar graph above shows that the operating system does not determine the price trend: both the cheapest and the most expensive laptops ship with the Windows 10 operating system.
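A bar-plot sketch on toy data (by default Seaborn's `barplot` shows the mean price per operating system):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"OS":    ["Windows 10", "Windows 10", "DOS", "Windows 10"],
                   "Price": [24990.0, 145990.0, 31990.0, 61990.0]})

fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(data=df, x="OS", y="Price", ax=ax)  # mean price per OS
fig.savefig("os_vs_price.png")
```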
The scraped data is a fair sample, randomly collected from Flipkart.com. The analysis above gives a realistic snapshot of the laptops available in the market at the time of scraping, and the pictorial representation of the data gives a user in any market segment a detailed, clear picture.
I scraped Flipkart.com to complete this project as part of my Data Science course at Datafolkz.co.in.