Indian TV Serials: A Data Analysis Project
A multi-faceted data analysis project revolving around the Indian TV Industry by Subh Chaturvedi.
TV Serials and family dramas have a special place in every Indian’s heart. Nothing can ever replace the iconic “Dhum Ta Terenana” score that amplifies the tension in the air or the “Saas Bahu” dramatic tropes introduced into the Indian Entertainment Industry by these TV Serials.
From classics like “Kyunki Saas Bhi Kabhi Bahu Thi” and “Sasural Simar Ka” to modern entries like “Shark Tank”, this industry and its culture are ever-evolving and uniquely creative.
It's only fitting, then, that when I found a dataset about Hindi TV Serials, I immediately decided to analyze it and draw some interesting insights from it.
Let us start by looking at the dataset I am going to use for this analysis project. The dataset, titled “Hindi TV Serials”, contains over 700 entries, each with the name of the serial, its cast, its IMDB rating and an overview.
It contains all the TV Serials aired on the following channels from 1988 to the present day (May 2022):
Sab TV
Sony TV
Colors TV
StarPlus
Zee TV
Technically, the dataset is distributed as a CSV file (181.76 kB) and has 736 unique values spread over the following columns:
Name
Ratings
genres
overview
Year
Cast
Example Values from the Dataset

Name: Kyunki Saas Bhi Kabhi Bahu Thi
Ratings: 1.6
genres: Comedy, Drama, Family
overview: A mother-in-law's struggle to put up with her three bahu's. The three bahu's have grown up sons. The bahu's sons start to get involved with having girlfriends and the bahu's try and break their relationships up.
Year: 2000–2008
Cast: Smriti Malhotra-Irani ,Ronit Roy ,Amar Upadhyay ,Sudha Shivpuri

Name: Kahaani Ghar Ghar Kii
Ratings: 2.1
genres: Drama
overview: The show explored the worlds of its protagonists Parvati Aggarwal and Om Aggarwal, who live in a joint family where by Parvati is an ideal daughter-in-law of Aggarwal family and Om the ideal son.
Year: 2000–2008
Cast: Sakshi Tanwar ,Kiran Karmarkar ,Mita Vashisht ,Ali Asgar
I will be analysing the relationships and insights that each column provides once it is properly cleaned and arranged.
I start with importing the necessary modules for this project:
pandas
numpy
matplotlib
Then the dataset is imported into the environment through the pandas read_csv() function.
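As a rough sketch of this loading step: the snippet below reads a one-row stand-in for the CSV from memory so that it runs on its own. The file name in the comment and the placeholder overview text are assumptions, not taken from the original project.

```python
import io

import pandas as pd

# One-row stand-in for the real CSV so the sketch is self-contained.
# With the actual file this would be pd.read_csv("hindi_tv_serials.csv"),
# where the file name is a hypothetical placeholder.
sample_csv = io.StringIO(
    "Name,Ratings,genres,overview,Year,Cast\n"
    'Kahaani Ghar Ghar Kii,2.1,Drama,"Placeholder overview text.",2000–2008,'
    '"Sakshi Tanwar ,Kiran Karmarkar ,Mita Vashisht ,Ali Asgar"\n'
)

df = pd.read_csv(sample_csv)

# Quick sanity checks on what was loaded.
print(df.columns.tolist())
print(df.shape)
```

Printing the column names and shape right after loading is a cheap way to confirm that the quoted, comma-containing fields (genres, Cast) were parsed as single cells.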
The IMDB ratings are going to be very important throughout this analysis as a way to judge the quality and popularity of a TV Show whenever applicable.
But before we dive into how other parameters relate to and affect the IMDB rating of a show, let us look at these ratings on their own.
We use the sort_values() function to list the top shows according to their IMDB ratings.
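A minimal sketch of that sorting step, using made-up show names and ratings rather than the real dataset:

```python
import pandas as pd

# Made-up names and ratings, purely for illustration.
df = pd.DataFrame(
    {
        "Name": ["Show A", "Show B", "Show C", "Show D", "Show E", "Show F"],
        "Ratings": [7.1, 9.0, 5.5, 8.2, 6.4, 8.9],
    }
)

# Highest rating first, then keep the first five rows.
top5 = df.sort_values("Ratings", ascending=False).head(5)
print(top5[["Name", "Ratings"]])
```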
Output:
As is clearly discernible, the top 5 shows according to their ratings are:
Mitegi Laxman Rekha (9.7)
Shobha Somnath Ki (9.4)
Love U Zindagi (9.4)
Wagle Ki Duniya (9.2)
Jagannath Aur Purvi ki Dosti Anokhi (9.2)
Well, I am not sure I agree with these results, but if you say so, IMDB, if you say so...
Analyzing the cast column can provide some interesting statistics to look at, but there is a serious problem that limits us from using it to any useful extent.
The problem is the format in which these values are stored in the dataset.
For example, take the value of the "Cast" column in the row for Shobha Somnath Ki:
Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh Mathur ,Sandeep Arora
This value is troublesome because it is stored as a single str object, so it is not possible to calculate or discern anything for individual cast members.
Thankfully, as elaborated by Max Hilsdorf in his Medium blog, the string in the cell can be converted into a list, and the resulting column of lists can then be flattened into a one-dimensional Series on which functions like value_counts() and groupby() work.
But his solution does not apply to our problem without extensive modifications, as the values we wish to convert do not already use list syntax. Therefore we need to convert each cell in the Cast column into a value written in list syntax, i.e. ["a","b","c",...].
We can implement this by writing a function that takes input in the format we have, adds the square brackets and the quotation marks, and returns it in the format we need. This is my implementation of such a function:
This function also takes care to handle and replace any disruptive data. I mainly encountered some float values, which threw errors as they could not be treated like strings (pandas reads empty cells as NaN, which is a float).
After applying this function and then Python's eval() function, we have the required list values.
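The original function is not reproduced here, so the following is only a sketch of the idea. The helper name to_list_syntax is hypothetical, and ast.literal_eval is swapped in for eval() as a stricter parser that only accepts literals.

```python
import ast

import pandas as pd


def to_list_syntax(cell):
    """Rewrite 'a ,b ,c' as the string "['a', 'b', 'c']" (hypothetical helper)."""
    if not isinstance(cell, str):  # e.g. a float NaN where the cast is missing
        return "[]"
    names = [name.strip() for name in cell.split(",") if name.strip()]
    # repr() adds the quotation marks and escapes any quotes inside a name.
    return "[" + ", ".join(repr(name) for name in names) + "]"


cast = pd.Series(
    [
        "Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh Mathur ,Sandeep Arora",
        float("nan"),
    ]
)

# First build list-syntax strings, then parse them into real Python lists.
cast_lists = cast.apply(to_list_syntax).apply(ast.literal_eval)
print(cast_lists[0])
```

Going through a list-syntax string mirrors the approach described above; splitting on commas alone would already give a list, so the round trip is a design choice rather than a requirement.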
Before proceeding, we also need a function that flattens this column of lists (effectively 2D) into a 1D Series. For that we will use:
Now that we can use the Cast data properly, let's find out which artist has the best average IMDB rating across the shows they worked in.
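A flattening helper along these lines would do the job. The name to_1D and the placeholder cast values are assumptions for illustration.

```python
import pandas as pd


def to_1D(series):
    """Flatten a Series of lists into a Series of individual items."""
    return pd.Series([item for sublist in series for item in sublist])


# Placeholder cast lists; an empty list simply contributes nothing.
cast_lists = pd.Series([["A", "B"], ["B", "C"], []])
flat = to_1D(cast_lists)
print(flat.value_counts())
```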
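One way to sketch this calculation is with pandas' explode(), which repeats a show's row once per cast member so that groupby() can average per artist. This is an alternative to the flattening helper, not necessarily the code used originally, and the names and ratings are made up.

```python
import pandas as pd

# Made-up shows, ratings and cast lists, purely for illustration.
df = pd.DataFrame(
    {
        "Name": ["Show A", "Show B", "Show C"],
        "Ratings": [8.0, 6.0, 9.0],
        "Cast": [["Actor X", "Actor Y"], ["Actor X"], ["Actor Z"]],
    }
)

# One row per (show, artist) pair, then the mean rating per artist.
per_artist = (
    df.explode("Cast")
    .groupby("Cast")["Ratings"]
    .mean()
    .sort_values(ascending=False)
)
print(per_artist)
```

Note that an artist with a single highly rated show can top such a ranking, so it may be worth filtering on a minimum number of shows.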
Output:
The artist with the best mean IMDB rating across his shows is Tushar Khanna. He has worked in "Pyaar Tune Kia Kya", "Piyaa Albela" and "Bekaboo".
This, however, does not necessarily reflect any superiority in acting or talent, but it may show (at least to people who believe in it) some sign of the luck an artist brings to a set.
Now let us move to a more concrete relation: which actor has worked in the most TV shows. It should be noted that this dataset only lists the leading cast members in the Cast column, so artists with minor roles are not properly recognised in this analysis.
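Counting appearances could look roughly like this, again with placeholder names:

```python
import pandas as pd

# Placeholder cast lists, one list per show.
cast_lists = pd.Series(
    [["Actor X", "Actor Y"], ["Actor X"], ["Actor X", "Actor Z"]]
)

# explode() puts every name on its own row; value_counts() tallies them.
show_counts = cast_lists.explode().value_counts()
print(show_counts.head(10))
```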
Output:
Ronit Roy, having worked in 9 shows, comes out as the most experienced artist in this dataset. No wonder I see him in every other serious father type role.
It's either comedy (the family kind) or drama (also the family kind) with Indian TV Serials. But don't take my word for it, let us see for ourselves the genre dynamics of Indian TV.
Genres face the same problem as we faced above with artists, so the same conversion is applied, with a small edit to strip whitespace so that the same genre is not counted twice.
The result is then used similarly to the Cast solution.
First let's look at which genre claims the best mean IMDB rating and garners the best audience response.
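The whitespace edit matters because a cell like "Comedy, Drama, Family" splits into " Drama" with a leading space, which would otherwise be counted separately from "Drama". A small sketch, with split_genres as a hypothetical helper name:

```python
import pandas as pd


def split_genres(cell):
    """Split a comma-separated genre string and trim each entry."""
    if not isinstance(cell, str):  # missing genres
        return []
    return [genre.strip() for genre in cell.split(",") if genre.strip()]


genres = pd.Series(["Comedy, Drama, Family", "Drama", None])
genre_lists = genres.apply(split_genres)

# After stripping, both "Drama" entries fall into the same bucket.
print(genre_lists.explode().value_counts())
```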
Output:
Humans do love war, huh.
Next let's look at which genre creators love the most, and thus build the most shows around.
Instead of text output, a visual representation is more suitable here, so we generate a bar graph using the Series.plot() function.
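A sketch of such a bar graph follows. The genre lists are placeholders, and the Agg backend is selected only so the snippet also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import pandas as pd

# Placeholder genre lists, one list per show.
genre_lists = pd.Series(
    [["Comedy", "Drama", "Family"], ["Drama"], ["Drama", "Crime"]]
)

genre_counts = genre_lists.explode().value_counts()

# Series.plot returns a matplotlib Axes that can be labelled further.
ax = genre_counts.plot(kind="bar", title="Number of shows per genre")
ax.set_xlabel("Genre")
ax.set_ylabel("Shows")
ax.figure.tight_layout()
```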
Output:
So THAT is why Indian households end up being so dramatic...
Shows like "Sarabhai vs Sarabhai" were definitely much ahead of their time. But let's look at how time affected the rest of Indian TV.
To make use of the data in the Year column, we need to convert it from its original haphazard form into something usable.
I created two new columns based on the Year column:
First Year: This column tracks the year in which the show started airing.
Years Run: This column tracks how long a show ran.
These columns were created with the following code:
The code was written to handle edge cases like wrong data types and the odd "I XX" values in the Year column.
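Since the original code is not shown here, this is only one possible sketch. It assumes that every usable value contains one or two four-digit years, that an "I XX" value looks like "I 2017", and that a value with a single year counts as zero years run. Open-ended ranges of still-running shows would need their own rule. The helper name parse_years is hypothetical.

```python
import re

import pandas as pd


def parse_years(cell):
    """Return (first_year, years_run) from a raw Year value."""
    if not isinstance(cell, str):  # wrong data type, e.g. NaN
        return (None, None)
    years = [int(year) for year in re.findall(r"\d{4}", cell)]
    if not years:
        return (None, None)
    # Assumption: a single year means the run length is recorded as 0.
    return (years[0], years[-1] - years[0])


# Sample values; "I 2017" is a guess at the shape of the odd entries.
df = pd.DataFrame({"Year": ["2000–2008", "I 2017", "2015", float("nan")]})

parsed = pd.DataFrame(
    df["Year"].apply(parse_years).tolist(),
    columns=["First Year", "Years Run"],
    index=df.index,
)
df = df.join(parsed)
print(df)
```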
Which year was the busiest for the creators? We can use the following code to visualize the frequency of productions across years.
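The counting behind such a chart can be sketched as below, with made-up start years. Sorting by index keeps the years in chronological order for plotting.

```python
import pandas as pd

# Made-up start years, purely for illustration.
df = pd.DataFrame({"First Year": [2017, 2017, 2018, 2000, 2017, 2018]})

# Number of shows that started in each year, in chronological order.
shows_per_year = df["First Year"].value_counts().sort_index()
busiest = shows_per_year.idxmax()
print(shows_per_year)
print("Busiest year:", busiest)
```

Calling shows_per_year.plot(kind="bar") on this Series would then draw the frequency chart.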
Output:
2017 brought us shows like "Naagin 2", "Yeh Rishta Kya Kehlata Hai" and "Yeh Hein Mohabbatein". In total, it records the production of 59 shows, compared to 46 for the runner-up, 2018.
Indian shows like "Sasural Simar Ka" and "Kyunki Saas Bhi Kabhi Bahu Thi" are infamous for running long enough to be part of a late teenager's life since birth. So it's only natural to find out which show actually ran the longest.
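Assuming a numeric "Years Run" column already exists, the longest-running shows can be picked out with nlargest(). The names and values below are placeholders.

```python
import pandas as pd

# Placeholder shows and run lengths.
df = pd.DataFrame(
    {
        "Name": ["Show A", "Show B", "Show C"],
        "Years Run": [8, 20, 3],
    }
)

# The rows with the largest run lengths, longest first.
longest = df.nlargest(2, "Years Run")
print(longest)
```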
Output:
"C.I.D." is no doubt part of every Indian's life. With iconic characters like ACP Pradyuman, Abhijit, and Daya, and a premise revolving around crime in India, it's not a surprise that it had a runtime of 20 years.
Here comes the part I was most excited for. The written descriptions and overviews of these shows could surely provide some very interesting insights that could have been the highlights of this project.
Unfortunately, after cleaning the data and writing the code to analyze it, it was shocking to see how useless the ordeal was. The data was neither plentiful nor good enough to let me draw any real conclusions from it.
But I will still show the method I used to clean and try analyzing the data.
Similar to the approach I took for the other columns, I decided to convert the string values to a list, with every word being an element of the list. Additionally, all words were turned to lowercase and any special characters were removed so that redundancy was minimized.
The function was applied:
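A sketch of what such a cleaning function might look like. The helper name to_words is hypothetical, and keeping hyphens (so that a word like "mother-in-law" stays in one piece) is an assumption rather than something the original states.

```python
import re

import pandas as pd


def to_words(cell):
    """Lowercase an overview, drop special characters and split into words."""
    if not isinstance(cell, str):  # missing overview
        return []
    lowered = cell.lower()
    # Assumption: keep letters, digits, whitespace and hyphens only.
    cleaned = re.sub(r"[^a-z0-9\s-]", "", lowered)
    return cleaned.split()


overviews = pd.Series(["A mother-in-law's struggle!", None])
word_lists = overviews.apply(to_words)
print(word_lists[0])
```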
Now we have data that we can supposedly work on.
I planned to analyze multiple words like "love", "hate", "mother", "mother-in-law", "brother", etc. and their usage over time in the descriptions of TV Serials and even plot graphs showing interesting relations between the trends of different words.
This code gives the count of the words used grouped by years:
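One way to get per-year counts for a handful of tracked words is sketched here. The column name "words" and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: the start year and the cleaned word list of each overview.
df = pd.DataFrame(
    {
        "First Year": [2000, 2000, 2017],
        "words": [
            ["love", "family"],
            ["love", "life"],
            ["family", "love", "love"],
        ],
    }
)

tracked = ["love", "family"]

# One row per word, keep only the tracked words, then count per year.
exploded = df.explode("words")
exploded = exploded[exploded["words"].isin(tracked)]
counts_by_year = (
    exploded.groupby(["First Year", "words"]).size().unstack(fill_value=0)
)
print(counts_by_year)
```

The result has one row per year and one column per tracked word, so calling counts_by_year.plot() would draw one line per word.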
The following code could be used to plot how often words occur over time, and also to contrast different words.
A visualization generated through this code (provided better data) could have looked like this:
This data could have led to many other interesting analyses too, but unfortunately that was not possible.
We can still draw some simple insights from this data. Let us find out the 50 most used words in the descriptions for Indian TV Serials.
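The counting itself needs nothing beyond the standard library. A sketch with placeholder word lists:

```python
from collections import Counter

# Placeholder word lists, one per overview.
word_lists = [
    ["a", "family", "drama"],
    ["love", "and", "family"],
    ["life", "love", "family"],
]

# Count every word across all overviews and keep the 50 most common.
counter = Counter(word for words in word_lists for word in words)
top_words = counter.most_common(50)
print(top_words)
```

Common filler words such as "a" and "and" are not filtered out in this sketch, so they would appear near the top on real data.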
Output:
Some significant meaningful words come out to be "family", "love" and "life"... That is some Fast & Furious philosophy it seems.
Indian TV is definitely an interesting place to observe and analyze. This project aimed at looking at some of the angles of the vast possibilities that are present with proper datasets.
But the tip of the iceberg that we touched also gave us some interesting results:
Top 5 Indian TV Shows by IMDB Rating.
Artists with the best mean IMDB Rating.
Artists with the most experience.
Genre with the best mean IMDB Rating.
Genre with the most available content.
The release frequency of shows over the years.
The longest running shows.
Usage of certain words in the overviews of TV shows over time.
Most used words in TV Show descriptions.
This project also helped me cement my skills in data analysis, especially learning how to analyze a varied dataset in a multi-faceted fashion.
I also gained experience in cleaning data and in treating list-like values in cells so that their elements can be handled individually.
Thank you to everyone who actually stuck with reading till here, it was very fun for me to work on this project.
You can also read this project over at DEV.to and also download it as a PDF.