Working with string manipulation and basic file I/O operations in Python
Last night, this tweet appeared in my timeline and caught my attention:

"The word 'friends' is said in every episode of Friends." — Fact (@Fact), March 9, 2018
This didn't sound right, but there was no obvious way to verify it, and I expressed my doubts. A friend suggested counting the word in the subtitle file of every episode. I liked the idea, and a small weekend project was born.
Before I give away the spoiler on whether or not it is true, I want to explain how I got it done. I didn't have the subtitle files, so I had to download them all before searching for the word "friends". I began by looking for websites that would let me download the English subtitle files in the easiest way possible.
Originally I had planned to use urllib with BeautifulSoup to download the files, but TV Subtitles made it really easy for me (kudos to the site developers, by the way). They let you download season-wise bulk zip files of all episodes in the language of your choice, all through a very clean URL scheme. After tinkering with the URLs for a while, I wrote a loop to build the URLs for all ten seasons and checked that they worked with requests: a 200 status code for each one, so everything looked fine. I used a package called tqdm to show a console progress bar while the files downloaded one by one.
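Here is a rough sketch of that download loop. The URL pattern and the output filenames are assumptions for illustration; the actual links on TV Subtitles are structured differently.

    import requests
    from tqdm import tqdm

    # Hypothetical URL pattern for the season-wise zip files; the real
    # links on the site look different.
    BASE_URL = "https://www.tvsubtitles.net/friends-season-{season}-en.zip"

    for season in tqdm(range(1, 11), desc="Downloading seasons"):
        url = BASE_URL.format(season=season)
        response = requests.get(url)
        if response.status_code == 200:          # the URL works
            with open(f"season_{season}.zip", "wb") as archive:
                archive.write(response.content)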
Next, I needed to extract the zip files to get at each subtitle file, which turned out to be really easy with zipfile. However, the website bundles multiple subtitle files for each episode, one per video release they sync with, and that created a slight issue: I was going to get duplicate readings. I snooped around a little to find the best way to remove the duplicates. I found a snippet that hashed each file and removed files with duplicate hashes, but that wouldn't help here because the duplicate subtitles are not byte-for-byte identical. Instead, I extracted the season and episode numbers from each filename, used that pair as a unique ID, and deleted the files whose ID I had already seen. So far so good.
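A minimal sketch of that step, assuming the filenames contain a "1x05"-style season/episode token (the actual naming scheme on the site may differ):

    import glob
    import os
    import re
    import zipfile

    # Unpack every season archive into one folder.
    for archive in glob.glob("season_*.zip"):
        with zipfile.ZipFile(archive) as z:
            z.extractall("subtitles")

    # Keep only one subtitle file per episode, using the season/episode
    # numbers parsed from the filename as the unique ID.
    seen = set()
    for path in sorted(glob.glob("subtitles/*")):
        match = re.search(r"(\d+)x(\d+)", os.path.basename(path))
        if not match:
            continue
        episode_id = (int(match.group(1)), int(match.group(2)))
        if episode_id in seen:
            os.remove(path)      # a duplicate sync of an episode we already have
        else:
            seen.add(episode_id)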
Finally, I used a package called glob to pull the subtitle files into my runtime. Subtitle files usually come in two formats, .sub and .srt, so I passed the file type as a parameter and called the same function twice. I went through the files one at a time and counted how often "Friends" or "Friend" (in any case, of course) appeared in each. If the count came back zero, the word was never said in that episode.
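Something like the sketch below, where count_word is a hypothetical helper name and a case-insensitive regex covers both "friend" and "friends":

    import glob
    import re

    def count_word(extension):
        """Report subtitle files that never mention 'friend' or 'friends'."""
        for path in sorted(glob.glob(f"subtitles/*{extension}")):
            with open(path, encoding="latin-1", errors="ignore") as subtitle:
                text = subtitle.read()
            count = len(re.findall(r"friends?", text, flags=re.IGNORECASE))
            if count == 0:
                print(f"No 'friends' in {path}")

    # .srt and .sub files get the same treatment, so the function runs twice.
    count_word(".srt")
    count_word(".sub")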
All done. I sat back, ran my code, and waited a while. Voila. At least ten episodes popped up in my terminal with a zero "friends" count.
I proved a Twitter account with 1.8 million followers wrong.
@sndpwrites saves the day.
Find the code here and see for yourself.
https://github.com/sndpwrites/pSubtitleChecker
The word "friends" is said in every episode of Friends.— Fact (@Fact) March 9, 2018