One of my recent data science lab assignments was to choose a musical artist, scrape all of their song lyrics from the web, and use a Markov chain to generate new lyrics. The following program works by compiling a separate list for every word sung by Drake. Each list holds every word that has ever followed the word it belongs to, repeats included, so a word that follows more often is more likely to be picked. For example, “you” –> [“to”, “just”, “need”, “finish”, “boys”]. The program then uses these connections and some random selection to generate new lines. I’ve used Drake as my test subject because I think his lyrical style is very recognizable, but you can run this script for any artist. All you need to do is swap out the LyricsFreak URL. Note that the new URL must have the same structure (e.g. .com/d/drake/) or else the web scraping will not work.
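Before the full script, here is a toy version of the idea with no web scraping involved. It is only a sketch: the two lines in `toy_lines` are made up for illustration (they are not real lyrics), and the names `toy_lines`, `followers`, and `rng` are my own and do not appear in the script.

```python
import random

# Toy corpus standing in for scraped lyrics (made-up lines, not real lyrics).
toy_lines = ["you need to go", "you just need to know"]

# Map each word to the list of words that follow it, repeats included.
followers = {}
for line in toy_lines:
    words = line.split()
    for current, nxt in zip(words, words[1:]):
        followers.setdefault(current, []).append(nxt)

print(followers["you"])   # ['need', 'just']
print(followers["need"])  # ['to', 'to']

# Walk the table: start at "you" and keep picking a random follower
# until reaching a word that nothing follows.
rng = random.Random(0)
word = "you"
generated = [word]
while word in followers:
    word = rng.choice(followers[word])
    generated.append(word)
print(" ".join(generated))
```

Because "to" appears twice in the list for "need", it is the only possible follower there, while "you" can lead to either "need" or "just". The real script does the same thing, with extra markers for the start of a song, line breaks, and the end of a song.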
import random
import time as t

import requests
from bs4 import BeautifulSoup
lyrics = []
links = []

# Collect the link to every song on the artist's page.
songs = requests.get("http://www.lyricsfreak.com/d/drake/")
parser = BeautifulSoup(songs.text, "html.parser")
for song in parser.find_all("td", class_="colfirst"):
    link = song.find("a")['href']
    links.append("http://www.lyricsfreak.com" + link)

# Download each song page and keep its lyrics as a list of strings.
for link in links:
    song = requests.get(link)
    parser = BeautifulSoup(song.text, "html.parser")
    t.sleep(0.1)  # short pause between requests
    lines_dump = parser.find("div", class_="dn", id="content_h")
    if lines_dump is not None:
        lyrics.append(list(lines_dump.strings))
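def train_markov_chain(lyrics):
    # Map each word to the list of words that have followed it.
    # "<START>" holds song openers, "<N>" marks a line break, "<END>" ends a song.
    transitions = {"<START>": [],
                   "<END>": [],
                   "<N>": []}
    for lyric in lyrics:
        # Skip whitespace-only strings so the first and last real lines
        # are the ones tagged with <START> and <END>.
        lyric = [line for line in lyric if line.strip()]
        for lnum, line in enumerate(lyric):
            chopped = line.split()
            for wnum, word in enumerate(chopped):
                if word not in transitions:
                    transitions[word] = []
                # Record how this word was reached if it opens a song or a line.
                if lnum == 0 and wnum == 0:
                    transitions["<START>"].append(word)
                elif wnum == 0:
                    transitions["<N>"].append(word)
                # Record what follows this word.
                if lnum == len(lyric) - 1 and wnum == len(chopped) - 1:
                    transitions[word].append("<END>")
                elif wnum == len(chopped) - 1:
                    transitions[word].append("<N>")
                else:
                    transitions[word].append(chopped[wnum+1])
    return transitions

chain = train_markov_chain(lyrics)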
def generate_new_lyrics(chain):
    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain["<START>"]))
    # keep picking a random follower of the last word until <END> comes up
    done = False
    while not done:
        ondeck = random.choice(chain[words[-1]])
        if ondeck == "<END>":
            done = True
        else:
            words.append(ondeck)
    # join the words together into a string with line breaks;
    # <END> is never appended, so every word in the list is kept
    lyrics = " ".join(words)
    return "\n".join(lyrics.split(" <N> "))

print(generate_new_lyrics(chain))
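One thing to watch for: the loop in generate_new_lyrics only stops when it happens to draw `<END>`, so the length of a song is unpredictable, and every run gives a different result. Below is a minimal sketch of a variant with a word cap and an optional seed. The names `generate_capped`, `max_words`, `seed`, and `toy_chain` are hypothetical and not part of the script above, and the tiny chain is hand-written so the example runs without any scraping.

```python
import random

def generate_capped(chain, max_words=200, seed=None):
    """Hypothetical variant of generate_new_lyrics with a word cap and a seed."""
    rng = random.Random(seed)
    words = [rng.choice(chain["<START>"])]
    # Stop at <END> or once max_words tokens (line-break markers included) are drawn.
    while len(words) < max_words:
        ondeck = rng.choice(chain[words[-1]])
        if ondeck == "<END>":
            break
        words.append(ondeck)
    # Do not leave a dangling line-break marker if the cap cut the song short.
    if words[-1] == "<N>":
        words.pop()
    return " ".join(words).replace(" <N> ", "\n")

# Tiny hand-written chain in the same shape train_markov_chain returns
# (made-up words, for illustration only).
toy_chain = {
    "<START>": ["hello"],
    "<END>": [],
    "<N>": ["again"],
    "hello": ["<N>", "again"],
    "again": ["hello", "<END>"],
}

song = generate_capped(toy_chain, max_words=10, seed=1)
print(song)
```

Passing the same seed twice returns the same song, which helps when comparing changes to the training step. With the real data you would call it as `generate_capped(chain)`.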
Here’s one of the songs the script spit back out:
She in it heals all, all, all, switch the ground off multi-platinum recordings Drinking watch is this is in here I see my pen up to know what she two up the rims on a little, why is in my day I got to who you strip for you just like the A&R men never forget it poppin' don't make me there is for me who's it go that you might just show up... Damn. ... on camera I can also raise one so well, this money til next to Texas back in, You ain't last season changed you gon' come back about her and screwed, I left your clothes We can tell the year 'round me and hotels that I mean one else's I need more (More) Twenty five o four, need you I don't know you wanna be, I say you need some dinner you hope you probably end getting right now that OvO that (Money) So much to check if you pass it anyways, Yeah, Tom Ford tuscan leather smelling like There ain't nothing I'mma do it... still fly though, I really have to the ceiling I give it start? The Tires Burnin I need it up in the morning Ever since I know I'm squeezin' in my thoughts of Four You Always gone Some Girls That I before I guess that's where you Somebody shoula told me. This Girl you bring us down and it's where it like oh-ah-oh-oh Tuck my whole city faded (the ride) My memories of baked ziti Let It Hurt Faces I see your friend No One Else Beat the past piss