What is a Keywords Dictionary?
A Keywords Dictionary is a set of words grouped around a common theme. Consider this example: you manage a bank and want to improve your customer service, so you give your customers a feedback form. To make sense of what your customers complain or talk about, you need a list of keywords related to your business, and this is where a Keywords Dictionary comes in.
Since Keywords Dictionaries are specific to a business purpose, they might not be easily available unless someone has already built one and made it public. Hence, it is usually better to build your own custom Keywords Dictionary.
How to build your own Keywords Dictionary?
- First of all, we need to define a Data Source - which is usually a website with a list of Keywords that we are interested in.
- Extract Website content from the given URL
- Scrape the desired content (Keywords) from the website content
- Clean the scraped data if required and store locally for future use
Now, you may be wondering about a new piece of jargon introduced in one of the points: "scrape". It comes from the process called "Web Scraping", which Wikipedia defines as "a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis".
Getting started with Web Scraping:
Python has two great packages for web scraping: BeautifulSoup and Scrapy.
Both packages are great for different reasons. BeautifulSoup has an elegant and consistent API that makes it very simple for a beginner to get started with Web Scraping. Scrapy handles more complex scraping tasks, such as extracting content after logging in or submitting a form, which are not straightforward with BeautifulSoup. It also comes with its own set of complexities, which pay off in a complex scraping pipeline for someone who is already familiar with Web Scraping. Hence, we will use "BeautifulSoup" in this post to scrape data from the web. While BeautifulSoup does the job of parsing the HTML and making sense of the web content, we first need to "get" the web page, and we will use the "requests" package for that.
How to install requests & BeautifulSoup:
Requests can be installed using pip.
pip install requests - if you are using Python 2.x
pip3 install requests - if you are using Python 3.x
BeautifulSoup can also be installed using pip, or pip3 if you are using Python 3.x.
pip install beautifulsoup4
pip3 install beautifulsoup4
Loading both the libraries:
import requests
from bs4 import BeautifulSoup
We will use Moneycontrol.com's Glossary page to build our Finance Keywords Dictionary. Note that this post is for educational purposes only. Make sure you don't violate the Terms of Service of the websites you scrape.
url = "http://www.moneycontrol.com/glossary/"
Now that we have defined the URL, let us fetch its content.
content = requests.get(url) #sends a GET http request to collect the content
We can check if the request was successful by checking the response status.
content.status_code  # 200 means the request was successful
content_text = content.text #extracting the response content as text
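Now that the content is ready as text, we can use BeautifulSoup to make a "soup", that is, to parse the HTML.
soup = BeautifulSoup(content_text, "html.parser")  # naming the parser explicitly avoids the "no parser was explicitly specified" warning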
As we have seen in the above screenshot, what we are interested in within the extracted content is the HTML tag "a". But the website has many links, which could include junk such as social media links and other irrelevant links. A closer look at the screenshot also reveals that our desired URLs share a common pattern, "/glossary/". Hence we will extract the content with two conditions:
- only "a" tag
- "a" tag with "href" containing the string "glossary" in it
To extract all the "a" tag links, we will use the function "find_all()", and to find the string "glossary" in "href", we will use a regular expression ("regex") for pattern matching via Python's built-in "re" module.
import re
all_links = soup.find_all("a", href=re.compile("glossary"))
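To see what this filter keeps, here is a minimal sketch that uses only the "re" module. The href values are made up for illustration and are not taken from the actual page. As the BeautifulSoup documentation describes it, a compiled pattern passed to "find_all()" is applied with its "search()" method, so a match anywhere in the href counts.

```python
import re

# Hypothetical href values, similar in shape to what a glossary page might contain
sample_hrefs = [
    "/glossary/earnings-per-share-eps",  # made-up glossary entry link
    "https://twitter.com/example",       # made-up social media link
    "/news/markets/",                    # made-up unrelated link
    "/glossary/",                        # the glossary index page itself
]

pattern = re.compile("glossary")

# Keep only the hrefs in which the pattern occurs anywhere in the string
glossary_hrefs = [href for href in sample_hrefs if pattern.search(href)]
print(glossary_hrefs)
```

Note that the link to the glossary index also passes the filter, which is one reason the extracted keywords may need some cleaning afterwards.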
Now we are ready to extract the Keywords, which are nothing but the text values of the links that we extracted and stored in "all_links". We will use a "for" loop to iterate through each element of "all_links", extract its "text" value and store it in a list.
keywords = []  # empty list to store the keywords
for link in all_links:
    keywords.append(link.text)  # the visible text of the link is the keyword
A sample of the extracted keywords:
['Early (premature) Withdrawal',
'Early Retirement Penalty',
'Earned Income Rule',
'Earnings before taxes',
'Earnings multiple approach',
'Earnings per Share (EPS)',
'Economic double taxation',
That's it! We have successfully built a Finance Keywords Dictionary with 3165 entries. Please note that some of the keywords might need a little cleaning up and some business domain knowledge for further refinement before they are used in your Machine Learning model. This post can easily be replicated for your own needs with a simple change of the source URL and a few other tweaks. This code is also available as a Jupyter notebook on GitHub.
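The last step from our list, cleaning the scraped data and storing it locally, could look something like the sketch below. This is only one possible approach, not the code from the notebook: the helper names, the cleaning rules (trimming whitespace, dropping empty strings, removing case-insensitive duplicates) and the file name "finance_keywords.txt" are all assumptions made for illustration, and the "raw" list stands in for the scraped "keywords" list.

```python
from pathlib import Path


def clean_keywords(raw_keywords):
    """Trim whitespace, drop empty strings and remove duplicates, keeping the original order."""
    seen = set()
    cleaned = []
    for keyword in raw_keywords:
        keyword = " ".join(keyword.split())  # trim and collapse runs of whitespace
        key = keyword.lower()                # compare case-insensitively
        if keyword and key not in seen:
            seen.add(key)
            cleaned.append(keyword)
    return cleaned


def save_keywords(keywords, path):
    """Write one keyword per line as UTF-8 text."""
    Path(path).write_text("\n".join(keywords) + "\n", encoding="utf-8")


def load_keywords(path):
    """Read the keywords back, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


# Hypothetical raw values standing in for the scraped "keywords" list
raw = ["  Earnings per Share (EPS)", "Earned Income Rule\n", "", "earned income rule"]
cleaned = clean_keywords(raw)
save_keywords(cleaned, "finance_keywords.txt")  # hypothetical file name
print(load_keywords("finance_keywords.txt"))
```

A plain text file with one keyword per line is easy to inspect and edit by hand, which helps when domain knowledge is needed to prune the list.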