Python Word Frequency Tutorial

Ever wondered about a quick way to tell what a document is focusing on? What is its main topic? Let me give you this simple trick. List the unique words mentioned in the document, and then check how many times each word has been mentioned (its frequency). The most frequent words give you an indication of what the document is mainly about. But that would be a very boring, slow, and tiring task if done manually. We need an automated process, don't we?

Yes, an automated process will make this much easier. Let's see how we can list the different unique words in a text file and check the frequency of each word using Python.

1. Get the Test File

In this tutorial, we are going to use test.txt as our test file. Go ahead and download it, but don't open it! Let's play a small game. The text inside this test file is from one of my tutorials at Envato Tuts+. Based on the frequency of words, let's guess which of my tutorials this text was extracted from.

Let the game begin!

About Regular Expressions

Since we are going to apply a pattern in our game, we need to use regular expressions (regex). If "regular expressions" is a new term to you, this is a nice definition from Wikipedia:

A sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and grep, a filter

If you want to know more about regular expressions before moving ahead with this tutorial, you can see my other tutorial Regular Expressions In Python, and come back again to continue this tutorial.

2. Building the Program

Let's work step by step on building this game. The first thing we want to do is to store the text file in a string variable.

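# Open the file and read its entire contents into one string.
with open('test.txt', 'r') as document_text:
    text_string = document_text.read()

Now, in order to make it easier to apply our regular expression, let's turn all the letters in our document into lowercase letters, using the string method lower(), as follows:

# Lowercase the string we already read; calling read() a second time would return an empty string.
text_string = text_string.lower()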

Let's write a regular expression that returns all the words that are 3 to 15 characters long. Starting from 3 helps avoid short words whose frequency we are probably not interested in, like if, of, and in, while strings longer than 15 letters are unlikely to be real words. The regular expression for such a pattern looks like this:

\b[a-z]{3,15}\b

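\b matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For more information on the word boundary, you can check this tutorial.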
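
To see what the length range and the word boundaries do, here is a minimal sketch on a made-up sample string. The sample text and the variable names are assumptions for illustration and are not taken from the test file.

```python
import re

# Hypothetical sample text, already lowercased.
sample = "an apple is in the appletree"

# With \b on both sides, only whole words of 3 to 15 letters match.
whole_words = re.findall(r'\b[a-z]{3,15}\b', sample)

# Without \b, a fixed-length pattern also matches pieces inside longer words.
pieces = re.findall(r'[a-z]{3}', sample)

print(whole_words)
print(pieces)
```

The short words an, is, and in are skipped in both cases, but only the version with \b returns complete words.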

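In Python, the above regular expression can be applied as follows. Note that re.search returns only the first match, as a match object:

match_pattern = re.search(r'\b[a-z]{3,15}\b', text_string)

Since we want to walk through all the words in the document, not just the first one, we use the findall function instead, which the Python documentation describes as follows: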

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
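
The difference between the two functions can be sketched on a small string. The sample text below is an assumption for illustration only.

```python
import re

# Hypothetical sample text for illustration.
text_string = "python makes counting words in python easy"

# re.search stops at the first match and returns a match object.
first_match = re.search(r'\b[a-z]{3,15}\b', text_string)
print(first_match.group())

# re.findall returns every non-overlapping match as a list of strings.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
print(match_pattern)
```

Because findall returns a plain list of strings, the result can be looped over directly, which is what the counting step needs.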

At this point, we want to find the frequency of each word in the document. A suitable structure for this is a Python dictionary, since we need key-value pairs, where the key is the word and the value is the number of times that word appeared in the document.

Assuming we have declared an empty dictionary, frequency = {}, and that match_pattern holds the list returned by re.findall, the counting logic looks as follows:

for word in match_pattern:
    count = frequency.get(word,0)
    frequency[word] = count + 1
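
Here is a small sketch of the same counting idea on a hand-written list, so the result is easy to check by eye. The word list is made up, and collections.Counter is shown only as a standard-library alternative, not as the approach this tutorial uses.

```python
from collections import Counter

# Hypothetical list of words, standing in for the result of re.findall.
match_pattern = ['python', 'regex', 'python', 'words', 'regex', 'python']

# Same idea as the counting loop: get() supplies 0 for words not seen yet.
frequency = {}
for word in match_pattern:
    frequency[word] = frequency.get(word, 0) + 1

print(frequency)

# Counter from the standard library builds the same mapping in one call.
counter_frequency = Counter(match_pattern)
print(dict(counter_frequency) == frequency)
```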

We can now see our keys using:

frequency_list = frequency.keys()

Finally, in order to get the word and its frequency (the number of times it appeared in the text file), we can do the following:

for words in frequency_list:
    print(words, frequency[words])

Let's put the program together in the next section, and see what the output looks like.

3. Putting It All Together

Having discussed the program step by step, let's now see how the program looks:

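import re

frequency = {}

# Read the whole file and lowercase it so that counting is case-insensitive.
with open('test.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# Collect every word made of 3 to 15 lowercase letters.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Count how many times each word occurs.
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

frequency_list = frequency.keys()

# Print each unique word next to its frequency.
for words in frequency_list:
    print(words, frequency[words])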

If you run the program, it prints each unique word followed by the number of times it appears, one pair per line.

Let's come back to our game. Going through the word frequencies, what do you think the test file (with content from my other Python tutorial) was talking about?

(Hint: check the word with the maximum frequency).

4. Get the Most Frequent Words

In the above example, the list of unique words was fairly small due to a small text sample. So we could pick the most frequent word after glancing through the list relatively quickly.

What if the text sample is quite large? In that case, it is much easier to find the most frequent words by building a simple sort into our program. Here is some example code that gets the most frequently used words from an excerpt of Dracula.

import re

frequency = {}

# Read the whole file and lowercase it so that counting is case-insensitive.
with open('dracula.txt', 'r') as document_text:
    text_string = document_text.read().lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Sort the (word, count) pairs by count, highest first.
# The rebuilt dict keeps this order on Python 3.7 and later.
most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))

most_frequent_count = most_frequent.keys()

for words in most_frequent_count:
    print(words, most_frequent[words])

After executing the program, I got a list of words ordered from the most frequent to the least frequent.
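
If only the top few words matter, the sorted list can be sliced before printing. This is a minimal sketch with made-up counts; the numbers are not taken from the Dracula excerpt.

```python
# Hypothetical word counts for illustration.
frequency = {'night': 4, 'count': 9, 'castle': 6, 'door': 2}

# Sort (word, count) pairs by count, highest first.
sorted_items = sorted(frequency.items(), key=lambda elem: elem[1], reverse=True)

# Keep only the top 3 pairs by slicing the sorted list.
top_three = sorted_items[:3]

for word, count in top_three:
    print(word, count)
```

Slicing the list of pairs avoids printing thousands of rare words when the text is long.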

5. Exclude Specific Words From the Count

You can usually expect the most common word in any large piece of English text to be the word "the". You can get rid of such unwanted filler words for a better analysis of the text by creating a blacklist and only adding words to your dictionary if they're not in the blacklist.

import re

frequency = {}

# Read the whole file and lowercase it so that counting is case-insensitive.
with open('dracula.txt', 'r') as document_text:
    text_string = document_text.read().lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Filler words that should not be counted.
blacklisted = ['the', 'and', 'for', 'that', 'which']

for word in match_pattern:
    if word not in blacklisted:
        count = frequency.get(word, 0)
        frequency[word] = count + 1

# Sort the (word, count) pairs by count, highest first.
most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))

most_frequent_count = most_frequent.keys()

for words in most_frequent_count:
    print(words, most_frequent[words])

Running the above code on the same file produces the same kind of sorted list, but without the blacklisted words.
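
For a longer blacklist, a set may be a better fit than a list, since membership tests on a set do not slow down as the blacklist grows. The following is a sketch on a made-up sentence, assuming the same regular expression as before.

```python
import re

# Hypothetical sample text for illustration.
text_string = "the count and the castle and the night"

# A set gives fast membership tests; the words listed are only an example.
blacklisted = {'the', 'and', 'for', 'that', 'which'}

frequency = {}
for word in re.findall(r'\b[a-z]{3,15}\b', text_string):
    if word not in blacklisted:
        frequency[word] = frequency.get(word, 0) + 1

print(frequency)
```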

Final Thoughts

In this tutorial, we learned how to get the frequency of words in a text sample by using a simple Python program. We also modified the original code to get a list of the most frequent words or only get words that are not in our blacklist. Hopefully, you will now be able to update the program according to your own individual needs to analyze any piece of text.
