
Analyzing text in Python 3


Patterns in written text are not the same for all authors or languages. This allows linguists to study language of origin or potential authorship of texts where these characteristics are not directly known, such as in the Federalist Papers of the American Revolution.

Purpose. In this case study we will look at properties of individual books in a collection of books from different authors and different languages. More specifically, we’ll look at the length of the books, the number of unique words, and how these attributes are grouped by language or authorship.

Source: Project Gutenberg, the oldest digital library of books. It was created to digitize and archive cultural works and currently holds over 50,000 books, all previously published and now available electronically. Download some of these English and French books from here and Portuguese and German books from here for analysis. Collect all of these books in a Books folder with English, French, German and Portuguese subfolders, each containing one folder per author.

Word frequency in the text

We are going to create a function that counts the frequency of words in a text. We’ll start with a short sample text and later replace it with the text files of the books we just downloaded. Because we are counting word frequency, uppercase and lowercase letters should be treated as the same, so we convert the whole text to lowercase first.

text = "This is my test text. We're keeping this text short to keep things manageable."

text = text.lower()

Word frequency can be counted in various ways. We’ll write the code in two ways (for information only): one uses a for loop and the other uses Counter from collections, which turns out to be faster. Each function returns a dictionary that maps every unique word to its frequency as a key-value pair. The code:

from collections import Counter


def count_words(text):  # counts word frequency
    # punctuation marks to strip before splitting into words
    skips = [".", ",", ":", ";", "'", '"']
    for ch in skips:
        text = text.replace(ch, "")

    word_counts = {}
    for word in text.split(" "):
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts


# >>> count_words(text)  You can check the function


def count_words_fast(text):  # counts word frequency using Counter from collections
    text = text.lower()
    skips = [".", ",", ":", ";", "'", '"']
    for ch in skips:
        text = text.replace(ch, "")

    word_counts = Counter(text.split(" "))
    return word_counts


# >>> count_words_fast(text)  You can check the function

Output: The output is a dictionary containing the unique words of the sample text as keys and the frequency of each word as values. Comparing the output of both functions, we have:

{'were': 1, 'is': 1, 'manageable': 1, 'to': 1, 'things': 1, 'keeping': 1, 'my': 1, 'test': 1, 'text': 2, 'keep': 1, 'short': 1, 'this': 2}

Counter({'text': 2, 'this': 2, 'were': 1, 'is': 1, 'manageable': 1, 'to': 1, 'things': 1, 'keeping': 1, 'my': 1, 'test': 1, 'keep': 1, 'short': 1})
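As a quick check on this comparison, the sketch below redefines both functions so that it runs on its own, confirms that they produce the same counts for the sample sentence, and times them with timeit. The loop version here uses dict.get as a shorthand, and the repetition count of 1000 is an arbitrary choice. Timings depend on the machine and on the length of the text, so no specific numbers are claimed.

```python
from collections import Counter
import timeit

SKIPS = [".", ",", ":", ";", "'", '"']  # punctuation stripped before counting


def count_words(text):
    """Count word frequencies with an explicit loop."""
    for ch in SKIPS:
        text = text.replace(ch, "")
    word_counts = {}
    for word in text.split(" "):
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts


def count_words_fast(text):
    """Count word frequencies with collections.Counter."""
    text = text.lower()
    for ch in SKIPS:
        text = text.replace(ch, "")
    return Counter(text.split(" "))


sample = "This is my test text. We're keeping this text short to keep things manageable.".lower()

slow_result = count_words(sample)
fast_result = count_words_fast(sample)

# Convert the Counter to a plain dict so the comparison is dict-to-dict
same = slow_result == dict(fast_result)
print(same)

# Time 1000 calls of each version (arbitrary repetition count)
loop_time = timeit.timeit(lambda: count_words(sample), number=1000)
counter_time = timeit.timeit(lambda: count_words_fast(sample), number=1000)
print(f"loop: {loop_time:.4f}s  Counter: {counter_time:.4f}s")
```

On a sentence this short the difference may be negligible; the gap is more likely to show on a full book.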

Reading books in Python: Now that we have successfully tested our word-frequency functions on the sample text, we are going to test them on the books we downloaded as text files. We’ll create a function called read_book() that reads a book, stores it as one long string in a variable, and returns it. The function’s parameter is the path of the book’s .txt file, which is passed in when the function is called.

def read_book(title_path):  # read the book and return it as a string
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
        # strip newline and carriage-return characters
        text = text.replace("\n", "").replace("\r", "")
    return text
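To try the reader without downloading anything, the sketch below writes a tiny stand-in file to a temporary directory and reads it back. The file name sample.txt and its contents are made up for illustration, and the function body assumes that read_book() removes newline and carriage-return characters without inserting a space.

```python
import os
import tempfile


def read_book(title_path):
    """Read a text file and return it as one string without line breaks."""
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
        text = text.replace("\n", "").replace("\r", "")
    return text


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for a downloaded book
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf8") as f:
        f.write("first line \nsecond line\n")
    book_text = read_book(sample_path)

print(repr(book_text))  # 'first line second line'
```

Because line breaks are deleted rather than replaced, the last word of a line joins the first word of the next unless a space precedes the break, as it does in this sample.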

Total unique words: Next we develop another function called word_stats() that accepts a word-frequency dictionary (the output of count_words_fast() or count_words()) as its parameter. The function returns a tuple holding the number of unique words (the number of keys in the dictionary) and a dict_values object with the frequency of each word.

def word_stats(word_counts):  # word_counts = count_words_fast(text)
    num_unique = len(word_counts)    # number of distinct words
    counts = word_counts.values()    # frequency of each word
    return (num_unique, counts)
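A small worked example may make the two return values concrete. The phrase below is an arbitrary toy input, not taken from the books, and word_stats() is repeated so the snippet runs on its own.

```python
from collections import Counter


def word_stats(word_counts):
    """Return the number of unique words and their frequencies."""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)


toy_counts = Counter("to be or not to be".split(" "))
num_unique, counts = word_stats(toy_counts)
total_words = sum(counts)  # summing the frequencies gives the total word count

print(num_unique, total_words)  # 4 6
```

The phrase has four distinct words (to, be, or, not) and six words in total, which is why summing the frequencies recovers the length of the text.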

Function calls: Finally, we read a book, for example the English version of Romeo and Juliet, and use the functions to collect its word frequencies, the number of unique words, and the total word count.

text = read_book("./Books/English/shakespeare/Romeo and Juliet.txt")

word_counts = count_words_fast(text)

(num_unique, counts) = word_stats(word_counts)

print(num_unique, sum(counts))

Output: 5118 40776

With the help of the functions we created, we learned that the English version of Romeo and Juliet contains 5118 unique words and that the frequencies of those words add up to 40776, which is the total number of words in the book. We can also find out which word occurs most often, and you can try different books in different languages to explore their statistics with the functions above.
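One way to find the most frequent words, assuming the counts come from count_words_fast() and are therefore a Counter, is the most_common() method. The sentence below is a toy input chosen for illustration.

```python
from collections import Counter

toy_text = "the cat and the dog and the bird"
word_counts = Counter(toy_text.split(" "))

# most_common(n) returns the n highest-frequency (word, count) pairs
top_two = word_counts.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```

A plain dictionary from count_words() has no such method; wrapping it as Counter(word_counts) would make it available.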

Plotting the characteristic features of books

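We’re going to plot book length against the number of unique words for all books in the different languages using matplotlib. We import pandas to create a DataFrame that stores the language, author, title, length, and unique-word count of each book.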

import os
import pandas as pd

book_dir = "./Books"
os.listdir(book_dir)

stats = pd.DataFrame(columns=("language", "author", "title", "length", "unique"))
# >>> stats  check the statistics

title_num = 1
# walk the Books/<language>/<author>/<title> folder tree
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            inputfile = book_dir + "/" + language + "/" + author + "/" + title
            print(inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stats(count_words_fast(text))
            # add one row per book: length is the total word count
            stats.loc[title_num] = language, author.capitalize(), title.replace(".txt", ""), sum(counts), num_unique
            title_num += 1
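The nested loops rely on the Books/language/author/title layout. As an alternative sketch, not the article's method, pathlib can walk the same layout with a single glob pattern. The helper name list_books and the placeholder files created here are hypothetical and exist only so the snippet can run without the downloaded books.

```python
import tempfile
from pathlib import Path


def list_books(book_dir):
    """Yield (language, author, title, path) for each .txt file two folders below book_dir."""
    for path in sorted(Path(book_dir).glob("*/*/*.txt")):
        language, author = path.parts[-3], path.parts[-2]
        yield language, author.capitalize(), path.stem, path


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a tiny placeholder tree that mimics the Books folder
    for rel in ["English/shakespeare/Hamlet.txt", "French/moliere/Tartuffe.txt"]:
        target = Path(tmp_dir, rel)
        target.parent.mkdir(parents=True)
        target.write_text("placeholder", encoding="utf8")

    rows = [(lang, author, title) for lang, author, title, _ in list_books(tmp_dir)]

print(rows)
```

Each yielded path could then be passed to read_book(), and the glob pattern skips any stray files that are not .txt.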

import matplotlib.pyplot as plt

# first graph: line plot (blue) and log-log plot (red) of all books
plt.plot(stats.length, stats.unique, "bo-")

plt.loglog(stats.length, stats.unique, "ro")

stats[stats.language == "English"]  # check information in English books

# second graph: one color per language on log-log axes
plt.figure(figsize=(10, 10))

subset = stats[stats.language == "English"]
plt.loglog(subset.length, subset.unique, "o", label="English", color="crimson")

subset = stats[stats.language == "French"]
plt.loglog(subset.length, subset.unique, "o", label="French", color="forestgreen")

subset = stats[stats.language == "German"]
plt.loglog(subset.length, subset.unique, "o", label="German", color="orange")

subset = stats[stats.language == "Portuguese"]
plt.loglog(subset.length, subset.unique, "o", label="Portuguese", color="blueviolet")

plt.legend()
plt.xlabel("Book Length")
plt.ylabel("Number of Unique words")
plt.savefig("fig.pdf")
plt.show()

Conclusion: We have drawn two graphs. In the first, every book is simply a point, regardless of its language or author. Each red dot represents one book, and the dots are linked by blue lines: the log-log plot draws the individual points (red here) and the line plot draws the connecting line (blue here). The second graph is a logarithmic plot in which books in different languages appear as discrete dots in different colors (crimson for English, green for French, orange for German, violet for Portuguese).
These graphs help you visually compare books of different origins. In this collection, the graph suggests that books in Portuguese are longer and contain more unique words than books in German or English. Plotting such data can be very useful to linguists.
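A visual impression like this can be cross-checked numerically by grouping the table by language. The sketch below uses a small table of made-up numbers, not measurements of real books, with the same column names as the stats DataFrame.

```python
import pandas as pd

# Made-up numbers purely for illustration; they do not describe real books.
toy_stats = pd.DataFrame(
    [
        ("English", 1000, 300),
        ("English", 3000, 500),
        ("French", 2000, 600),
    ],
    columns=("language", "length", "unique"),
)

# Average book length and unique-word count per language
summary = toy_stats.groupby("language")[["length", "unique"]].mean()
print(summary)
```

Applied to the real stats table, the same groupby call would give one row of averages per language. The columns there may first need converting to a numeric type, for example with astype(int), since rows added one at a time to an empty DataFrame can end up stored as generic objects.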


This article is courtesy of Amartya Ranjan Saikia.
