
Analyzing text in Python 3


Patterns in written text differ between authors and between languages. This allows linguists to study the language of origin or the likely authorship of texts where these characteristics are not directly known, such as the Federalist Papers.

Purpose. In this case study we will look at properties of individual books in a collection of books from different authors and different languages. More specifically, we’ll look at the length of the books, the number of unique words, and how these attributes are grouped by language or authorship.

Source: Project Gutenberg, the oldest digital book library. Designed for the digitization and archiving of cultural works, it currently contains over 50,000 books, all previously published and now available electronically. Download a selection of English, French, German, and Portuguese books from it for analysis. Collect all of these books in a Books folder with English, French, German, and Portuguese subfolders; the code below expects the layout Books/<language>/<author>/<title>.txt.

Word frequency in the text

We are going to create a function that counts the frequency of words in a text. We will start with a short sample text and later replace it with the text files of the books we just downloaded. Since we are counting word frequency, uppercase and lowercase letters should be treated as the same, so we convert the whole text to lowercase and keep it that way.

text = "This is my test text. We're keeping this text short to keep things manageable."
text = text.lower()

Word frequency can be counted in several ways. We will write the code in two ways (for illustration only): one uses a for loop, and the other uses Counter from collections, which turns out to be faster. Each function returns a dictionary that maps every unique word to its frequency. The code:

from collections import Counter

def count_words(text):  # counts word frequency with a plain loop
    skips = [".", ",", ":", ";", "'", '"']  # punctuation to strip
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = {}
    for word in text.split(" "):
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

# >>> count_words(text)  You can check the function

def count_words_fast(text):  # counts word frequency using Counter from collections
    text = text.lower()
    skips = [".", ",", ":", ";", "'", '"']  # punctuation to strip
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

# >>> count_words_fast(text)  You can check the function

Output: The output is a dictionary with the unique words of the sample text as keys and the frequency of each word as values. Comparing the output of both functions, we have:

{'were': 1, 'is': 1, 'manageable': 1, 'to': 1, 'things': 1, 'keeping': 1, 'my': 1, 'test': 1, 'text': 2, 'keep': 1, 'short': 1, 'this': 2}

Counter({'text': 2, 'this': 2, 'were': 1, 'is': 1, 'manageable': 1, 'to': 1, 'things': 1, 'keeping': 1, 'my': 1, 'test': 1, 'keep': 1, 'short': 1})
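As a quick check that both approaches agree, the sketch below redefines the two functions in compact form (so it runs on its own), compares their results, and times them with timeit. The constant SKIPS, the variable names, and the repetition count are choices made for this illustration only. Timings depend on the machine, so none are shown.

```python
import timeit
from collections import Counter

SKIPS = [".", ",", ":", ";", "'", '"']  # same punctuation list as in the article

def count_words(text):
    """Count words with a plain dictionary and a for loop."""
    for ch in SKIPS:
        text = text.replace(ch, "")
    word_counts = {}
    for word in text.split(" "):
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

def count_words_fast(text):
    """Count words with collections.Counter."""
    text = text.lower()
    for ch in SKIPS:
        text = text.replace(ch, "")
    return Counter(text.split(" "))

sample = "This is my test text. We're keeping this text short to keep things manageable.".lower()

slow_result = count_words(sample)
fast_result = count_words_fast(sample)

# Counter is a dict subclass, so converting it to a dict allows a direct comparison.
same = slow_result == dict(fast_result)
print("Same counts:", same)

# Time both versions on a longer input; absolute numbers depend on the machine.
long_text = sample * 1000
print("loop:   ", timeit.timeit(lambda: count_words(long_text), number=20))
print("Counter:", timeit.timeit(lambda: count_words_fast(long_text), number=20))
```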

Reading books in Python: Now that we have successfully tested our word frequency functions on the sample text, we are going to test them on the books, which we have downloaded as text files. We will create a function called read_book() that reads a book in Python, stores it as one long string in a variable, and returns it. The function parameter is the path of the book .txt file, which is passed in when the function is called.

def read_book(title_path):  # read the book and return it as a string
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
        # drop line breaks so the book becomes one long string
        text = text.replace("\n", "").replace("\r", "")
    return text
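To see what read_book does with line breaks, here is a small self-contained sketch that writes a two-line temporary file and reads it back. It assumes the function strips "\n" and "\r" without putting anything in their place, in which case the last word of one line fuses with the first word of the next. The variant read_book_spaces is a hypothetical alternative, not part of the original article, that replaces line breaks with spaces instead.

```python
import os
import tempfile

def read_book(title_path):
    """Read a book and return it as one string, dropping line breaks."""
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
    return text.replace("\n", "").replace("\r", "")

def read_book_spaces(title_path):
    """Hypothetical variant: turn line breaks into spaces so words stay separate."""
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
    return text.replace("\r", "").replace("\n", " ")

# Write a tiny two-line "book" to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", encoding="utf8", delete=False) as tmp:
    tmp.write("two households\nboth alike")
    tmp_path = tmp.name

fused = read_book(tmp_path)
spaced = read_book_spaces(tmp_path)
os.remove(tmp_path)

print(fused)   # line break removed, words at the boundary are joined
print(spaced)  # line break replaced by a space
```

Which behavior is preferable depends on the goal: the variant changes the word counts, so its results would not match the figures reported in this article.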

Total Unique Words: We are going to write another function called word_stats() that accepts the word frequency dictionary (the output of count_words_fast() or count_words()) as its parameter. The function returns, as a tuple, the number of unique words (the number of keys in the word frequency dictionary) and a dict_values object holding the frequency of each word.

def word_stats(word_counts):  # word_counts = count_words_fast(text)
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)
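A small usage sketch: word_stats applied to a hand-made Counter. The sample counts are invented for illustration. Summing the returned values gives the total number of words, which is how the book length is computed later on.

```python
from collections import Counter

def word_stats(word_counts):
    """Return the number of unique words and a view of their frequencies."""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

# Invented sample: three distinct words, six occurrences in total.
sample_counts = Counter({"text": 3, "this": 2, "short": 1})

num_unique, counts = word_stats(sample_counts)
total_words = sum(counts)
print(num_unique, total_words)  # 3 6
```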

Function calls: Finally, we read a book, for example the English version of Romeo and Juliet, and use the functions to collect its word frequencies, the number of unique words, and the total word count.

text = read_book("./Books/English/shakespeare/Romeo and Juliet.txt")

word_counts = count_words_fast(text)
(num_unique, counts) = word_stats(word_counts)
print(num_unique, sum(counts))

Output: 5118 40776

With the help of the functions we created, we learned that the English version of Romeo and Juliet contains 5118 unique words, and that the frequencies of those words sum to 40776, the total word count. We can also find out which word occurs most often in the book, and we can experiment with other books, in different languages, to obtain their statistics with the same functions.
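Because count_words_fast returns a Counter, the most frequent words can be read off with its most_common method. A minimal sketch on the sample sentence, with punctuation already removed (the variable names are illustrative):

```python
from collections import Counter

sample = "this is my test text were keeping this text short to keep things manageable"
word_counts = Counter(sample.split(" "))

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = word_counts.most_common(2)
print(top_two)
```

The same call on the Counter built from a whole book would list its most frequent words.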

Plotting the characteristics of books

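We are going to plot book length against the number of unique words for all books in the different languages, using matplotlib. We import pandas to create a DataFrame that holds one row per book, with the columns language, author, title, length, and unique: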

import os
import pandas as pd

book_dir = "./Books"
os.listdir(book_dir)  # lists the language subfolders

stats = pd.DataFrame(columns=("language", "author", "title", "length", "unique"))
# >>> stats  (check the empty table)

title_num = 1
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            inputfile = book_dir + "/" + language + "/" + author + "/" + title
            print(inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stats(count_words_fast(text))
            # one row per book; length is the total word count
            stats.loc[title_num] = language, author.capitalize(), title.replace(".txt", ""), sum(counts), num_unique
            title_num += 1
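The nested loops assume the layout Books/<language>/<author>/<title>.txt. The sketch below builds a tiny directory tree of that shape in a temporary folder and walks it with pathlib. The placeholder files and the helper iter_books are made up for the example. Unlike os.listdir, the glob pattern only matches .txt files, so stray files in the author folders are skipped.

```python
import tempfile
from pathlib import Path

def iter_books(book_dir):
    """Yield (language, author, title, path) for every .txt file three levels down."""
    for path in sorted(Path(book_dir).glob("*/*/*.txt")):
        language = path.parent.parent.name
        author = path.parent.name
        title = path.stem  # file name without the .txt extension
        yield language, author, title, path

with tempfile.TemporaryDirectory() as tmp:
    book_dir = Path(tmp) / "Books"
    # Placeholder files that only mimic the expected folder layout.
    for rel in ["English/shakespeare/Hamlet.txt",
                "German/schiller/Wallenstein.txt",
                "German/schiller/notes.md"]:
        target = book_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder", encoding="utf8")

    found = [(lang, author.capitalize(), title)
             for lang, author, title, _ in iter_books(book_dir)]

print(found)
```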

import matplotlib.pyplot as plt

plt.plot(stats.length, stats.unique, "bo-")

plt.loglog(stats.length, stats.unique, "ro")

 

stats[stats.language == "English"]  # check information on English books

 

plt.figure(figsize=(10, 10))
subset = stats[stats.language == "English"]
plt.loglog(subset.length, subset.unique, "o", label="English", color="crimson")
subset = stats[stats.language == "French"]
plt.loglog(subset.length, subset.unique, "o", label="French", color="forestgreen")
subset = stats[stats.language == "German"]
plt.loglog(subset.length, subset.unique, "o", label="German", color="orange")
subset = stats[stats.language == "Portuguese"]
plt.loglog(subset.length, subset.unique, "o", label="Portuguese", color="blueviolet")
plt.legend()
plt.xlabel("Book Length")
plt.ylabel("Number of Unique words")
plt.savefig("fig.pdf")
plt.show()

Conclusion: We have drawn two graphs. In the first, every book is treated simply as a book, regardless of language or author: the loglog call draws one red dot per book, and the plot call draws the blue line segments that connect the books. The second graph is a log-log plot in which the books of each language are shown as discrete dots in a different color (crimson for English, green for French, and so on).
These graphs help to compare books of different origin visually. We learned from the graph that books in Portuguese are longer and contain more unique words than books in German or English. Plotting such data can be very useful to linguists.
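The visual comparison can be backed by a simple per-language summary. The sketch below averages length and unique word count by language using only the standard library. The rows are invented placeholders, not measurements from the books; in practice they would come from the stats table built above.

```python
from collections import defaultdict
from statistics import mean

# Invented placeholder rows: (language, length, unique). Not real book statistics.
rows = [
    ("English", 1000, 200),
    ("English", 3000, 400),
    ("German", 2000, 500),
]

# Group the (length, unique) pairs by language.
by_language = defaultdict(list)
for language, length, unique in rows:
    by_language[language].append((length, unique))

# Average both measures within each language.
summary = {
    language: (mean(l for l, _ in values), mean(u for _, u in values))
    for language, values in by_language.items()
}

for language, (avg_length, avg_unique) in summary.items():
    print(language, avg_length, avg_unique)
```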




This article is courtesy of Amartya Ranjan Saikia.
