NLP | Distributed Tagging with Execnet — Part 1


Bigrams: A bigram is a pair of 2 consecutive words in a sentence. For example, in "A boy is playing football", the bigrams are:

 A boy, boy is, is playing, playing football

Trigrams: A trigram is a sequence of 3 consecutive words in a sentence. For the above example, the trigrams would be:

 A boy is, boy is playing, is playing football
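
Both can be generated directly with NLTK's ngrams utility. A minimal sketch (using simple whitespace splitting instead of a full tokenizer, just for illustration):

from nltk.util import ngrams

sentence = "A boy is playing football"
tokens = sentence.split()  # simple whitespace tokenization for illustration

print(list(ngrams(tokens, 2)))
# [('A', 'boy'), ('boy', 'is'), ('is', 'playing'), ('playing', 'football')]
print(list(ngrams(tokens, 3)))
# [('A', 'boy', 'is'), ('boy', 'is', 'playing'), ('is', 'playing', 'football')]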

Of the above bigrams and trigrams, some are meaningful, while others add no value for further processing and are discarded.
Let's say we want to learn from a document the skills required to become a "data scientist". If we consider only single words, one word cannot convey the details properly. If the document contains a phrase like "Machine Learning Developer", then the extracted term should be "Machine Learning" or "Machine Learning Developer"; the single words "Machine", "Learning" or "Developer" alone will not work as expected.

Code: Python code to extract and rank trigrams

# Importing libraries
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

# Read the input file
txt1 = []
with open(r'C:\Users\DELL\Desktop\MachineLearning1.txt') as file:
    txt1 = file.readlines()

 
# Pre-processing
def remove_string_special_characters(s):

    # Replace special characters with ''
    stripped = re.sub(r'[^a-zA-Z\s]', '', s)
    stripped = re.sub(r'_', '', stripped)

    # Change any run of whitespace to a single space
    stripped = re.sub(r'\s+', ' ', stripped)

    # Remove start and end spaces
    stripped = stripped.strip()
    if stripped != '':
        return stripped.lower()
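
# Note: the listing defines remove_string_special_characters() but never
# calls it; presumably each line is cleaned before the stop-word step, e.g.:
txt1 = [remove_string_special_characters(line) for line in txt1]
txt1 = [line for line in txt1 if line]  # the function returns None for empty lines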

 
# Remove stop words
stop_words = set(stopwords.words('english'))
your_list = ['skills', 'ability', 'job', 'description']
for i, line in enumerate(txt1):
    txt1[i] = ' '.join([x for x in nltk.word_tokenize(line)
                        if (x not in stop_words) and (x not in your_list)])

 
# Retrieving trigrams
vectorizer = CountVectorizer(ngram_range=(3, 3))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.2
print("Features:", features)
print("X1:", X1.toarray())

 
# Using TFIDF
vectorizer = TfidfVectorizer(ngram_range=(3, 3))
X2 = vectorizer.fit_transform(txt1)
scores = X2.toarray()
print("Scores:", scores)

  
# Getting the top ranked features
sums = X2.sum(axis=0)
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns=['term', 'rank'])
words = ranking.sort_values('rank', ascending=False)
print("Words head:", words.head(7))

Output:

 Features: ['10 experience working', '11 exposure implementing', 'able work minimal', 'accounts commerce added', 'analysis recognition face', 'analytics contextual image', 'analytics nlp ensemble', 'applying data science', 'bagging boosting text', 'beyond existing learn', 'boosting text analytics', 'building using logistics', 'building using supervised', 'classification facial expression', 'classifier deep learning', 'commerce added advantage', 'complex engineering analysis', 'contextual image processing', 'creative projects work', 'data science problem', 'data science solutions', 'decisions report progress', 'deep learning analytics', 'deep learning framework', 'deep learning neural', 'demonstrated development role', 'demonstrated leadership role', 'description machine learning', 'detection tracking classification', 'development role machine', 'direction project less', 'domains essential position', 'domains like healthcare', 'ensemble classifier deep', 'existing learn quickly', 'experience object detection', 'experience working multiple', 'experienced technical personnel', 'expertise visualizing manipulating', 'exposure implementing data', 'expression analysis recognition', 'extensively worked python', 'face iris finger', 'facial expression analysis', 'finance accounts commerce', 'forest bagging boosting', 'framework tensorflow keras', 'good oral written', 'guidance direction project', 'guidance make decisions', 'healthcare finance accounts', 'implementing data science', 'including provide guidance', 'innovative creative projects', 'iris finger gesture', 'job description machine', 'keras or pytorch', 'leadership role projects', 'learn quickly new', 'learning analytics contextual', 'learning framework tensorflow', 'learning neural networks', 'learning projects including', 'less experienced technical', 'like healthcare finance', 'linear regression svm', 'logistics regression linear', 'machine learning developer', 'machine learning projects', 'make decisions report', 'manipulating big datasets', 'minimal guidance make', 'model building using', 'motivated able work', 'multiple domains like', 'must self motivated', 'new domains essential', 'nlp ensemble classifier', 'object detection tracking', 'oral written communication', 'perform complex engineering', 'problem solving proven', 'problem statements bring', 'proficiency deep learning', 'proficiency problem solving', 'project less experienced', 'projects including provide', 'projects work spare', 'proven perform complex', 'proven record working', 'provide guidance direction', 'quickly new domains', 'random forest bagging', 'recognition face iris', 'record working innovative', 'regression linear regression', 'regression svm random', 'role machine learning', 'role projects including', 'science problem statements', 'science solutions production', 'self motivated able', 'solutions production environments', 'solving proven perform', 'spare time plus', 'statements bring insights', 'supervised unsupervised algorithms', 'svm random forest', 'tensorflow keras or', 'text analytics nlp', 'tracking classification facial', 'using logistics regression', 'using supervised unsupervised', 'visualizing manipulating big', 'work minimal guidance', 'work spare time', 'working innovative creative', 'working multiple domains']

 X1: [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 Scores: [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 Words head:
                             term      rank
 41      extensively worked python  1.000000
 79     oral written communication  0.707107
 47              good oral written  0.707107
 72           model building using  0.673502
 27   description machine learning  0.577350
 70      manipulating big datasets  0.577350
 67     machine learning developer  0.577350
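
To see what these TF-IDF scores mean at a small, readable scale, here is a minimal self-contained sketch of the same sum-and-rank step on a toy corpus (the sentences are invented for illustration, not taken from the article's job-description file):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy corpus, for illustration only
docs = ["machine learning developer needed",
        "deep learning developer needed",
        "machine learning projects in production"]

vectorizer = TfidfVectorizer(ngram_range=(3, 3))
X = vectorizer.fit_transform(docs)

# Sum each trigram's TF-IDF score over all documents and rank,
# exactly as done for the real file above
sums = X.sum(axis=0)
terms = vectorizer.get_feature_names_out()
ranking = pd.DataFrame(list(zip(terms, sums.tolist()[0])),
                       columns=['term', 'rank'])
print(ranking.sort_values('rank', ascending=False))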

Now, if we do this for bigrams, the initial part of the code remains the same; only the n-gram range passed to the vectorizers changes.
Code: Python code to implement bigrams

# Retrieving bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.2
print("X1:", X1.toarray())

 
# Applying TFIDF
# You can still get the n-grams here
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X2 = vectorizer.fit_transform(txt1)
scores = X2.toarray()
print("Scores:", scores)

  
# Getting the top ranked features
sums = X2.sum(axis=0)
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns=['term', 'rank'])
words = ranking.sort_values('rank', ascending=False)
print("Words:", words.head(7))

Output:

 X1: [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 Scores: [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 Words:
                      term      rank
 50    great interpersonal  1.000000
 110      skills abilities  1.000000
 23         deep learning   0.904954
 72       machine learning  0.723725
 21           data science  0.723724
 128         worked python  0.707107
 42     extensively worked  0.707107

Similarly, we can get TF-IDF scores for bigrams and trigrams as our use case requires. This can help us get better results without having to process more data.
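
For instance, bigrams and trigrams can even be extracted in a single pass by widening the range. A minimal sketch (txt1 is the pre-processed corpus from above):

from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(2, 3) keeps every bigram AND every trigram in one vocabulary
vectorizer = TfidfVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(txt1)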
