Welcome everyone to this lecture on regular expressions.
Let's imagine that you needed to search a string for a term, such as the string "phone".
Well, you can easily use the in operator to do this.
You could just ask: is the string "phone" in this larger string, such as "Is the phone here?", and it will return back True with Python.
Now imagine you had to find a telephone number, such as 408-555-1234.
Again, you could use the same sort of operation: you would say the string you were looking for, then in, and then the larger string you wanted to search, and that would return back True as well.
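As a minimal sketch of those two checks: the variable names and the second sentence are made up for illustration, only the in operator itself comes from the lecture.

```python
# Plain substring checks with the "in" operator (no regular expressions yet).
sentence = "Is the phone here?"                 # illustrative sentence
contact = "You can reach me at 408-555-1234."   # illustrative sentence

has_word = "phone" in sentence
has_number = "408-555-1234" in contact
has_other_number = "408-555-0000" in contact

print(has_word)          # True
print(has_number)        # True
print(has_other_number)  # False: "in" only finds the exact characters given
```

The last check is the limitation the lecture turns to next: in only works when you already know the exact text.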
But here's the problem.
What if you actually don't know the exact phone number you're looking for, and you just know the general format of a phone number? Or what if you want to find every phone number or every e-mail address in a specific document? You know that certain types of objects always hold the same format, like dates, e-mail addresses, addresses, or phone numbers.
What we need is a regular expression, in order to search for the pattern of what we're looking for instead of any specific sequence. Regular expressions allow for pattern searching in a text document.
Now the syntax for regular expressions can be really intimidating at first.
So here is the actual code for the regular expression for searching for a phone number.
And if you take a look at that, it's kind of unreadable if you don't know regular expressions.
Hopefully after this lecture you'll be able to build an understanding of what that actually means.
So the key thing to keep in mind is that every character type has a corresponding pattern code.
For example, digits have the placeholder pattern code of backslash d (\d).
The use of the backslash allows Python to understand that this is a special code and not the letter d.
You can then combine this with other special character codes in order to find any sort of pattern you can think of.
So if we take a look back at this example, all this is actually saying is: find three digits, then a dash, then another set of three digits, then another dash, and so forth.
That is the format of a phone number.
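The exact pattern shown on the slide is not captured in this transcript, so the following is only a sketch of what "three digits, a dash, three digits, a dash, four digits" could look like. The pattern string and the candidate strings are assumptions for illustration.

```python
import re

# Assumed pattern: \d stands for "any digit", the dashes are literal dashes.
phone_pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

# Illustrative candidates: one real-looking number, one made of the letter d,
# and one with the dashes missing.
candidates = ["408-555-1234", "ddd-ddd-dddd", "4085551234"]

# re.fullmatch succeeds only if the whole string fits the pattern.
results = {c: bool(re.fullmatch(phone_pattern, c)) for c in candidates}
print(results)
# {'408-555-1234': True, 'ddd-ddd-dddd': False, '4085551234': False}
```

The second candidate fails because \d means "a digit", not the letter d.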
So let's learn how to create our own regular expressions for finding patterns in text by hopping over to a Jupyter notebook.
Let's begin by understanding how to use the actual regular expression library that's built into Python.
First I'm going to make some text that we can search, such as "The phone number of the agent is 408-555-1234", and I'll say "Call soon".
And then there's a period immediately after that phone number.
So what I could do is ask, is "phone" in text, and that returns True.
I can check for strings inside of another string, and if I happen to know the actual phone number I'm looking for, I could, as I previously mentioned, just run:
Is this number in here?
What I still don't know how to do, and where regular expressions come in, is search for this:
Is there any phone number with this particular format inside of this text?
So let me show you the format for actually running a regular expression, and then we'll go over some common pattern designs.
So we're going to say import re.
And that's the regular expression library that's built into Python.
No need to download anything for that one.
And then the pattern we're going to be looking for, we'll just say it's the string "phone". Then we use the regular expression library: you say re.search, and then you pass in your pattern and the text you want to search, and then you get back this match object.
And there are lots of different methods and attributes that the match object has.
So for example, if I call it my match and set that equal to a regular expression search of a particular pattern in my text, I can then ask for information off match, such as the span, and it goes from 4 to 9.
So what that actually means is that the match starts at index 4 of this string.
So 0, 1, 2, 3, 4: we can see that's the "p" for this pattern.
That's "phone".
And then it goes up to index 9, which is the position just past the end of the match.
So we can see here it actually located the span of that actual pattern.
And then we can ask for particular things.
And if you hit Tab you'll see quite a lot of attributes, but the more common ones are asking for the start of the match or the end of the match.
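The steps above can be sketched as follows. The exact wording of the text is reconstructed from the spoken description, so treat it as an assumption; the calls (re.search, span, start, end) are the ones named in the lecture.

```python
import re

# Text reconstructed from the lecture's description (exact wording assumed).
text = "The phone number of the agent is 408-555-1234. Call soon!"
pattern = "phone"

# re.search returns a match object for the first occurrence,
# or None if the pattern is not found at all.
match = re.search(pattern, text)

print(match.span())   # (4, 9)
print(match.start())  # 4
print(match.end())    # 9

# Slicing the text with start/end gives back the matched characters.
print(text[match.start():match.end()])  # phone
```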
Now, what you'll notice is what happens if the pattern occurs more than one time.
So if we say "my phone is a new phone".
So I redefine text there.
Let's go ahead and make another match and say re.search for the pattern "phone" inside of this text, and then ask for the span.
Unfortunately, if you just say re.search, it's only going to find the first instance, which happens to start at index 3.
It's not going to tell you that there are two of them.
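A small sketch of that behavior, assuming the redefined text is written in lowercase as below:

```python
import re

# "phone" appears twice in this text.
text = "my phone is a new phone"

match = re.search("phone", text)

# Only the first occurrence is reported; the second one is ignored.
print(match.span())  # (3, 8)
```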
If you want to find all the matches rather than just the first instance, the correct code is re.findall.
And again, you pass in whatever pattern you're looking for.
So again, the pattern here is just the string "phone".
And then we search the updated text, and we can say all matches is equal to this.
And if you check out all matches, it's going to return back the matched strings in a list.
But what you could do is check the length of this list, and it'll tell you it has two matches.
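In code, that could look like this (the variable name all_matches is just a guess at what was typed in the notebook):

```python
import re

text = "my phone is a new phone"

# re.findall returns a list of the matched strings, not match objects.
all_matches = re.findall("phone", text)

print(all_matches)       # ['phone', 'phone']
print(len(all_matches))  # 2
```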
Now if you actually want to get match objects instead of a list of the matched strings, which is very useful, you can say re.finditer and then pass in whatever pattern you're looking for, in our case it's "phone".
Then you pass in whatever text you want to search, and then you can actually iterate through that.
So you could say for match in re.finditer, and go ahead and print the span of that particular match.
So this is essentially returning an iterator over all those match objects, which were the specialized objects that we saw up here, which had various attributes such as span, start, and end. So commonly you're going to see us use re.findall if we expect there to be more than one match in the actual document.
But re.search is going to return back the very first match, which is good in case you just want to know if it actually exists.
In that case you only care about one instance.
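Here is a short sketch of iterating over match objects with re.finditer, using the same two-phone text. Collecting the spans into a list is an addition for illustration.

```python
import re

text = "my phone is a new phone"

# re.finditer yields one match object per occurrence, in order.
spans = []
for match in re.finditer("phone", text):
    print(match.span())
    spans.append(match.span())

# spans now holds (3, 8) and (18, 23)
```

If you only need to know whether the pattern exists at all, re.search is enough, since it stops at the first match.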
Now let's go over some patterns.
We learned how to search for a basic string, but we actually want to search for generalized patterns, not just specific strings, because we may not know what specific string or sequence of letters we're looking for.
Instead we just know a general pattern, such as the general pattern of an e-mail address or a phone number.
And before we continue, I'd highly recommend that you check out the notebook underneath the Python Text Basics folder.
If you open up the regular expressions notebook, there are tables in there that go over a lot of what we're discussing, because usually people don't memorize the patterns unless they use regular expressions all the time.
If you scroll down, we've provided easy-to-use tables that list the specialized character, the description, an example pattern code, and an example match.
And if you keep going down you'll see more for quantifiers, which essentially specify different amounts, and then another one for groups, as well as a couple of examples here.
So the first thing we're going to talk about, and we'll hold a little bit on this part of the notebook, is underneath Patterns.
We're now going to look at identifiers for characters in patterns.
So there are various characters, and notice that there's a backslash on all of these, basically telling Python: hey, I don't mean the actual letter d.
I mean this specialized character \d, where d is for a digit.
And so you can see here an example pattern code could be f, i, l, e, underscore.
So I'm looking for file underscore, and then I just say backslash d backslash d (file_\d\d), which essentially means any two digits.
So an example match here could be file_25 or file_43 and so on: basically any string that starts with file underscore and then two digits.
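To make the file underscore example concrete, here is a small sketch. The list of names is invented for illustration; only file_25 and file_43 come from the lecture.

```python
import re

# "file_" followed by exactly two digit characters.
pattern = r"file_\d\d"

# Illustrative names: two that fit, one with letters, one with a single digit.
names = ["file_25", "file_43", "file_ab", "file_7"]

matched = [name for name in names if re.search(pattern, name)]
print(matched)  # ['file_25', 'file_43']
```

Note that re.search looks for the pattern anywhere in the string, so a longer name such as file_123 would also match on its first two digits.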
Hopefully you can see this is already extremely useful.
We also have alphanumerics (\w), which is any letter or any digit.
So in case you're looking for any series of letters and numbers, you can use an alphanumeric.
There's also whitespace (\s), and there's a non-digit.
So capital D (\D) stands for a non-digit, which is often a letter, although it matches anything that is not a digit.
There's also non-alphanumeric (\W), which is kind of just punctuation, and then there's also non-whitespace (\S), anything that's not whitespace.
So these are general characters and possible identifiers that you're going to be using.
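One way to see what each identifier picks up is to run all six against the same short string. The sample string here is made up for illustration.

```python
import re

sample = "ab 12!"  # two letters, a space, two digits, one punctuation mark

digits = re.findall(r"\d", sample)            # digit characters
alphanumerics = re.findall(r"\w", sample)     # letters and digits
whitespace = re.findall(r"\s", sample)        # spaces, tabs, newlines
non_digits = re.findall(r"\D", sample)        # anything that is not a digit
non_alphanumerics = re.findall(r"\W", sample) # anything \w does not match
non_whitespace = re.findall(r"\S", sample)    # anything that is not whitespace

print(digits)             # ['1', '2']
print(alphanumerics)      # ['a', 'b', '1', '2']
print(whitespace)         # [' ']
print(non_digits)         # ['a', 'b', ' ', '!']
print(non_alphanumerics)  # [' ', '!']
print(non_whitespace)     # ['a', 'b', '1', '2', '!']
```

The output for \D shows that a non-digit is not limited to letters: the space and the exclamation mark count too.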
So let's go over a few examples of actually running these in order to search for a generalized pattern.
So if we come back up here to the notebook, so far we've only been searching for strings that we know, such as searching for the pattern "phone".
But now let's go ahead and generalize this to search for something using these identifiers. So right now, if we check our text, let's make sure we're using the right one.
We have "My telephone number is 777-555-1234".
So I know I'm looking for a phone number, but I don't know what phone number I'm looking for.
So I need to say I'm looking for three digits, a dash, another set of three digits, another dash, and then a set of four digits.
So the way I can do that is by defining my pattern and setting that equal to a string, and in order for Python to understand that it's a specific regular expression pattern string,
you start off with an r outside of the quotes, which makes it a raw string so the backslashes are passed through as written, and then you pass the pattern you're looking for.
So I have backslash d three times, then a dash, backslash d again another three times, then another dash, and then backslash d again four times.
So now we have a generalized pattern that looks for three digits, a dash, another set of three digits, a dash, and then four digits.
So I can say now phone number is equal to re.search, and then search for the pattern inside of that text. What this does is return back my actual match object, and it tells me where in the actual document it found it, as well as the actual match it found.
And if you just want the actual match itself, you can say .group() and it will group together everything that it found to be matching.
So in this case it's this phone number here.
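Put together, the search just described could be sketched like this. The variable names follow the spoken description and are assumed rather than copied from the notebook.

```python
import re

text = "My telephone number is 777-555-1234"

# Raw string (r prefix) so Python leaves the backslashes alone.
# Three digits, a dash, three digits, a dash, four digits.
pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

phone_number = re.search(pattern, text)

print(phone_number.span())   # (23, 35)
print(phone_number.group())  # 777-555-1234
```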
Keep in mind what happens if I were to change my text. So I'll say text is equal to, and I'm just going to copy and paste this.
Let's go ahead and change this to be another set of numbers.
This expression is just looking for a generalized pattern.
So it's actually still able to find it.
Again, I'm just using re.search.
And if you wanted to find multiple phone numbers here, you could say re.findall.
And it would be able to return back all the matches.
If you want to iterate through those matches, you can use re.finditer.
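As a sketch of the multiple-number case: the text below, with its two phone numbers, is invented for illustration, since the lecture does not spell out the replacement text.

```python
import re

pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

# Hypothetical text containing two different phone numbers.
text = "Call 408-555-1234 or 777-555-9876 after noon"

# findall gives the matched strings; finditer gives match objects.
numbers = re.findall(pattern, text)
spans = [match.span() for match in re.finditer(pattern, text)]

print(numbers)  # ['408-555-1234', '777-555-9876']
print(spans)    # [(5, 17), (21, 33)]
```

The same pattern finds both numbers even though neither was known in advance, which is the point of searching by format.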
The main idea is that, utilizing these sorts of tables, you can then create any sort of pattern you can think of: digits, alphanumerics, whitespace, non-digits.
Now, because it's really common to see a bunch of alphanumerics in a row or a bunch of digits in a row, it's kind of annoying to have to type backslash d a bunch of times.
So the next set of patterns we're going to learn about are quantifiers.
So we know the special character designations.
We can now use them along with quantifiers to define how many we should expect.
So you put in your special character code and then, in curly braces,
how many times you want it to occur.
So this says any non-digit, capital D, that occurs exactly three times (\D{3}).
So something like "abc". And then we can say occurs two to four times.
So you can give actual ranges by saying a number for the low end, a comma, and a number for the high end, such as {2,4}. You can also say a number, a comma, and then no other number for something that occurs X or more times, like {3,} for three or more times.
An asterisk (*) is not too common to use, because it says occurs zero or more times.
So it essentially means it's occurring either zero times or more than zero times.
So that's maybe a little too flexible for a lot of use cases.
And then a question mark (?) is once or none.
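The quantifiers just listed can be tried out like this. All of the sample strings, and the literal patterns used for the asterisk and the question mark, are made up for illustration.

```python
import re

numbers = "1 22 333 55555"  # illustrative runs of 1, 2, 3 and 5 digits

# {3}: exactly three non-digits in a row.
exactly_three = re.findall(r"\D{3}", "abc12")

# {2,4}: between two and four digits (takes as many as it can, up to four).
two_to_four = re.findall(r"\d{2,4}", numbers)

# {3,}: three or more digits.
three_or_more = re.findall(r"\d{3,}", numbers)

# *: zero or more of the preceding character ("a" followed by any number of "b").
zero_or_more = re.findall(r"ab*", "a ab abbb")

# ?: once or none (the "u" is optional).
once_or_none = re.findall(r"colou?r", "color colour")

print(exactly_three)  # ['abc']
print(two_to_four)    # ['22', '333', '5555']
print(three_or_more)  # ['333', '55555']
print(zero_or_more)   # ['a', 'ab', 'abbb']
print(once_or_none)   # ['color', 'colour']
```

In the {2,4} case the run of five digits produces "5555" and leaves a single 5 behind, which is too short to match on its own.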
OK, so let's go ahead and convert that telephone number pattern using those quantifiers.
So I'm looking for a group of three digits, a dash, three digits, a dash, and then four digits.
So a good use case for this one, as far as quantifiers go, is probably this pattern here, a set of curly braces.
And I know I'm looking for three, three, and four, so I'm going to redefine this to be the following.
I can basically write this pattern efficiently by saying r, and you can use double quotes or single quotes, then backslash d, and instead of writing it three times,
I'm just going to use my quantifier, which is curly braces three ({3}), to say I'm looking for three digits, a dash, another set of three digits, a dash, and then a set of four digits.
And if we take a look again at this, I can say re.search for the pattern in the text, and it was able to find the updated match.
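A sketch comparing the two ways of writing the pattern. The text reuses the 777 number from earlier as an assumption, because the updated text in the notebook is not spelled out in the transcript.

```python
import re

text = "My telephone number is 777-555-1234"  # assumed text for illustration

long_pattern = r"\d\d\d-\d\d\d-\d\d\d\d"  # every digit written out
short_pattern = r"\d{3}-\d{3}-\d{4}"      # same thing with quantifiers

long_match = re.search(long_pattern, text)
short_match = re.search(short_pattern, text)

print(short_match.group())                      # 777-555-1234
print(long_match.span() == short_match.span())  # True
```

Both patterns describe the same format, so they find the same match; the quantifier version is just shorter to type and easier to adjust.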
So it's usually more efficient to write something like \d{3}-\d{3}-\d{4} than to write \d\d\d-\d\d\d-\d\d\d\d.
But if you were to jump straight into these regular expressions, you can see that they can easily become very difficult to read.
But keep in mind you can always just break it down using the tables that we provide,
and you can always use .group() in case you actually want to find the actual text that matched.
So I can say again, my match here, and you have the specialized match object with a bunch of different attributes and methods here that you can explore. A really useful one is just grouping it together to find the actual match.
So those are the basics of identifiers and quantifiers.
Again, I highly recommend you check out the tables that we provide for patterns.
You can essentially build any pattern you want using identifiers, looking for specific types of characters like digits, alphanumerics, or whitespace, or their opposites: non-digits, non-alphanumerics, and non-whitespace.
And you can combine those with quantifiers, specifying exactly how many you expect or some sort of range that you expect.
Coming up next, we're going to continue our discussion by talking about groups and additional syntax such as or operators, wildcard characters, and then starts with and ends with.
We'll see you at the next lecture, where we talk about these other various regular expression topics.