
Pattern matching in Python with Regex

You may be familiar with finding text by pressing Ctrl-F and entering the words you are looking for. Regular expressions go a step further: they allow you to specify a pattern of text to search for.
Regular expressions, called regexes for short, are descriptions of a pattern of text. For example, \d in a regular expression stands for a digit character, that is, any single digit from 0 to 9.

  • The following regex is used in Python to match a string of three digits, a hyphen, three more digits, another hyphen, and four digits.
     Any other string would not match the pattern. \d\d\d-\d\d\d-\d\d\d\d
  • Regular expressions can be much more sophisticated. For example, adding 3 in curly braces ({3}) after a pattern is like saying, "Match this pattern three times." So the slightly shorter regex is
     \d{3}-\d{3}-\d{4}

    (This also matches the correct phone number format.)

  • Create Regex Object

    All regular expression functions in Python are in the re module

     import re 

    To create a Regex object that matches the phone number pattern, enter the following into the interactive shell.

     phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

    Now the variable phoneNumRegex contains a Regex object.
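
As a quick check that the shorter {3} form behaves like the longer pattern, here is a minimal sketch. The names long_form, short_form, and samples are illustrative and not from the article, and fullmatch() is used here only so that the whole string has to fit the pattern.

```python
import re

# Two spellings of the same phone-number pattern.
long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

# Hypothetical sample strings chosen for illustration.
samples = ['415-555-4242', '415-555-424', '4155554242', 'abc-def-ghij']

# For each sample, record whether each pattern accepts the whole string.
results = [
    (s, bool(long_form.fullmatch(s)), bool(short_form.fullmatch(s)))
    for s in samples
]

for text, long_ok, short_ok in results:
    print(text, long_ok, short_ok)
```

On these samples both patterns should agree on every string, accepting only the first one.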

    Matching regular expressions

    The search() method of a Regex object searches the string it is passed for any matches to the regex. It returns None if the pattern is not found in the string; otherwise it returns a Match object. Match objects have a group() method that returns the actual matched text from the searched string.

    # Python program to illustrate
    # matching with a Regex object

    import re

    phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

    mo = phoneNumRegex.search('My number is 415-555-4242.')

    print('Phone number found: ' + mo.group())

    Output:

     Phone number found: 415-555-4242 
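
Because search() returns None when nothing matches, calling group() on the result without checking would raise an AttributeError. A minimal sketch of guarding against that follows; find_phone is a hypothetical helper name introduced only for this example.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')


def find_phone(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phoneNumRegex.search(text)
    # search() gives back None when the pattern is absent,
    # so check before calling group().
    if mo is None:
        return None
    return mo.group()


print(find_phone('My number is 415-555-4242.'))
print(find_phone('No number here.'))
```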

    Regular Expression Matching Steps

    While there are several steps in Python for using regular expressions, each step is pretty simple.

    1. Import the regex module using import re.
    2. Create a Regex object using the re.compile() function. (Remember to use a raw string.)
    3. Pass the string you want to search to the search() method of the Regex object. This returns a Match object.
    4. Call the group() method of the Match object to return a string of the actual matched text.

    Grouping with parentheses

      1. Matching objects: Let's say you want to separate the area code from the rest of the phone number. Adding parentheses creates groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). You can then use the group() method of the Match object to get the matching text from just one group.

        # Python program to illustrate
        # matching with a Regex object
        # with grouping

        import re

        phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

        mo = phoneNumRegex.search('My number is 415-555-4242.')

        print(mo.group(1))

        OUTPUT:

         415
      2. Get all groups at once: If you want to retrieve all the groups at the same time, use the groups() method (note the plural form of the name).

        # Python program to illustrate
        # matching with a Regex object
        # with groups()

        import re

        phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

        mo = phoneNumRegex.search('My number is 415-555-4242.')

        print(mo.groups())

        OUTPUT:

         ('415', '555-4242')
      3. Using mo.groups(): Since mo.groups() returns a tuple of multiple values, you can use the multiple-assignment trick to assign each value to a separate variable, as in the line areaCode, mainNumber = mo.groups().

        # Python program to illustrate
        # matching with a Regex object
        # with mo.groups()

        import re

        phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

        mo = phoneNumRegex.search('My number is 415-555-4242.')

        areaCode, mainNumber = mo.groups()

        print(mainNumber)

        OUTPUT:

         555-4242
      4. Matching parentheses: Parentheses have a special meaning in regular expressions, but what if you need to match a parenthesis in your text? For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the ( and ) characters with a backslash. Enter the following into the interactive shell:

        # Python program to illustrate
        # matching with a Regex object
        # with escaped parentheses

        import re

        phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')

        mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

        print(mo.group(1))

        OUTPUT:

         (415)

        The \( and \) escape sequences in the raw string passed to re.compile() will match actual parenthesis characters.
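
The grouping steps above can be sketched together in one place. This is only an illustration: the variable names whole, area, main, both, and parenRegex are assumptions made for this example. It also shows that group(0) returns the same text as group(), and that placing the escaped parentheses outside a group captures the area code without them.

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

whole = mo.group(0)   # the entire match, same as mo.group()
area = mo.group(1)    # text matched by the first pair of parentheses
main = mo.group(2)    # text matched by the second pair of parentheses
both = mo.groups()    # all groups at once, as a tuple

# Escaped parentheses match literal ( and ) characters; the unescaped
# pair inside them still creates a group around the three digits only.
parenRegex = re.compile(r'\((\d\d\d)\) (\d\d\d-\d\d\d\d)')
paren_area = parenRegex.search('My phone number is (415) 555-4242.').group(1)

print(whole, area, main, both, paren_area)
```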

      Matching multiple groups with the pipe

      The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

      When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object. Enter the following into the interactive shell:

      # Python program to illustrate
      # matching with a Regex object
      # with multiple alternatives and a pipe

      import re

      heroRegex = re.compile(r'Batman|Tina Fey')

      mo1 = heroRegex.search('Batman and Tina Fey.')

      print(mo1.group())

      OUTPUT:

       Batman
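
To illustrate that it is the position in the searched string, not the order of alternatives in the pattern, that decides which text is returned, here is a small sketch. The second sentence and the names first and second are made up for this example.

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')

# The same pattern, searched against two strings with the names swapped.
first = heroRegex.search('Batman and Tina Fey.').group()
second = heroRegex.search('Tina Fey and Batman.').group()

print(first)
print(second)
```

In the second string 'Tina Fey' appears earlier, so that is the text the Match object holds.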

      Matching specific repetitions with curly braces

      If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly braces. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

      Instead of a single number, you can specify a range by writing a minimum, a comma, and a maximum between the curly braces, with no spaces. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

      You can also leave out the first or second number in the curly braces to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly braces can help make your regular expressions shorter. These two regular expressions match the same patterns:

       (Ha){3}
       (Ha)(Ha)(Ha)

      And these two regular expressions also match the same patterns:

       (Ha){3,5}
       ((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

      Enter the following into the interactive shell:

      # Python program to illustrate
      # matching specific repetitions
      # with curly braces

      import re

      haRegex = re.compile(r'(Ha){3}')

      mo1 = haRegex.search('HaHaHa')

      print(mo1.group())

      OUTPUT:

       HaHaHa

      # Python program to illustrate
      # matching specific repetitions
      # with curly braces

      import re

      haRegex = re.compile(r'(Ha){3}')

      mo2 = haRegex.search('Ha') is None

      print(mo2)

      OUTPUT:

       True

      Here, (Ha){3} matches 'HaHaHa' but not 'Ha'. Since it does not match 'Ha', search() returns None.
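
The range forms of the curly braces can be tried out in the same way. The following is a hedged sketch with sample strings and variable names chosen for illustration; it relies on the fact that, by default, Python takes as many repetitions as the range allows.

```python
import re

six = 'HaHaHaHaHaHa'  # six repetitions of 'Ha'

# {3,5}: at least three, at most five repetitions.
range_match = re.search(r'(Ha){3,5}', six).group()

# {3,}: three or more repetitions, with no upper limit.
open_ended = re.search(r'(Ha){3,}', six).group()

# {,5}: zero to five repetitions.
up_to_five = re.search(r'(Ha){,5}', 'HaHa').group()

# Two repetitions are too few for {3,5}, so search() returns None.
too_few = re.search(r'(Ha){3,5}', 'HaHa')

print(range_match, open_ended, up_to_five, too_few)
```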

      Optional matching with the question mark

      Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character marks the group that precedes it as an optional part of the pattern. For example, enter the following into the interactive shell:

      # Python program to illustrate
      # optional matching
      # with the question mark (?)

      import re

      batRegex = re.compile(r'Bat(wo)?man')

      mo1 = batRegex.search('The Adventures of Batman')

      print(mo1.group())

      OUTPUT:

       Batman

      # Python program to illustrate
      # optional matching
      # with the question mark (?)

      import re

      batRegex = re.compile(r'Bat(wo)?man')

      mo2 = batRegex.search('The Adventures of Batwoman')

      print(mo2.group())

      OUTPUT:

       Batwoman

      The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.
      You can think of the ? as saying, "Match zero or one of the group preceding this question mark."
      If you need to match an actual question mark character, escape it with \?.
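
As a further illustration, the optional group can be applied to the phone number pattern so that the area code becomes optional. This is a sketch only: the pattern, the sample sentences, and the variable names are assumptions for this example, not part of the article's code.

```python
import re

# The area code and its hyphen form an optional group.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_area = phoneRegex.search('My number is 415-555-4242').group()
without_area = phoneRegex.search('My number is 555-4242').group()

# An escaped \? matches a literal question mark.
literal_q = re.search(r'Who\?', 'Who? Me.').group()

print(with_area, without_area, literal_q)
```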

      Match zero or more with a star

      The * (called the star or asterisk) means "match zero or more": the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again. Let's look at the Batman example again.

      # Python program to illustrate
      # matching zero or more
      # with the star (*)

      import re

      batRegex = re.compile(r'Bat(wo)*man')

      mo1 = batRegex.search('The Adventures of Batman')

      print(mo1.group())

      OUTPUT:

       Batman

      # Python program to illustrate
      # matching zero or more
      # with the star (*)

      import re

      batRegex = re.compile(r'Bat(wo)*man')

      mo2 = batRegex.search('The Adventures of Batwoman')

      print(mo2.group())

      OUTPUT:

       Batwoman

      # Python program to illustrate
      # matching zero or more
      # with the star (*)

      import re

      batRegex = re.compile(r'Bat(wo)*man')

      mo3 = batRegex.search('The Adventures of Batwowowowoman')

      print(mo3.group())

      OUTPUT:

       Batwowowowoman

      For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for 'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)* matches four instances of wo.

      If you need to match an actual star character, prefix the star in the regular expression with a backslash: \*.
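
The difference between the escaped and the unescaped star can be seen in a short sketch. The sample sentence and the names escaped and unescaped are invented for this example.

```python
import re

text = 'Compute 3*4 first.'

# \* matches a literal asterisk between two digits.
escaped = re.search(r'\d\*\d', text).group()

# Without the backslash, * is a quantifier: "zero or more digits,
# followed by one digit", so only the first digit is matched here.
unescaped = re.search(r'\d*\d', text).group()

print(escaped)
print(unescaped)
```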

      Match one or more with a plus

      While * means "match zero or more," the + (or plus) means "match one or more." Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Enter the following into the interactive shell and compare it with the star regexes in the previous section:

      # Python program to illustrate
      # matching one or more
      # with the plus (+)

      import re

      batRegex = re.compile(r'Bat(wo)+man')

      mo1 = batRegex.search('The Adventures of Batwoman')

      print(mo1.group())

      OUTPUT:

       Batwoman

      # Python program to illustrate
      # matching one or more
      # with the plus (+)

      import re

      batRegex = re.compile(r'Bat(wo)+man')

      mo2 = batRegex.search('The Adventures of Batwowowowoman')

      print(mo2.group())

      OUTPUT:

       Batwowowowoman
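
For a side-by-side view of the three quantifiers, the following sketch runs the ?, * and + variants of the Batman pattern against the same three names. The dictionary layout and the names patterns, names, and table are illustrative choices, not part of the article's code.

```python
import re

# One compiled pattern per quantifier.
patterns = {
    '?': re.compile(r'Bat(wo)?man'),
    '*': re.compile(r'Bat(wo)*man'),
    '+': re.compile(r'Bat(wo)+man'),
}
names = ['Batman', 'Batwoman', 'Batwowowowoman']

# For each quantifier, record whether search() finds a match in each name.
table = {
    symbol: [regex.search(name) is not None for name in names]
    for symbol, regex in patterns.items()
}

for symbol, row in table.items():
    print(symbol, row)
```

Only the star accepts all three names: the question mark allows at most one wo, and the plus requires at least one.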

      # Python program to illustrate
      # matching one or more
      # with the plus (+)

      import re

      batRegex = re.compile(r'Bat(wo)+man')

      mo3 = batRegex.search('The Adventures of Batman') is None

      print(mo3)

      OUTPUT:

       True

      The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because the plus sign requires at least one wo.

      If you need to match an actual plus sign character, prefix the plus sign with a backslash to escape it: \+.

      This article was contributed by Shubham Machal.