Split string into columns using regex in pandas DataFrame

Method #1:
In this method we will use re.search(pattern, string, flags=0). Here, pattern refers to the pattern we want to find. It is a string that can contain the following special sequences:

  • \w matches alphanumeric characters and the underscore
  • \d matches digits, which means 0-9
  • \s matches whitespace characters
  • \S matches non-whitespace characters
  • . matches any character other than the newline character \n
  • * matches 0 or more instances of the preceding pattern
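
As a minimal sketch of how these sequences behave, the snippet below runs re.search on a short sample string. The sample value is our own illustration and is not part of the movie data used later. Raw strings (r"...") keep Python from interpreting the backslashes before the regex engine sees them.

```python
import re

# Illustrative sample string (an assumption, not taken from the data set below)
sample = "Year: 1972\nnext line"

word = re.search(r"\w*", sample).group()         # leading run of word characters
digits = re.search(r"\d\d\d\d", sample).group()  # four consecutive digits
space = re.search(r"\s", sample).group()         # first whitespace character
rest = re.search(r"Year:.*", sample).group()     # . does not cross the newline

print(repr(word), repr(digits), repr(space), repr(rest))
```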

# Import pandas and the regex library
import pandas as pd
import re

# Create a list with all lines
movie_data = ["Name: The_Godfather Year: 1972 Rating: 9.2",
              "Name: Bird_Box Year: 2018 Rating: 6.8",
              "Name: Fight_Club Year: 1999 Rating: 8.8 "]

# Create a dictionary with the required columns
# Used later to convert to DataFrame
movies = {"Name": [], "Year": [], "Rating": []}


for item in movie_data:

    # For the name field
    name_field = re.search(r"Name:.*", item)
    if name_field is not None:
        name = re.search(r"\w*\s\w*", name_field.group())
    else:
        name = None
    # Store None for a missing field instead of calling .group() on None
    movies["Name"].append(name.group().strip() if name is not None else None)

    # For the year field
    year_field = re.search(r"Year:.*", item)
    if year_field is not None:
        year = re.search(r"\s\d\d\d\d", year_field.group())
    else:
        year = None
    movies["Year"].append(year.group().strip() if year is not None else None)

    # For the rating field
    rating_field = re.search(r"Rating:.*", item)
    if rating_field is not None:
        rating = re.search(r"\s\d\.\d", rating_field.group())
    else:
        rating = None
    movies["Rating"].append(rating.group().strip() if rating is not None else None)

  
# Create DataFrame
df = pd.DataFrame(movies)
print(df)

Output:

Explanation:

  • In the above code, we use a for loop to iterate over the movie data so that we can work with each movie in turn. We create a movies dictionary that stores the details of every movie: its name, year and rating.
  • We then find the complete Name field using re.search(). The . means any character except \n, and * extends the match to the end of the line. We assign the result to the variable name_field.
  • But the data isn't always straightforward and may contain surprises. For example, what if there is no Name: field? re.search() would return None, and calling group() on it would raise an error and stop the script. We anticipate this and check for None.
  • We again use re.search() to extract the required value from name_field. For the name, the pattern is \w* for a word, \s for a whitespace character, and \w* for another word. Because "Name" is followed by a colon rather than whitespace, the first match found begins with the space after the colon and continues through the title.
  • We do the same for the year and the rating to fill in the rest of the dictionary.
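
The search-then-check-for-None steps described above can be condensed into a small helper. This is only a sketch: the function name extract_field and its capture-group pattern are our own assumptions for illustration, not part of the original script.

```python
import re

def extract_field(label, value_pattern, line):
    """Return the value that follows 'label:' in line, or None if absent.

    Hypothetical helper for illustration; not part of the original script.
    """
    match = re.search(rf"{re.escape(label)}:\s*({value_pattern})", line)
    # re.search returns None when nothing matches, so guard before .group()
    return match.group(1) if match is not None else None

line = "Name: The_Godfather Year: 1972 Rating: 9.2"
print(extract_field("Name", r"\w+", line))
print(extract_field("Year", r"\d{4}", line))
print(extract_field("Rating", r"\d\.\d", line))
print(extract_field("Genre", r"\w+", line))  # no such field in the line
```

Using a capture group means the label and the space after the colon are never part of the returned value, so no strip() call is needed.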

Method #2:
To split a string, we will use Series.str.extract(pat, flags=0, expand=True). Here, pat refers to the pattern we want to find. Each capture group in pat becomes a column of the result.

import pandas as pd

# Named data rather than dict to avoid shadowing the built-in
data = {"movie_data": ["The Godfather 1972 9.2",
                       "Bird Box 2018 6.8",
                       "Fight Club 1999 8.8"]}

# Convert dictionary to DataFrame
df = pd.DataFrame(data)

# Extract name from string
df["Name"] = df["movie_data"].str.extract(r"(\w*\s\w*)", expand=True)

# Extract year from string
df["Year"] = df["movie_data"].str.extract(r"(\d\d\d\d)", expand=True)

# Extract rating from string
df["Rating"] = df["movie_data"].str.extract(r"(\d\.\d)", expand=True)

print(df)

Output:
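
As a possible variation, str.extract can fill several columns in one call when the pattern contains named groups, since each group name becomes a column name. The pattern below is our own assumption about the layout of the sample strings (a two-word title, a four-digit year, then a rating); it is not taken from the original code.

```python
import pandas as pd

df = pd.DataFrame({"movie_data": ["The Godfather 1972 9.2",
                                  "Bird Box 2018 6.8",
                                  "Fight Club 1999 8.8"]})

# One named group per column; assumes "<two-word title> <4-digit year> <rating>"
pattern = r"(?P<Name>\w+\s\w+)\s(?P<Year>\d{4})\s(?P<Rating>\d\.\d)"
parts = df["movie_data"].str.extract(pattern)

# Attach the extracted columns to the original frame
result = df.join(parts)
print(result)
```

Rows that do not fit the assumed layout would get NaN in all three columns, so a stricter or looser pattern may be needed for real data.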