Speed up millions of regex replacements in Python 3

| | |

👻 Check our latest review to choose the best laptop for Machine Learning engineers and Deep learning tasks!

I have two lists:

  • a list of about 750K "sentences" (long strings)
  • a list of about 20K "words" that I would like to delete from my 750K sentences

So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the  word-boundary metacharacter:

compiled_words = [re.compile(r"" + word + r"") for word in my20000words]

Then I loop through my "sentences":

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to using the str.replace method (which I believe is faster), but still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is > than the length of my sentence, but it"s not much of an improvement.

I"m using Python 3.5.2

👻 Read also: what is the best laptop for engineering students?

Speed up millions of regex replacements in Python 3 __del__: Questions

How can I make a time delay in Python?

5 answers

I would like to know how to put a time delay in a Python script.

2973

Answer #1

import time
time.sleep(5)   # Delays for 5 seconds. You can also use a float value.

Here is another example where something is run approximately once a minute:

import time
while True:
    print("This prints once a minute.")
    time.sleep(60) # Delay for 1 minute (60 seconds).

2973

Answer #2

You can use the sleep() function in the time module. It can take a float argument for sub-second resolution.

from time import sleep
sleep(0.1) # Time in seconds

Speed up millions of regex replacements in Python 3 __del__: Questions

How to delete a file or folder in Python?

5 answers

How do I delete a file or folder in Python?

2639

Answer #1


Path objects from the Python 3.4+ pathlib module also expose these instance methods:

2639

Answer #2


Path objects from the Python 3.4+ pathlib module also expose these instance methods:

2639

Answer #3

Python syntax to delete a file

import os
os.remove("/tmp/<file_name>.txt")

Or

import os
os.unlink("/tmp/<file_name>.txt")

Or

pathlib Library for Python version >= 3.4

file_to_rem = pathlib.Path("/tmp/<file_name>.txt")
file_to_rem.unlink()

Path.unlink(missing_ok=False)

Unlink method used to remove the file or the symbolik link.

If missing_ok is false (the default), FileNotFoundError is raised if the path does not exist.
If missing_ok is true, FileNotFoundError exceptions will be ignored (same behavior as the POSIX rm -f command).
Changed in version 3.8: The missing_ok parameter was added.

Best practice

  1. First, check whether the file or folder exists or not then only delete that file. This can be achieved in two ways :
    a. os.path.isfile("/path/to/file")
    b. Use exception handling.

EXAMPLE for os.path.isfile

#!/usr/bin/python
import os
myfile="/tmp/foo.txt"

## If file exists, delete it ##
if os.path.isfile(myfile):
    os.remove(myfile)
else:    ## Show an error ##
    print("Error: %s file not found" % myfile)

Exception Handling

#!/usr/bin/python
import os

## Get input ##
myfile= raw_input("Enter file name to delete: ")

## Try to delete the file ##
try:
    os.remove(myfile)
except OSError as e:  ## if failed, report it back to the user ##
    print ("Error: %s - %s." % (e.filename, e.strerror))

RESPECTIVE OUTPUT

Enter file name to delete : demo.txt
Error: demo.txt - No such file or directory.

Enter file name to delete : rrr.txt
Error: rrr.txt - Operation not permitted.

Enter file name to delete : foo.txt

Python syntax to delete a folder

shutil.rmtree()

Example for shutil.rmtree()

#!/usr/bin/python
import os
import sys
import shutil

# Get directory name
mydir= raw_input("Enter directory name: ")

## Try to remove tree; if failed show an error using try...except on screen
try:
    shutil.rmtree(mydir)
except OSError as e:
    print ("Error: %s - %s." % (e.filename, e.strerror))

Is there a simple way to delete a list element by value?

5 answers

I want to remove a value from a list if it exists in the list (which it may not).

a = [1, 2, 3, 4]
b = a.index(6)

del a[b]
print(a)

The above case (in which it does not exist) shows the following error:

Traceback (most recent call last):
  File "D:zjm_codea.py", line 6, in <module>
    b = a.index(6)
ValueError: list.index(x): x not in list

So I have to do this:

a = [1, 2, 3, 4]

try:
    b = a.index(6)
    del a[b]
except:
    pass

print(a)

But is there not a simpler way to do this?

1055

Answer #1

To remove an element"s first occurrence in a list, simply use list.remove:

>>> a = ["a", "b", "c", "d"]
>>> a.remove("b")
>>> print(a)
["a", "c", "d"]

Mind that it does not remove all occurrences of your element. Use a list comprehension for that.

>>> a = [10, 20, 30, 40, 20, 30, 40, 20, 70, 20]
>>> a = [x for x in a if x != 20]
>>> print(a)
[10, 30, 40, 30, 40, 70]

We hope this article has helped you to resolve the problem. Apart from Speed up millions of regex replacements in Python 3, check other __del__-related topics.

Want to excel in Python? See our review of the best Python online courses 2022. If you are interested in Data Science, check also how to learn programming in R.

By the way, this material is also available in other languages:



Xu Wu

Berlin | 2022-11-30

Simply put and clear. Thank you for sharing. Speed up millions of regex replacements in Python 3 and other issues with __del__ was always my weak point 😁. Checked yesterday, it works!

Dmitry Gonzalez

Shanghai | 2022-11-30

__del__ is always a bit confusing 😭 Speed up millions of regex replacements in Python 3 is not the only problem I encountered. I am just not quite sure it is the best method

Frank OConnell

California | 2022-11-30

Maybe there are another answers? What Speed up millions of regex replacements in Python 3 exactly means?. I just hope that will not emerge anymore

Shop

Learn programming in R: courses

$

Best Python online courses for 2022

$

Best laptop for Fortnite

$

Best laptop for Excel

$

Best laptop for Solidworks

$

Best laptop for Roblox

$

Best computer for crypto mining

$

Best laptop for Sims 4

$

Latest questions

NUMPYNUMPY

Common xlabel/ylabel for matplotlib subplots

12 answers

NUMPYNUMPY

How to specify multiple return types using type-hints

12 answers

NUMPYNUMPY

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

12 answers

NUMPYNUMPY

Flake8: Ignore specific warning for entire file

12 answers

NUMPYNUMPY

glob exclude pattern

12 answers

NUMPYNUMPY

How to avoid HTTP error 429 (Too Many Requests) python

12 answers

NUMPYNUMPY

Python CSV error: line contains NULL byte

12 answers

NUMPYNUMPY

csv.Error: iterator should return strings, not bytes

12 answers

News


Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

sin

How to specify multiple return types using type-hints

exp

Printing words vertically in Python

exp

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries

cos

Python add suffix / add prefix to strings in a list

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Python - Move item to the end of the list

Python - Print list vertically