Speed up millions of regex replacements in Python 3

| | | | | | | | | | | | | |

👻 Check our latest review to choose the best laptop for Machine Learning engineers and Deep learning tasks!

I have two lists:

  • a list of about 750K "sentences" (long strings)
  • a list of about 20K "words" that I would like to delete from my 750K sentences

So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the  word-boundary metacharacter:

compiled_words = [re.compile(r"" + word + r"") for word in my20000words]

Then I loop through my "sentences":

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to using the str.replace method (which I believe is faster), but still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is > than the length of my sentence, but it"s not much of an improvement.

I"m using Python 3.5.2

👻 Read also: what is the best laptop for engineering students?

Speed up millions of regex replacements in Python 3 __del__: Questions

How can I make a time delay in Python?

5 answers

I would like to know how to put a time delay in a Python script.

2973

Answer #1

import time
time.sleep(5)   # Delays for 5 seconds. You can also use a float value.

Here is another example where something is run approximately once a minute:

import time
while True:
    print("This prints once a minute.")
    time.sleep(60) # Delay for 1 minute (60 seconds).

2973

Answer #2

You can use the sleep() function in the time module. It can take a float argument for sub-second resolution.

from time import sleep
sleep(0.1) # Time in seconds

Speed up millions of regex replacements in Python 3 __del__: Questions

How to delete a file or folder in Python?

5 answers

How do I delete a file or folder in Python?

2639

Answer #1


Path objects from the Python 3.4+ pathlib module also expose these instance methods:

2639

Answer #2


Path objects from the Python 3.4+ pathlib module also expose these instance methods:

2639

Answer #3

Python syntax to delete a file

import os
os.remove("/tmp/<file_name>.txt")

Or

import os
os.unlink("/tmp/<file_name>.txt")

Or

pathlib Library for Python version >= 3.4

file_to_rem = pathlib.Path("/tmp/<file_name>.txt")
file_to_rem.unlink()

Path.unlink(missing_ok=False)

Unlink method used to remove the file or the symbolik link.

If missing_ok is false (the default), FileNotFoundError is raised if the path does not exist.
If missing_ok is true, FileNotFoundError exceptions will be ignored (same behavior as the POSIX rm -f command).
Changed in version 3.8: The missing_ok parameter was added.

Best practice

  1. First, check whether the file or folder exists or not then only delete that file. This can be achieved in two ways :
    a. os.path.isfile("/path/to/file")
    b. Use exception handling.

EXAMPLE for os.path.isfile

#!/usr/bin/python
import os
myfile="/tmp/foo.txt"

## If file exists, delete it ##
if os.path.isfile(myfile):
    os.remove(myfile)
else:    ## Show an error ##
    print("Error: %s file not found" % myfile)

Exception Handling

#!/usr/bin/python
import os

## Get input ##
myfile= raw_input("Enter file name to delete: ")

## Try to delete the file ##
try:
    os.remove(myfile)
except OSError as e:  ## if failed, report it back to the user ##
    print ("Error: %s - %s." % (e.filename, e.strerror))

RESPECTIVE OUTPUT

Enter file name to delete : demo.txt
Error: demo.txt - No such file or directory.

Enter file name to delete : rrr.txt
Error: rrr.txt - Operation not permitted.

Enter file name to delete : foo.txt

Python syntax to delete a folder

shutil.rmtree()

Example for shutil.rmtree()

#!/usr/bin/python
import os
import sys
import shutil

# Get directory name
mydir= raw_input("Enter directory name: ")

## Try to remove tree; if failed show an error using try...except on screen
try:
    shutil.rmtree(mydir)
except OSError as e:
    print ("Error: %s - %s." % (e.filename, e.strerror))

Is there a simple way to delete a list element by value?

5 answers

I want to remove a value from a list if it exists in the list (which it may not).

a = [1, 2, 3, 4]
b = a.index(6)

del a[b]
print(a)

The above case (in which it does not exist) shows the following error:

Traceback (most recent call last):
  File "D:zjm_codea.py", line 6, in <module>
    b = a.index(6)
ValueError: list.index(x): x not in list

So I have to do this:

a = [1, 2, 3, 4]

try:
    b = a.index(6)
    del a[b]
except:
    pass

print(a)

But is there not a simpler way to do this?

1055

Answer #1

To remove an element"s first occurrence in a list, simply use list.remove:

>>> a = ["a", "b", "c", "d"]
>>> a.remove("b")
>>> print(a)
["a", "c", "d"]

Mind that it does not remove all occurrences of your element. Use a list comprehension for that.

>>> a = [10, 20, 30, 40, 20, 30, 40, 20, 70, 20]
>>> a = [x for x in a if x != 20]
>>> print(a)
[10, 30, 40, 30, 40, 70]

We hope this article has helped you to resolve the problem. Apart from Speed up millions of regex replacements in Python 3, check other __del__-related topics.

Want to excel in Python? See our review of the best Python online courses 2023. If you are interested in Data Science, check also how to learn programming in R.

By the way, this material is also available in other languages:



Olivia Zelotti

Abu Dhabi | 2023-04-01

code Python module is always a bit confusing 😭 Speed up millions of regex replacements in Python 3 is not the only problem I encountered. Will use it in my bachelor thesis

Schneider Williams

Tallinn | 2023-04-01

Maybe there are another answers? What Speed up millions of regex replacements in Python 3 exactly means?. I just hope that will not emerge anymore

Oliver Innsbruck

Vigrinia | 2023-04-01

Thanks for explaining! I was stuck with Speed up millions of regex replacements in Python 3 for some hours, finally got it done 🤗. Will get back tomorrow with feedback

Shop

Gifts for programmers

Learn programming in R: courses

$FREE
Gifts for programmers

Best Python online courses for 2022

$FREE
Gifts for programmers

Best laptop for Fortnite

$399+
Gifts for programmers

Best laptop for Excel

$
Gifts for programmers

Best laptop for Solidworks

$399+
Gifts for programmers

Best laptop for Roblox

$399+
Gifts for programmers

Best computer for crypto mining

$499+
Gifts for programmers

Best laptop for Sims 4

$

Latest questions

PythonStackOverflow

Common xlabel/ylabel for matplotlib subplots

1947 answers

PythonStackOverflow

Check if one list is a subset of another in Python

1173 answers

PythonStackOverflow

How to specify multiple return types using type-hints

1002 answers

PythonStackOverflow

Printing words vertically in Python

909 answers

PythonStackOverflow

Python Extract words from a given string

798 answers

PythonStackOverflow

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

606 answers

PythonStackOverflow

Python os.path.join () method

384 answers

PythonStackOverflow

Flake8: Ignore specific warning for entire file

360 answers

News


Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

How to specify multiple return types using type-hints

Printing words vertically in Python

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries

Python add suffix / add prefix to strings in a list

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Python - Move item to the end of the list

Python - Print list vertically