string.punctuation in Python

punctuation | Python Methods and Functions | String Variables

In Python3 string.punctuation — it is a pre-initialized string used as a string constant. In Python, string.punctuation will give all sets of punctuation marks.

Syntax: string.punctuation

Parameters: Doesn't take any parameter, since it's not a function.

Returns: Return all sets of punctuation.

Note. Make sure to import the string.punctuation library function to use string.punctuation

Code # 1:

# import string library function

import string 

 
# Storage punctuation sets in variable result

result = string.pu nctuation 

 
# Print punctuation values ​​

print (result) 

Output:

! "# $% & amp;' () * +, -./:;<= & gt;? @ [] ^ _ '{|} ~ 

Code # 2: Punctuation check code is given.

# import string library function

import string 

 
# Input line.

Sentence = " Hey, Geeks!, How are you? "

 

for i   in Sentence:

 

# check if a character is punctuation.

if i in string.punctuation:

 

# Print punctuation values ​​

print ( "Punctuation:" + i)

 

Exit:

 Punctuation :, Punctuation:! Punctuation :, Punctuation:? 




string.punctuation in Python: StackOverflow Questions

Best way to strip punctuation from a string

Question by Lawrence Johnston

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("";""), string.punctuation)

Is there?

Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

How to get rid of punctuation using NLTK tokenizer?

I"m just starting to use NLTK and I don"t quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn"t work with multiple sentences: dots are added to the last word.

Answer #1

You can"t. Backslashes cannot appear inside the curly braces {}; doing so results in a SyntaxError:

>>> f"{}"
SyntaxError: f-string expression part cannot include a backslash

This is specified in the PEP for f-strings:

Backslashes may not appear inside the expression portions of f-strings, [...]

One option is assinging " " to a name and then .join on that inside the f-string; that is, without using a literal:

names = ["Adam", "Bob", "Cyril"]
nl = "
"
text = f"Winners are:{nl}{nl.join(names)}"
print(text)

Results in:

Winners are:
Adam
Bob
Cyril

Another option, as specified by @wim, is to use chr(10) to get returned and then join there. f"Winners are: {chr(10).join(names)}"

Yet another, of course, is to " ".join beforehand and then add the name accordingly:

n = "
".join(names)
text = f"Winners are:
{n}"

which results in the same output.

Note:

This is one of the small differences between f-strings and str.format. In the latter, you can always use punctuation granted that a corresponding wacky dict is unpacked that contains those keys:

>>> "{\} {*}".format(**{"\": "Hello", "*": "World!"})
"Hello World!"

(Please don"t do this.)

In the former, punctuation isn"t allowed because you can"t have identifiers that use them.


Aside: I would definitely opt for print or format, as the other answers suggest as an alternative. The options I"ve given only apply if you must for some reason use f-strings.

Just because something is new, doesn"t mean you should try and do everything with it ;-)

Answer #2

TLDR; No, for loops are not blanket "bad", at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (...depending on what you"re doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let"s examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must internally be handled by the function, to ensure working

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster then you think. What"s even better is list comprehensions (which create lists through for loops) are even faster as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

enter image description here

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehension come from not having to worry about the index alignment, but this means that if your code is dependent on indexing alignment, this will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example - this time, with another vanilla python construct that is faster than a for loop - collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

enter image description here

The results are more pronounced, Counter wins out over both vectorized methods for a larger range of small N (~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the take away from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don"t give you the performance you need, there is always cython and numba. Let"s add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

enter image description here

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

enter image description here

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

# Dictionary value extraction.
ser.map(operator.itemgetter("value"))     # map
pd.Series([x.get("value") for x in ser])  # list comprehension

enter image description here

Positional List Indexing
Timings for 3 operations that extract the 0th element from a list of columns (handling exceptions), map, str.get accessor method, and the list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

When reconstructing the series.

enter image description here

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

enter image description here

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depend widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it"s that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other "vectorized" string operations (such as str.split, str.find,str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python's re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df["col2"] = [p.search(x).group(0) for x in df["col"]]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df["col2"] = [matcher(x) for x in df["col"]]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just extract query the group or groups attribute of the matcher object.

For str.extractall, change p.search to p.findall.

String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if it is preceded by an upper case letter.

# Extracting strings.
p = re.compile(r"(?<=[A-Z])(d{4})")
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r"(?<=[A-Z])(d{4})", expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

enter image description here

More Examples
Full disclosure - I am the author (in part or whole) of these posts listed below.


Conclusion

As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it"s up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"]),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp", "numba"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N"
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=["value_counts", "np.unique", "Counter"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter("value")),
        lambda ser: pd.Series([x.get("value") for x in ser]),
    ],
    labels=["map", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([["a", "b", "c"], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=["map", "str accessor", "list comprehension", "list comp safe"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=["stack", "itertools.chain", "nested list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r"(?<=[A-Z])(d{4})", expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=["str.extract", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

Answer #3

For the convenience of usage, I sum up the note of striping punctuation from a string in both Python 2 and Python 3. Please refer to other answers for the detailed description.


Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("";"")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

Answer #4

From an efficiency perspective, you"re not going to beat

s.translate(None, string.punctuation)

For higher versions of Python use the following code:

s.translate(str.maketrans("", "", string.punctuation))

It"s performing raw string operations in C with a lookup table - there"s not much that will beat that but writing your own C code.

If speed isn"t a worry, another option though is:

exclude = set(string.punctuation)
s = "".join(ch for ch in s if ch not in exclude)

This is faster than s.replace with each char, but won"t perform as well as non-pure python approaches such as regexes or string.translate, as you can see from the below timings. For this type of problem, doing it at as low a level as possible pays off.

Timing code:

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("";"")
regex = re.compile("[%s]" % re.escape(string.punctuation))

def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko"s solution, with fix.
    return regex.sub("", s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott"s solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer("f(s)", "from __main__ import s,test_set as f").timeit(1000000)
print "regex     :",timeit.Timer("f(s)", "from __main__ import s,test_re as f").timeit(1000000)
print "translate :",timeit.Timer("f(s)", "from __main__ import s,test_trans as f").timeit(1000000)
print "replace   :",timeit.Timer("f(s)", "from __main__ import s,test_repl as f").timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

Answer #5

>>> import string
>>> string.ascii_lowercase
"abcdefghijklmnopqrstuvwxyz"

If you really need a list:

>>> list(string.ascii_lowercase)
["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]

And to do it with range

>>> list(map(chr, range(97, 123))) #or list(map(chr, range(ord("a"), ord("z")+1)))
["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]

Other helpful string module features:

>>> help(string) # on Python 3
....
DATA
    ascii_letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ascii_lowercase = "abcdefghijklmnopqrstuvwxyz"
    ascii_uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    digits = "0123456789"
    hexdigits = "0123456789abcdefABCDEF"
    octdigits = "01234567"
    printable = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&"()*+,-./:;<=>[email protected][\]^_`{|}~ 	

x0bx0c"
    punctuation = "!"#$%&"()*+,-./:;<=>[email protected][\]^_`{|}~"
    whitespace = " 	

x0bx0c"

Answer #6

The question you reference asks which languages promote both OO and functional programming. Python does not promote functional programming even though it works fairly well.

The best argument against functional programming in Python is that imperative/OO use cases are carefully considered by Guido, while functional programming use cases are not. When I write imperative Python, it"s one of the prettiest languages I know. When I write functional Python, it becomes as ugly and unpleasant as your average language that doesn"t have a BDFL.

Which is not to say that it"s bad, just that you have to work harder than you would if you switched to a language that promotes functional programming or switched to writing OO Python.

Here are the functional things I miss in Python:


  • No pattern matching and no tail recursion mean your basic algorithms have to be written imperatively. Recursion is ugly and slow in Python.
  • A small list library and no functional dictionaries mean that you have to write a lot of stuff yourself.
  • No syntax for currying or composition means that point-free style is about as full of punctuation as explicitly passing arguments.
  • Iterators instead of lazy lists means that you have to know whether you want efficiency or persistence, and to scatter calls to list around if you want persistence. (Iterators are use-once)
  • Python"s simple imperative syntax, along with its simple LL1 parser, mean that a better syntax for if-expressions and lambda-expressions is basically impossible. Guido likes it this way, and I think he"s right.

Answer #7

Readable regular expressions

In Python you can split a regular expression over multiple lines, name your matches and insert comments.

Example verbose syntax (from Dive into Python):

>>> pattern = """
... ^                   # beginning of string
... M{0,4}              # thousands - 0 to 4 M"s
... (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C"s),
...                     #            or 500-800 (D, followed by 0 to 3 C"s)
... (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X"s),
...                     #        or 50-80 (L, followed by 0 to 3 X"s)
... (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I"s),
...                     #        or 5-8 (V, followed by 0 to 3 I"s)
... $                   # end of string
... """
>>> re.search(pattern, "M", re.VERBOSE)

Example naming matches (from Regular Expression HOWTO)

>>> p = re.compile(r"(?P<word>w+)")
>>> m = p.search( "(((( Lots of punctuation )))" )
>>> m.group("word")
"Lots"

You can also verbosely write a regex without using re.VERBOSE thanks to string literal concatenation.

>>> pattern = (
...     "^"                 # beginning of string
...     "M{0,4}"            # thousands - 0 to 4 M"s
...     "(CM|CD|D?C{0,3})"  # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C"s),
...                         #            or 500-800 (D, followed by 0 to 3 C"s)
...     "(XC|XL|L?X{0,3})"  # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X"s),
...                         #        or 50-80 (L, followed by 0 to 3 X"s)
...     "(IX|IV|V?I{0,3})"  # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I"s),
...                         #        or 5-8 (V, followed by 0 to 3 I"s)
...     "$"                 # end of string
... )
>>> print pattern
"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"

Answer #8

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r"[^ws]","",s)

Get Solution for free from DataCamp guru