Python | Pandas Series.str.strip (), lstrip () and rstrip ()

Python Methods and Functions | rstrip

Pandas provide 3 methods to handle whitespace (including newline) in any text data. As the name suggests, str.lstrip () is used to remove spaces from the left side of a string, str.rstrip () is used to remove spaces from the right side of a string, and str.strip () removes spaces from both sides. Since these are pandas functions with the same name as the default Python functions, you need to add the prefix .str to tell the compiler to call the Pandas function.

Syntax: Series.str.strip()

Return Type: Series with removed spaces

To download the CSV used in the code, click here.

In the following examples, the data frame used contains data for some NBA players. Since none of the values ​​in the data frame have extra spaces, spaces are added to some elements with str.replace () method . The image of the data frame before any operations is shown below. 

Example # 1: Using lstrip ()

This example creates a new series, similar to the "Command" column, with 2 spaces at the beginning and end of the line. After that str.lstrip() method str.lstrip() which is str .lstrip () with custom string with left spaces removed.

# pandas module import

import pandas as pd

  
# create data frame

data = pd.read_csv ( " https://media.python.engineering/wp-content/uploads/nba.csv " )

  
# replacement of the name commands and adding spaces at the beginning and end

new = data [ " Team " ]. replace ( " Boston Celtics " , " Boston Celtics " ). Copy ()

 
# check with line removed

new. str . lstrip () = = "Boston Celtics "

Exit:
As shown in the output image, the comparison is performed after removing the spaces on the left. 

Example # 2: Using strip ()

This example uses the str.strip () method to remove spaces from both the left and right sides of a string. A new copy of the Command column is created with two leading and trailing spaces. Then the str.strip () method is str.strip () for that series. It is then compared with Boston Celtics, Boston Celtics and Boston Celtics to see if the spaces have been removed on both sides or not.

# pandas module import

import pandas as pd

 
# create data frame

data = pd.read_csv ( " https://media.python.engineering/wp-content/uploads/nba.csv " )

  
# replace command name and add leading and trailing spaces

new = data [ "Team" ]. replace ( "Boston Celtics" , " Boston Celtics " ). copy ()

  
# check with custom string

new. str . strip () = = " Boston Celtic"

new. str . strip () = = "Boston Celtics "

new. str . Strip () = = " Boston Celtic "

Exit:
As shown in the output image, the comparison returns False for all 3 conditions, which means the spaces were successfully removed from both sides, and there are no more spaces in the line. 

Example # 3: Using rstrip ()

This example creates a new series, similar to the "Command" column, with 2 spaces at the beginning and end of the line. After that the str.rstrip () method is str.rstrip () which is str.rstrip () with a custom string with right spaces removed.

# pandas module import

import pandas as pd

  
# create data frame

data = pd.read_csv ( " https:// media. python.engineering/wp-content/uploads/nba.csv " )

  
# replace the command name and add leading and trailing spaces < / p>

new = data [ "Team" ]. replace ( "Boston Celtics" , " Boston Celtics " ). copy ()

 
# check with line removed

new. str . rstrip () = = " Boston Celtics"

Output:
As shown in the output image, the comparison is performed after removing the right spaces. 





Python | Pandas Series.str.strip (), lstrip () and rstrip (): StackOverflow Questions

Answer #1

To somewhat expand on the earlier answers here, there are a number of details which are commonly overlooked.

  • Prefer subprocess.run() over subprocess.check_call() and friends over subprocess.call() over subprocess.Popen() over os.system() over os.popen()
  • Understand and probably use text=True, aka universal_newlines=True.
  • Understand the meaning of shell=True or shell=False and how it changes quoting and the availability of shell conveniences.
  • Understand differences between sh and Bash
  • Understand how a subprocess is separate from its parent, and generally cannot change the parent.
  • Avoid running the Python interpreter as a subprocess of Python.

These topics are covered in some more detail below.

Prefer subprocess.run() or subprocess.check_call()

The subprocess.Popen() function is a low-level workhorse but it is tricky to use correctly and you end up copy/pasting multiple lines of code ... which conveniently already exist in the standard library as a set of higher-level wrapper functions for various purposes, which are presented in more detail in the following.

Here"s a paragraph from the documentation:

The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying Popen interface can be used directly.

Unfortunately, the availability of these wrapper functions differs between Python versions.

  • subprocess.run() was officially introduced in Python 3.5. It is meant to replace all of the following.
  • subprocess.check_output() was introduced in Python 2.7 / 3.1. It is basically equivalent to subprocess.run(..., check=True, stdout=subprocess.PIPE).stdout
  • subprocess.check_call() was introduced in Python 2.5. It is basically equivalent to subprocess.run(..., check=True)
  • subprocess.call() was introduced in Python 2.4 in the original subprocess module (PEP-324). It is basically equivalent to subprocess.run(...).returncode

High-level API vs subprocess.Popen()

The refactored and extended subprocess.run() is more logical and more versatile than the older legacy functions it replaces. It returns a CompletedProcess object which has various methods which allow you to retrieve the exit status, the standard output, and a few other results and status indicators from the finished subprocess.

subprocess.run() is the way to go if you simply need a program to run and return control to Python. For more involved scenarios (background processes, perhaps with interactive I/O with the Python parent program) you still need to use subprocess.Popen() and take care of all the plumbing yourself. This requires a fairly intricate understanding of all the moving parts and should not be undertaken lightly. The simpler Popen object represents the (possibly still-running) process which needs to be managed from your code for the remainder of the lifetime of the subprocess.

It should perhaps be emphasized that just subprocess.Popen() merely creates a process. If you leave it at that, you have a subprocess running concurrently alongside with Python, so a "background" process. If it doesn"t need to do input or output or otherwise coordinate with you, it can do useful work in parallel with your Python program.

Avoid os.system() and os.popen()

Since time eternal (well, since Python 2.5) the os module documentation has contained the recommendation to prefer subprocess over os.system():

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.

The problems with system() are that it"s obviously system-dependent and doesn"t offer ways to interact with the subprocess. It simply runs, with standard output and standard error outside of Python"s reach. The only information Python receives back is the exit status of the command (zero means success, though the meaning of non-zero values is also somewhat system-dependent).

PEP-324 (which was already mentioned above) contains a more detailed rationale for why os.system is problematic and how subprocess attempts to solve those issues.

os.popen() used to be even more strongly discouraged:

Deprecated since version 2.6: This function is obsolete. Use the subprocess module.

However, since sometime in Python 3, it has been reimplemented to simply use subprocess, and redirects to the subprocess.Popen() documentation for details.

Understand and usually use check=True

You"ll also notice that subprocess.call() has many of the same limitations as os.system(). In regular use, you should generally check whether the process finished successfully, which subprocess.check_call() and subprocess.check_output() do (where the latter also returns the standard output of the finished subprocess). Similarly, you should usually use check=True with subprocess.run() unless you specifically need to allow the subprocess to return an error status.

In practice, with check=True or subprocess.check_*, Python will throw a CalledProcessError exception if the subprocess returns a nonzero exit status.

A common error with subprocess.run() is to omit check=True and be surprised when downstream code fails if the subprocess failed.

On the other hand, a common problem with check_call() and check_output() was that users who blindly used these functions were surprised when the exception was raised e.g. when grep did not find a match. (You should probably replace grep with native Python code anyway, as outlined below.)

All things counted, you need to understand how shell commands return an exit code, and under what conditions they will return a non-zero (error) exit code, and make a conscious decision how exactly it should be handled.

Understand and probably use text=True aka universal_newlines=True

Since Python 3, strings internal to Python are Unicode strings. But there is no guarantee that a subprocess generates Unicode output, or strings at all.

(If the differences are not immediately obvious, Ned Batchelder"s Pragmatic Unicode is recommended, if not outright obligatory, reading. There is a 36-minute video presentation behind the link if you prefer, though reading the page yourself will probably take significantly less time.)

Deep down, Python has to fetch a bytes buffer and interpret it somehow. If it contains a blob of binary data, it shouldn"t be decoded into a Unicode string, because that"s error-prone and bug-inducing behavior - precisely the sort of pesky behavior which riddled many Python 2 scripts, before there was a way to properly distinguish between encoded text and binary data.

With text=True, you tell Python that you, in fact, expect back textual data in the system"s default encoding, and that it should be decoded into a Python (Unicode) string to the best of Python"s ability (usually UTF-8 on any moderately up to date system, except perhaps Windows?)

If that"s not what you request back, Python will just give you bytes strings in the stdout and stderr strings. Maybe at some later point you do know that they were text strings after all, and you know their encoding. Then, you can decode them.

normal = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True,
    text=True)
print(normal.stdout)

convoluted = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True)
# You have to know (or guess) the encoding
print(convoluted.stdout.decode("utf-8"))

Python 3.7 introduced the shorter and more descriptive and understandable alias text for the keyword argument which was previously somewhat misleadingly called universal_newlines.

Understand shell=True vs shell=False

With shell=True you pass a single string to your shell, and the shell takes it from there.

With shell=False you pass a list of arguments to the OS, bypassing the shell.

When you don"t have a shell, you save a process and get rid of a fairly substantial amount of hidden complexity, which may or may not harbor bugs or even security problems.

On the other hand, when you don"t have a shell, you don"t have redirection, wildcard expansion, job control, and a large number of other shell features.

A common mistake is to use shell=True and then still pass Python a list of tokens, or vice versa. This happens to work in some cases, but is really ill-defined and could break in interesting ways.

# XXX AVOID THIS BUG
buggy = subprocess.run("dig +short stackoverflow.com")

# XXX AVOID THIS BUG TOO
broken = subprocess.run(["dig", "+short", "stackoverflow.com"],
    shell=True)

# XXX DEFINITELY AVOID THIS
pathological = subprocess.run(["dig +short stackoverflow.com"],
    shell=True)

correct = subprocess.run(["dig", "+short", "stackoverflow.com"],
    # Probably don"t forget these, too
    check=True, text=True)

# XXX Probably better avoid shell=True
# but this is nominally correct
fixed_but_fugly = subprocess.run("dig +short stackoverflow.com",
    shell=True,
    # Probably don"t forget these, too
    check=True, text=True)

The common retort "but it works for me" is not a useful rebuttal unless you understand exactly under what circumstances it could stop working.

Refactoring Example

Very often, the features of the shell can be replaced with native Python code. Simple Awk or sed scripts should probably simply be translated to Python instead.

To partially illustrate this, here is a typical but slightly silly example which involves many shell features.

cmd = """while read -r x;
   do ping -c 3 "$x" | grep "round-trip min/avg/max"
   done <hosts.txt"""

# Trivial but horrible
results = subprocess.run(
    cmd, shell=True, universal_newlines=True, check=True)
print(results.stdout)

# Reimplement with shell=False
with open("hosts.txt") as hosts:
    for host in hosts:
        host = host.rstrip("
")  # drop newline
        ping = subprocess.run(
             ["ping", "-c", "3", host],
             text=True,
             stdout=subprocess.PIPE,
             check=True)
        for line in ping.stdout.split("
"):
             if "round-trip min/avg/max" in line:
                 print("{}: {}".format(host, line))

Some things to note here:

  • With shell=False you don"t need the quoting that the shell requires around strings. Putting quotes anyway is probably an error.
  • It often makes sense to run as little code as possible in a subprocess. This gives you more control over execution from within your Python code.
  • Having said that, complex shell pipelines are tedious and sometimes challenging to reimplement in Python.

The refactored code also illustrates just how much the shell really does for you with a very terse syntax -- for better or for worse. Python says explicit is better than implicit but the Python code is rather verbose and arguably looks more complex than this really is. On the other hand, it offers a number of points where you can grab control in the middle of something else, as trivially exemplified by the enhancement that we can easily include the host name along with the shell command output. (This is by no means challenging to do in the shell, either, but at the expense of yet another diversion and perhaps another process.)

Common Shell Constructs

For completeness, here are brief explanations of some of these shell features, and some notes on how they can perhaps be replaced with native Python facilities.

  • Globbing aka wildcard expansion can be replaced with glob.glob() or very often with simple Python string comparisons like for file in os.listdir("."): if not file.endswith(".png"): continue. Bash has various other expansion facilities like .{png,jpg} brace expansion and {1..100} as well as tilde expansion (~ expands to your home directory, and more generally ~account to the home directory of another user)
  • Shell variables like $SHELL or $my_exported_var can sometimes simply be replaced with Python variables. Exported shell variables are available as e.g. os.environ["SHELL"] (the meaning of export is to make the variable available to subprocesses -- a variable which is not available to subprocesses will obviously not be available to Python running as a subprocess of the shell, or vice versa. The env= keyword argument to subprocess methods allows you to define the environment of the subprocess as a dictionary, so that"s one way to make a Python variable visible to a subprocess). With shell=False you will need to understand how to remove any quotes; for example, cd "$HOME" is equivalent to os.chdir(os.environ["HOME"]) without quotes around the directory name. (Very often cd is not useful or necessary anyway, and many beginners omit the double quotes around the variable and get away with it until one day ...)
  • Redirection allows you to read from a file as your standard input, and write your standard output to a file. grep "foo" <inputfile >outputfile opens outputfile for writing and inputfile for reading, and passes its contents as standard input to grep, whose standard output then lands in outputfile. This is not generally hard to replace with native Python code.
  • Pipelines are a form of redirection. echo foo | nl runs two subprocesses, where the standard output of echo is the standard input of nl (on the OS level, in Unix-like systems, this is a single file handle). If you cannot replace one or both ends of the pipeline with native Python code, perhaps think about using a shell after all, especially if the pipeline has more than two or three processes (though look at the pipes module in the Python standard library or a number of more modern and versatile third-party competitors).
  • Job control lets you interrupt jobs, run them in the background, return them to the foreground, etc. The basic Unix signals to stop and continue a process are of course available from Python, too. But jobs are a higher-level abstraction in the shell which involve process groups etc which you have to understand if you want to do something like this from Python.
  • Quoting in the shell is potentially confusing until you understand that everything is basically a string. So ls -l / is equivalent to "ls" "-l" "/" but the quoting around literals is completely optional. Unquoted strings which contain shell metacharacters undergo parameter expansion, whitespace tokenization and wildcard expansion; double quotes prevent whitespace tokenization and wildcard expansion but allow parameter expansions (variable substitution, command substitution, and backslash processing). This is simple in theory but can get bewildering, especially when there are several layers of interpretation (a remote shell command, for example).

Understand differences between sh and Bash

subprocess runs your shell commands with /bin/sh unless you specifically request otherwise (except of course on Windows, where it uses the value of the COMSPEC variable). This means that various Bash-only features like arrays, [[ etc are not available.

If you need to use Bash-only syntax, you can pass in the path to the shell as executable="/bin/bash" (where of course if your Bash is installed somewhere else, you need to adjust the path).

subprocess.run("""
    # This for loop syntax is Bash only
    for((i=1;i<=$#;i++)); do
        # Arrays are Bash-only
        array[i]+=123
    done""",
    shell=True, check=True,
    executable="/bin/bash")

A subprocess is separate from its parent, and cannot change it

A somewhat common mistake is doing something like

subprocess.run("cd /tmp", shell=True)
subprocess.run("pwd", shell=True)  # Oops, doesn"t print /tmp

The same thing will happen if the first subprocess tries to set an environment variable, which of course will have disappeared when you run another subprocess, etc.

A child process runs completely separate from Python, and when it finishes, Python has no idea what it did (apart from the vague indicators that it can infer from the exit status and output from the child process). A child generally cannot change the parent"s environment; it cannot set a variable, change the working directory, or, in so many words, communicate with its parent without cooperation from the parent.

The immediate fix in this particular case is to run both commands in a single subprocess;

subprocess.run("cd /tmp; pwd", shell=True)

though obviously this particular use case isn"t very useful; instead, use the cwd keyword argument, or simply os.chdir() before running the subprocess. Similarly, for setting a variable, you can manipulate the environment of the current process (and thus also its children) via

os.environ["foo"] = "bar"

or pass an environment setting to a child process with

subprocess.run("echo "$foo"", shell=True, env={"foo": "bar"})

(not to mention the obvious refactoring subprocess.run(["echo", "bar"]); but echo is a poor example of something to run in a subprocess in the first place, of course).

Don"t run Python from Python

This is slightly dubious advice; there are certainly situations where it does make sense or is even an absolute requirement to run the Python interpreter as a subprocess from a Python script. But very frequently, the correct approach is simply to import the other Python module into your calling script and call its functions directly.

If the other Python script is under your control, and it isn"t a module, consider turning it into one. (This answer is too long already so I will not delve into details here.)

If you need parallelism, you can run Python functions in subprocesses with the multiprocessing module. There is also threading which runs multiple tasks in a single process (which is more lightweight and gives you more control, but also more constrained in that threads within a process are tightly coupled, and bound to a single GIL.)

Answer #2

According to Python"s Methods of File Objects, the simplest way to convert a text file into a list is:

with open("file.txt") as f:
    my_list = list(f)
    # my_list = [x.rstrip() for x in f] # remove line breaks

If you just need to iterate over the text file lines, you can use:

with open("file.txt") as f:
    for line in f:
       ...

Old answer:

Using with and readlines() :

with open("file.txt") as f:
    lines = f.readlines()

If you don"t care about closing the file, this one-liner works:

lines = open("file.txt").readlines()

The traditional way:

f = open("file.txt") # Open file on read mode
lines = f.read().splitlines() # List with stripped line-breaks
f.close() # Close file

Answer #3

tl;dr

Call the is_path_exists_or_creatable() function defined below.

Strictly Python 3. That"s just how we roll.

A Tale of Two Questions

The question of "How do I test pathname validity and, for valid pathnames, the existence or writability of those paths?" is clearly two separate questions. Both are interesting, and neither have received a genuinely satisfactory answer here... or, well, anywhere that I could grep.

vikki"s answer probably hews the closest, but has the remarkable disadvantages of:

  • Needlessly opening (...and then failing to reliably close) file handles.
  • Needlessly writing (...and then failing to reliable close or delete) 0-byte files.
  • Ignoring OS-specific errors differentiating between non-ignorable invalid pathnames and ignorable filesystem issues. Unsurprisingly, this is critical under Windows. (See below.)
  • Ignoring race conditions resulting from external processes concurrently (re)moving parent directories of the pathname to be tested. (See below.)
  • Ignoring connection timeouts resulting from this pathname residing on stale, slow, or otherwise temporarily inaccessible filesystems. This could expose public-facing services to potential DoS-driven attacks. (See below.)

We"re gonna fix all that.

Question #0: What"s Pathname Validity Again?

Before hurling our fragile meat suits into the python-riddled moshpits of pain, we should probably define what we mean by "pathname validity." What defines validity, exactly?

By "pathname validity," we mean the syntactic correctness of a pathname with respect to the root filesystem of the current system – regardless of whether that path or parent directories thereof physically exist. A pathname is syntactically correct under this definition if it complies with all syntactic requirements of the root filesystem.

By "root filesystem," we mean:

  • On POSIX-compatible systems, the filesystem mounted to the root directory (/).
  • On Windows, the filesystem mounted to %HOMEDRIVE%, the colon-suffixed drive letter containing the current Windows installation (typically but not necessarily C:).

The meaning of "syntactic correctness," in turn, depends on the type of root filesystem. For ext4 (and most but not all POSIX-compatible) filesystems, a pathname is syntactically correct if and only if that pathname:

  • Contains no null bytes (i.e., x00 in Python). This is a hard requirement for all POSIX-compatible filesystems.
  • Contains no path components longer than 255 bytes (e.g., "a"*256 in Python). A path component is a longest substring of a pathname containing no / character (e.g., bergtatt, ind, i, and fjeldkamrene in the pathname /bergtatt/ind/i/fjeldkamrene).

Syntactic correctness. Root filesystem. That"s it.

Question #1: How Now Shall We Do Pathname Validity?

Validating pathnames in Python is surprisingly non-intuitive. I"m in firm agreement with Fake Name here: the official os.path package should provide an out-of-the-box solution for this. For unknown (and probably uncompelling) reasons, it doesn"t. Fortunately, unrolling your own ad-hoc solution isn"t that gut-wrenching...

O.K., it actually is. It"s hairy; it"s nasty; it probably chortles as it burbles and giggles as it glows. But what you gonna do? Nuthin".

We"ll soon descend into the radioactive abyss of low-level code. But first, let"s talk high-level shop. The standard os.stat() and os.lstat() functions raise the following exceptions when passed invalid pathnames:

  • For pathnames residing in non-existing directories, instances of FileNotFoundError.
  • For pathnames residing in existing directories:
    • Under Windows, instances of WindowsError whose winerror attribute is 123 (i.e., ERROR_INVALID_NAME).
    • Under all other OSes:
    • For pathnames containing null bytes (i.e., "x00"), instances of TypeError.
    • For pathnames containing path components longer than 255 bytes, instances of OSError whose errcode attribute is:
      • Under SunOS and the *BSD family of OSes, errno.ERANGE. (This appears to be an OS-level bug, otherwise referred to as "selective interpretation" of the POSIX standard.)
      • Under all other OSes, errno.ENAMETOOLONG.

Crucially, this implies that only pathnames residing in existing directories are validatable. The os.stat() and os.lstat() functions raise generic FileNotFoundError exceptions when passed pathnames residing in non-existing directories, regardless of whether those pathnames are invalid or not. Directory existence takes precedence over pathname invalidity.

Does this mean that pathnames residing in non-existing directories are not validatable? Yes – unless we modify those pathnames to reside in existing directories. Is that even safely feasible, however? Shouldn"t modifying a pathname prevent us from validating the original pathname?

To answer this question, recall from above that syntactically correct pathnames on the ext4 filesystem contain no path components (A) containing null bytes or (B) over 255 bytes in length. Hence, an ext4 pathname is valid if and only if all path components in that pathname are valid. This is true of most real-world filesystems of interest.

Does that pedantic insight actually help us? Yes. It reduces the larger problem of validating the full pathname in one fell swoop to the smaller problem of only validating all path components in that pathname. Any arbitrary pathname is validatable (regardless of whether that pathname resides in an existing directory or not) in a cross-platform manner by following the following algorithm:

  1. Split that pathname into path components (e.g., the pathname /troldskog/faren/vild into the list ["", "troldskog", "faren", "vild"]).
  2. For each such component:
    1. Join the pathname of a directory guaranteed to exist with that component into a new temporary pathname (e.g., /troldskog) .
    2. Pass that pathname to os.stat() or os.lstat(). If that pathname and hence that component is invalid, this call is guaranteed to raise an exception exposing the type of invalidity rather than a generic FileNotFoundError exception. Why? Because that pathname resides in an existing directory. (Circular logic is circular.)

Is there a directory guaranteed to exist? Yes, but typically only one: the topmost directory of the root filesystem (as defined above).

Passing pathnames residing in any other directory (and hence not guaranteed to exist) to os.stat() or os.lstat() invites race conditions, even if that directory was previously tested to exist. Why? Because external processes cannot be prevented from concurrently removing that directory after that test has been performed but before that pathname is passed to os.stat() or os.lstat(). Unleash the dogs of mind-fellating insanity!

There exists a substantial side benefit to the above approach as well: security. (Isn"t that nice?) Specifically:

Front-facing applications validating arbitrary pathnames from untrusted sources by simply passing such pathnames to os.stat() or os.lstat() are susceptible to Denial of Service (DoS) attacks and other black-hat shenanigans. Malicious users may attempt to repeatedly validate pathnames residing on filesystems known to be stale or otherwise slow (e.g., NFS Samba shares); in that case, blindly statting incoming pathnames is liable to either eventually fail with connection timeouts or consume more time and resources than your feeble capacity to withstand unemployment.

The above approach obviates this by only validating the path components of a pathname against the root directory of the root filesystem. (If even that"s stale, slow, or inaccessible, you"ve got larger problems than pathname validation.)

Lost? Great. Let"s begin. (Python 3 assumed. See "What Is Fragile Hope for 300, leycec?")

import errno, os

# Sadly, Python fails to provide the following magic number for us.
ERROR_INVALID_NAME = 123
"""
Windows-specific error code indicating an invalid pathname.

See Also
----------
https://docs.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
    Official listing of all such codes.
"""

def is_pathname_valid(pathname: str) -> bool:
    """
    `True` if the passed pathname is a valid pathname for the current OS;
    `False` otherwise.
    """
    # If this pathname is either not a string or is but is empty, this pathname
    # is invalid.
    try:
        if not isinstance(pathname, str) or not pathname:
            return False

        # Strip this pathname"s Windows-specific drive specifier (e.g., `C:`)
        # if any. Since Windows prohibits path components from containing `:`
        # characters, failing to strip this `:`-suffixed prefix would
        # erroneously invalidate all valid absolute Windows pathnames.
        _, pathname = os.path.splitdrive(pathname)

        # Directory guaranteed to exist. If the current OS is Windows, this is
        # the drive to which Windows was installed (e.g., the "%HOMEDRIVE%"
        # environment variable); else, the typical root directory.
        root_dirname = os.environ.get("HOMEDRIVE", "C:") 
            if sys.platform == "win32" else os.path.sep
        assert os.path.isdir(root_dirname)   # ...Murphy and her ironclad Law

        # Append a path separator to this directory if needed.
        root_dirname = root_dirname.rstrip(os.path.sep) + os.path.sep

        # Test whether each path component split from this pathname is valid or
        # not, ignoring non-existent and non-readable path components.
        for pathname_part in pathname.split(os.path.sep):
            try:
                os.lstat(root_dirname + pathname_part)
            # If an OS-specific exception is raised, its error code
            # indicates whether this pathname is valid or not. Unless this
            # is the case, this exception implies an ignorable kernel or
            # filesystem complaint (e.g., path not found or inaccessible).
            #
            # Only the following exceptions indicate invalid pathnames:
            #
            # * Instances of the Windows-specific "WindowsError" class
            #   defining the "winerror" attribute whose value is
            #   "ERROR_INVALID_NAME". Under Windows, "winerror" is more
            #   fine-grained and hence useful than the generic "errno"
            #   attribute. When a too-long pathname is passed, for example,
            #   "errno" is "ENOENT" (i.e., no such file or directory) rather
            #   than "ENAMETOOLONG" (i.e., file name too long).
            # * Instances of the cross-platform "OSError" class defining the
            #   generic "errno" attribute whose value is either:
            #   * Under most POSIX-compatible OSes, "ENAMETOOLONG".
            #   * Under some edge-case OSes (e.g., SunOS, *BSD), "ERANGE".
            except OSError as exc:
                if hasattr(exc, "winerror"):
                    if exc.winerror == ERROR_INVALID_NAME:
                        return False
                elif exc.errno in {errno.ENAMETOOLONG, errno.ERANGE}:
                    return False
    # If a "TypeError" exception was raised, it almost certainly has the
    # error message "embedded NUL character" indicating an invalid pathname.
    except TypeError as exc:
        return False
    # If no exception was raised, all path components and hence this
    # pathname itself are valid. (Praise be to the curmudgeonly python.)
    else:
        return True
    # If any other exception was raised, this is an unrelated fatal issue
    # (e.g., a bug). Permit this exception to unwind the call stack.
    #
    # Did we mention this should be shipped with Python already?

Done. Don"t squint at that code. (It bites.)

Question #2: Possibly Invalid Pathname Existence or Creatability, Eh?

Testing the existence or creatability of possibly invalid pathnames is, given the above solution, mostly trivial. The little key here is to call the previously defined function before testing the passed path:

def is_path_creatable(pathname: str) -> bool:
    """
    `True` if the current user has sufficient permissions to create the passed
    pathname; `False` otherwise.
    """
    # Parent directory of the passed path. If empty, we substitute the current
    # working directory (CWD) instead.
    dirname = os.path.dirname(pathname) or os.getcwd()
    return os.access(dirname, os.W_OK)

def is_path_exists_or_creatable(pathname: str) -> bool:
    """
    `True` if the passed pathname is a valid pathname for the current OS _and_
    either currently exists or is hypothetically creatable; `False` otherwise.

    This function is guaranteed to _never_ raise exceptions.
    """
    try:
        # To prevent "os" module calls from raising undesirable exceptions on
        # invalid pathnames, is_pathname_valid() is explicitly called first.
        return is_pathname_valid(pathname) and (
            os.path.exists(pathname) or is_path_creatable(pathname))
    # Report failure on non-fatal filesystem complaints (e.g., connection
    # timeouts, permissions issues) implying this path to be inaccessible. All
    # other exceptions are unrelated fatal issues and should not be caught here.
    except OSError:
        return False

Done and done. Except not quite.

Question #3: Possibly Invalid Pathname Existence or Writability on Windows

There exists a caveat. Of course there does.

As the official os.access() documentation admits:

Note: I/O operations may fail even when os.access() indicates that they would succeed, particularly for operations on network filesystems which may have permissions semantics beyond the usual POSIX permission-bit model.

To no one"s surprise, Windows is the usual suspect here. Thanks to extensive use of Access Control Lists (ACL) on NTFS filesystems, the simplistic POSIX permission-bit model maps poorly to the underlying Windows reality. While this (arguably) isn"t Python"s fault, it might nonetheless be of concern for Windows-compatible applications.

If this is you, a more robust alternative is wanted. If the passed path does not exist, we instead attempt to create a temporary file guaranteed to be immediately deleted in the parent directory of that path – a more portable (if expensive) test of creatability:

import os, tempfile

def is_path_sibling_creatable(pathname: str) -> bool:
    """
    `True` if the current user has sufficient permissions to create **siblings**
    (i.e., arbitrary files in the parent directory) of the passed pathname;
    `False` otherwise.
    """
    # Parent directory of the passed path. If empty, we substitute the current
    # working directory (CWD) instead.
    dirname = os.path.dirname(pathname) or os.getcwd()

    try:
        # For safety, explicitly close and hence delete this temporary file
        # immediately after creating it in the passed path"s parent directory.
        with tempfile.TemporaryFile(dir=dirname): pass
        return True
    # While the exact type of exception raised by the above function depends on
    # the current version of the Python interpreter, all such types subclass the
    # following exception superclass.
    except EnvironmentError:
        return False

def is_path_exists_or_creatable_portable(pathname: str) -> bool:
    """
    `True` if the passed pathname is a valid pathname on the current OS _and_
    either currently exists or is hypothetically creatable in a cross-platform
    manner optimized for POSIX-unfriendly filesystems; `False` otherwise.

    This function is guaranteed to _never_ raise exceptions.
    """
    try:
        # To prevent "os" module calls from raising undesirable exceptions on
        # invalid pathnames, is_pathname_valid() is explicitly called first.
        return is_pathname_valid(pathname) and (
            os.path.exists(pathname) or is_path_sibling_creatable(pathname))
    # Report failure on non-fatal filesystem complaints (e.g., connection
    # timeouts, permissions issues) implying this path to be inaccessible. All
    # other exceptions are unrelated fatal issues and should not be caught here.
    except OSError:
        return False

Note, however, that even this may not be enough.

Thanks to User Access Control (UAC), the ever-inimicable Windows Vista and all subsequent iterations thereof blatantly lie about permissions pertaining to system directories. When non-Administrator users attempt to create files in either the canonical C:Windows or C:Windowssystem32 directories, UAC superficially permits the user to do so while actually isolating all created files into a "Virtual Store" in that user"s profile. (Who could have possibly imagined that deceiving users would have harmful long-term consequences?)

This is crazy. This is Windows.

Prove It

Dare we? It"s time to test-drive the above tests.

Since NULL is the only character prohibited in pathnames on UNIX-oriented filesystems, let"s leverage that to demonstrate the cold, hard truth – ignoring non-ignorable Windows shenanigans, which frankly bore and anger me in equal measure:

>>> print(""foo.bar" valid? " + str(is_pathname_valid("foo.bar")))
"foo.bar" valid? True
>>> print("Null byte valid? " + str(is_pathname_valid("x00")))
Null byte valid? False
>>> print("Long path valid? " + str(is_pathname_valid("a" * 256)))
Long path valid? False
>>> print(""/dev" exists or creatable? " + str(is_path_exists_or_creatable("/dev")))
"/dev" exists or creatable? True
>>> print(""/dev/foo.bar" exists or creatable? " + str(is_path_exists_or_creatable("/dev/foo.bar")))
"/dev/foo.bar" exists or creatable? False
>>> print("Null byte exists or creatable? " + str(is_path_exists_or_creatable("x00")))
Null byte exists or creatable? False

Beyond sanity. Beyond pain. You will find Python portability concerns.

Answer #4

How do I remove unwanted parts from strings in a column?

6 years after the original question was posted, pandas now has a good number of "vectorised" string functions that can succinctly perform these string manipulation operations.

This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.


.str.replace

Specify the substring/pattern to match, and the substring to replace it with.

pd.__version__
# "0.24.1"

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

df["result"] = df["result"].str.replace(r"D", "")
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,

df["result"] = df["result"].str.replace(r"D", "").astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don"t want to modify df in-place, use DataFrame.assign:

df2 = df.assign(result=df["result"].str.replace(r"D", ""))
df
# Unchanged

.str.extract

Useful for extracting the substring(s) you want to keep.

df["result"] = df["result"].str.extract(r"(d+)", expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.


.str.split and .str.get

Splitting works assuming all your strings follow this consistent structure.

# df["result"] = df["result"].str.split(r"D").str[1]
df["result"] = df["result"].str.split(r"D").str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

Do not recommend if you are looking for a general solution.


If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.


Optimizing: List Comprehensions

In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is because string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops with more overhead.

My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.

The str.replace option can be re-written using re.sub

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r"D")
df["result"] = [p.sub("", x) for x in df["result"]]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search,

p = re.compile(r"d+")
df["result"] = [p.search(x)[0] for x in df["result"]]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r"d+")
df["result"] = [try_extract(p, x) for x in df["result"]]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

We can also re-write @eumiro"s and @MonkeyButter"s answers using list comprehensions:

df["result"] = [x.lstrip("+-").rstrip("aAbBcC") for x in df["result"]]

And,

df["result"] = [x[1:-1] for x in df["result"]]

Same rules for handling NaNs, etc, apply.


Performance Comparison

enter image description here

Graphs generated using perfplot. Full code listing, for your reference. The relevant functions are listed below.

Some of these comparisons are unfair because they take advantage of the structure of OP"s data, but take from it what you will. One thing to note is that every list comprehension function is either faster or comparable than its equivalent pandas variant.

Functions

def eumiro(df):
    return df.assign(
        result=df["result"].map(lambda x: x.lstrip("+-").rstrip("aAbBcC")))

def coder375(df):
    return df.assign(
        result=df["result"].replace(r"D", r"", regex=True))

def monkeybutter(df):
    return df.assign(result=df["result"].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df["result"].str.lstrip("+-").str.rstrip("aAbBcC"))

def cs1(df):
    return df.assign(result=df["result"].str.replace(r"D", ""))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou"s. so timing together.
    return df.assign(result=df["result"].str.extract(r"(d+)", expand=False))

def cs1_listcomp(df):
    return df.assign(result=[p1.sub("", x) for x in df["result"]])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df["result"]])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip("+-").rstrip("aAbBcC") for x in df["result"]])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df["result"]])

Answer #5

Having a Text file content:

line 1
line 2
line 3

We can use this Python script in the same directory of the txt above

>>> with open("myfile.txt", encoding="utf-8") as file:
...     x = [l.rstrip("
") for l in file]
>>> x
["line 1","line 2","line 3"]

Using append:

x = []
with open("myfile.txt") as file:
    for l in file:
        x.append(l.strip())

Or:

>>> x = open("myfile.txt").read().splitlines()
>>> x
["line 1", "line 2", "line 3"]

Or:

>>> x = open("myfile.txt").readlines()
>>> x
["linea 1
", "line 2
", "line 3
"]

Or:

def print_output(lines_in_textfile):
    print("lines_in_textfile =", lines_in_textfile)

y = [x.rstrip() for x in open("001.txt")]
print_output(y)

with open("001.txt", "r", encoding="utf-8") as file:
    file = file.read().splitlines()
    print_output(file)

with open("001.txt", "r", encoding="utf-8") as file:
    file = [x.rstrip("
") for x in file]
    print_output(file)

output:

lines_in_textfile = ["line 1", "line 2", "line 3"]
lines_in_textfile = ["line 1", "line 2", "line 3"]
lines_in_textfile = ["line 1", "line 2", "line 3"]

Answer #6

To join all lines into a string and remove new lines, I normally use :

with open("t.txt") as f:
  s = " ".join([l.rstrip() for l in f]) 

Answer #7

This code will read the entire file into memory and remove all whitespace characters (newlines and spaces) from the end of each line:

with open(filename) as file:
    lines = file.readlines()
    lines = [line.rstrip() for line in lines]

If you"re working with a large file, then you should instead read it line-by-line:

with open(filename) as file:
    while (line := file.readline().rstrip()):
        print(line)

Depending on what you plan to do with your file and how it was encoded, you may also want to manually set the access mode and character encoding:

with open(filename, "r", encoding="UTF-8") as file:
    while (line := file.readline().rstrip()):
        print(line)

Answer #8

Try the method rstrip() (see doc Python 2 and Python 3)

>>> "test string
".rstrip()
"test string"

Python"s rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.

>>> "test string 
 


 

".rstrip()
"test string"

To strip only newlines:

>>> "test string 
 


 

".rstrip("
")
"test string 
 


 "

There are also the methods strip(), lstrip() and strip():

>>> s = "   

  
  abc   def 

  
  "
>>> s.strip()
"abc   def"
>>> s.lstrip()
"abc   def 

  
  "
>>> s.rstrip()
"   

  
  abc   def"

Answer #9

For whitespace on both sides use str.strip:

s = "  	 a string example	  "
s = s.strip()

For whitespace on the right side use rstrip:

s = s.rstrip()

For whitespace on the left side lstrip:

s = s.lstrip()

As thedz points out, you can provide an argument to strip arbitrary characters to any of these functions like this:

s = s.strip(" 	

")

This will strip any space, , , or characters from the left-hand side, right-hand side, or both sides of the string.

The examples above only remove strings from the left-hand and right-hand sides of strings. If you want to also remove characters from the middle of a string, try re.sub:

import re
print(re.sub("[s+]", "", s))

That should print out:

astringexample

Answer #10

See Input and Ouput:

with open("filename") as f:
    lines = f.readlines()

or with stripping the newline character:

with open("filename") as f:
    lines = [line.rstrip() for line in f]