# SciPy function stats.scoreatpercentile() | Python


The score at the 50th percentile is the median. If the required quantile lies between two data points, the result is interpolated between the two nearest data points according to the specified interpolation method.

Parameters:
arr: [array_like] input array.
per: [array_like] percentile(s) at which to compute the score.
limit: [tuple] lower and upper limits within which to compute the percentile.
axis: [int] axis along which to compute the score.

Returns: the score at the given percentile(s) of the array elements.

Code #1:

``````
# scoreatpercentile
from scipy import stats

# 1D array
arr = [20, 2, 7, 1, 7, 7, 34, 3]
print("arr:", arr)

print("Score at 50th percentile:", stats.scoreatpercentile(arr, 50))
print("Score at 90th percentile:", stats.scoreatpercentile(arr, 90))
print("Score at 10th percentile:", stats.scoreatpercentile(arr, 10))
print("Score at 100th percentile:", stats.scoreatpercentile(arr, 100))
print("Score at 30th percentile:", stats.scoreatpercentile(arr, 30))
``````

Output:

``````
arr: [20, 2, 7, 1, 7, 7, 34, 3]
Score at 50th percentile: 7.0
Score at 90th percentile: 24.2
Score at 10th percentile: 1.7
Score at 100th percentile: 34.0
Score at 30th percentile: 3.4
``````
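To see where a value like 24.2 comes from, here is a minimal sketch of the linear interpolation described above (pure Python, no SciPy needed):

```python
# hand-rolled linear interpolation matching scoreatpercentile(arr, 90)
arr = sorted([20, 2, 7, 1, 7, 7, 34, 3])  # [1, 2, 3, 7, 7, 7, 20, 34]
pos = 90 / 100 * (len(arr) - 1)           # 6.3: falls between indices 6 and 7
lo = int(pos)
frac = pos - lo
score = arr[lo] + frac * (arr[lo + 1] - arr[lo])  # 20 + 0.3 * (34 - 20)
print(score)  # ~24.2
```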

Code #2:

``````
# scoreatpercentile
from scipy import stats

# 2D array
arr = [[14, 17, 12, 33, 44],
       [15, 6, 27, 8, 19],
       [23, 2, 54, 1, 4]]
print("arr:", arr)

print("Score at 50th percentile:", stats.scoreatpercentile(arr, 50))
print("Score at 50th percentile:", stats.scoreatpercentile(arr, 50, axis=1))
print("Score at 50th percentile:", stats.scoreatpercentile(arr, 50, axis=0))
``````

Output:

``````
arr: [[14, 17, 12, 33, 44], [15, 6, 27, 8, 19], [23, 2, 54, 1, 4]]
Score at 50th percentile: 15.0
Score at 50th percentile: [17. 15.  4.]
Score at 50th percentile: [15.  6. 27.  8. 19.]
``````

## How do I calculate percentiles with python/numpy?

Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?

I am looking for something similar to Excel's percentile function.

I looked in NumPy's statistics reference and couldn't find this. All I could find is the median (50th percentile), but not something more specific.

I think you're almost there; try removing the extra square brackets around the `lst`'s (also, you don't need to specify the column names when you're creating a dataframe from a dict like this):

``````import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
{"lst1Title": lst1,
"lst2Title": lst2,
"lst3Title": lst3
})

percentile_list
lst1Title  lst2Title  lst3Title
0          0         0         0
1          1         1         1
2          2         2         2
3          3         3         3
4          4         4         4
5          5         5         5
6          6         6         6
...
``````

If you need a more performant solution you can use `np.column_stack` rather than `zip` as in your first attempt; this has around a 2x speedup on the example here, though it comes at a bit of a cost to readability in my opinion:

``````import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=["lst1Title", "lst2Title", "lst3Title"])
``````

To begin, note that quantile is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking `qcut` for quintiles.

So, when you ask for quintiles with `qcut`, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

``````pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6
``````

Conversely, for `cut` you will see something more uneven:

``````pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2
``````

That's because `cut` chooses bins that are evenly spaced according to the values themselves, not the frequency of those values. Hence, because you drew from a random normal, you'll see higher frequencies in the inner bins and fewer in the outer ones. This is essentially going to be a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).
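The contrast can be reproduced with a fixed seed (the seed and sample size here are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
factors = pd.Series(rng.normal(size=30))

quintiles = pd.qcut(factors, 5).value_counts()   # equal-frequency bins
equal_width = pd.cut(factors, 5).value_counts()  # equal-width bins

print(quintiles.tolist())    # [6, 6, 6, 6, 6]: same count in every bin
print(equal_width.tolist())  # uneven counts that still sum to 30
```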

If you want a histogram, you don't need to attach any "names" to the x-values, as on the x-axis you will have data bins:

``````import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

np.random.seed(42)
x = np.random.normal(size=1000)

plt.hist(x, density=True, bins=30)  # density=False would make counts
plt.ylabel("Probability")
plt.xlabel("Data");
``````

Note: the number of `bins=30` was chosen arbitrarily; the Freedman–Diaconis rule is a more scientific way of choosing the "right" bin width: bin_width = 2 * IQR * n^(-1/3), where `IQR` is the interquartile range and `n` is the total number of data points to plot.

So, according to this rule one may calculate number of `bins` as:

``````
q25, q75 = np.percentile(x, [25, 75])  # percentiles are given on a 0-100 scale
bin_width = 2 * (q75 - q25) * len(x) ** (-1 / 3)
bins = round((x.max() - x.min()) / bin_width)
print("Freedman–Diaconis number of bins:", bins)
plt.hist(x, bins=bins);
``````

``````
Freedman–Diaconis number of bins: 82
``````

And finally you can make your histogram a bit fancier with a `PDF` line, titles, and a legend:

``````import scipy.stats as st

plt.hist(x, density=True, bins=82, label="Data")
mn, mx = plt.xlim()
plt.xlim(mn, mx)
kde_xs = np.linspace(mn, mx, 300)
kde = st.gaussian_kde(x)
plt.plot(kde_xs, kde.pdf(kde_xs), label="PDF")
plt.legend(loc="upper left")
plt.ylabel("Probability")
plt.xlabel("Data")
plt.title("Histogram");
``````

However, if you have a limited number of data points, as in the OP, a bar plot would make more sense to represent your data. Then you may attach labels to the x-axis:

``````x = np.arange(3)
plt.bar(x, height=[1,2,3])
plt.xticks(x, ["a","b","c"])
``````

You might be interested in the SciPy stats package. It has the percentile function you're after and many other statistical goodies.

`percentile()` is available in `numpy` too.

``````
import numpy as np
a = np.array([1, 2, 3, 4, 5])
p = np.percentile(a, 50)  # return 50th percentile, i.e. the median
print(p)
# 3.0
``````
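Like `scoreatpercentile`, `np.percentile` also accepts an `axis` argument; a small sketch (the array here is made up for illustration):

```python
import numpy as np

a = np.array([[10, 7, 4],
              [3, 2, 1]])
print(np.percentile(a, 50))          # 3.5: median of the flattened array
print(np.percentile(a, 50, axis=0))  # [6.5 4.5 2.5]: per-column medians
print(np.percentile(a, 50, axis=1))  # [7. 2.]: per-row medians
```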

This ticket leads me to believe they won't be integrating `percentile()` into numpy anytime soon.


You have four main options for converting types in pandas:

1. `to_numeric()` - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also `to_datetime()` and `to_timedelta()`.)

2. `astype()` - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

3. `infer_objects()` - a utility method to convert object columns holding Python objects to a pandas type if possible.

4. `convert_dtypes()` - convert DataFrame columns to the "best possible" dtype that supports `pd.NA` (pandas' object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.

# 1. `to_numeric()`

The best way to convert one or more columns of a DataFrame to numeric values is to use `pandas.to_numeric()`.

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

## Basic usage

The input to `to_numeric()` is a Series or a single column of a DataFrame.

``````>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64
``````

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

``````# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
``````

You can also use it to convert multiple columns of a DataFrame via the `apply()` method:

``````# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
``````

As long as your values can all be converted, that's probably all you need.

## Error handling

But what if some values can't be converted to a numeric type?

`to_numeric()` also takes an `errors` keyword argument that allows you to force non-numeric values to be `NaN`, or simply ignore columns containing these values.

Here's an example using a Series of strings `s` which has the object dtype:

``````>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
``````

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string "pandas":

``````>>> pd.to_numeric(s) # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string
``````

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to `NaN` as follows using the `errors` keyword argument:

``````>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
``````

The third option for `errors` is just to ignore the operation if an invalid value is encountered:

``````>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched
``````

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of your columns can be converted reliably to a numeric type. In that case, just write:

``````df.apply(pd.to_numeric, errors="ignore")
``````

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.

## Downcasting

By default, conversion with `to_numeric()` will give you either an `int64` or `float64` dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like `float32` or `int8`?

`to_numeric()` gives you the option to downcast to "integer", "signed", "unsigned", or "float". Here's an example for a simple series `s` of integer type:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

Downcasting to "integer" uses the smallest possible integer that can hold the values:

``````>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8
``````

Downcasting to "float" similarly picks a smaller than normal floating type:

``````>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32
``````

# 2. `astype()`

The `astype()` method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try to go from any one type to any other.

## Basic usage

Just pick a type: you can use a NumPy dtype (e.g. `np.int16`), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and `astype()` will try and convert it for you:

``````# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")
``````

Notice I said "try" - if `astype()` does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a `NaN` or `inf` value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing `errors="ignore"`. Your original object will be returned untouched.
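A minimal sketch of that behaviour (the Series values here are made up for illustration):

```python
import pandas as pd

s = pd.Series(["1", "2", "pandas"])
# "pandas" cannot become a float; errors="ignore" suppresses the
# ValueError and hands back the original object unchanged
result = s.astype(float, errors="ignore")
print(result.dtype)  # object
```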

## Be careful

`astype()` is powerful, but it will sometimes convert values "incorrectly". For example:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

``````>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8
``````

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!

Trying to downcast using `pd.to_numeric(s, downcast="unsigned")` instead could help prevent this error.
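A sketch of why the `to_numeric()` route is safer here: downcasting is only applied when it can be done without changing the values, so the negative entry simply blocks the unsigned conversion.

```python
import pandas as pd

s = pd.Series([1, 2, -7])
# downcast="unsigned" is a no-op when any value cannot be represented
# in an unsigned type, so no silent wraparound occurs
safe = pd.to_numeric(s, downcast="unsigned")
print(safe.dtype)  # int64
print(safe.tolist())  # [1, 2, -7]
```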

# 3. `infer_objects()`

Version 0.21.0 of pandas introduced the method `infer_objects()` for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

``````>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object
``````

Using `infer_objects()`, you can change the type of column "a" to int64:

``````>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object
``````

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use `df.astype(int)` instead.

# 4. `convert_dtypes()`

Version 1.0 and above includes a method `convert_dtypes()` to convert Series and DataFrame columns to the best possible dtype that supports the `pd.NA` missing value.

Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to `Int64`, and a column of NumPy `int32` values will become the pandas dtype `Int32`.

With our `object` DataFrame `df`, we get the following result:

``````>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object
``````

Since column "a" held integer values, it was converted to the `Int64` type (which is capable of holding missing values, unlike `int64`).

Column "b" contained string objects, so was changed to pandas' `string` dtype.

By default, this method will infer the type from object values in each column. We can change this by passing `infer_objects=False`:

``````>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object
``````

Now column "a" remains an object column: pandas knows it can be described as an "integer" column (internally it ran `infer_dtype`), but didn't infer exactly which integer dtype it should have, so did not convert it. Column "b" was again converted to the "string" dtype, as it was recognised as holding "string" values.

## Placing the legend (`bbox_to_anchor`)

A legend is positioned inside the bounding box of the axes using the `loc` argument to `plt.legend`.
E.g. `loc="upper right"` places the legend in the upper right corner of the bounding box, which by default extends from `(0,0)` to `(1,1)` in axes coordinates (or in bounding box notation `(x0, y0, width, height) = (0, 0, 1, 1)`).

To place the legend outside of the axes bounding box, one may specify a tuple `(x0,y0)` of axes coordinates of the lower left corner of the legend.

``````plt.legend(loc=(1.04,0))
``````

A more versatile approach is to manually specify the bounding box into which the legend should be placed, using the `bbox_to_anchor` argument. One can restrict oneself to supply only the `(x0, y0)` part of the bbox. This creates a zero span box, out of which the legend will expand in the direction given by the `loc` argument. E.g.

`plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")`

places the legend outside the axes, such that the upper left corner of the legend is at position `(1.04,1)` in axes coordinates.

Further examples are given below, where additionally the interplay between different arguments like `mode` and `ncol` is shown.

``````
l1 = plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
l2 = plt.legend(bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
l3 = plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
l4 = plt.legend(bbox_to_anchor=(0,1.02,1,0.2), loc="lower left",
                mode="expand", borderaxespad=0, ncol=3)
l5 = plt.legend(bbox_to_anchor=(1,0), loc="lower right",
                bbox_transform=fig.transFigure, ncol=3)
l6 = plt.legend(bbox_to_anchor=(0.4,0.8), loc="upper right")
``````

Details about how to interpret the 4-tuple argument to `bbox_to_anchor`, as in `l4`, can be found in this question. The `mode="expand"` expands the legend horizontally inside the bounding box given by the 4-tuple. For a vertically expanded legend, see this question.

Sometimes it may be useful to specify the bounding box in figure coordinates instead of axes coordinates. This is shown in the example `l5` from above, where the `bbox_transform` argument is used to put the legend in the lower left corner of the figure.

### Postprocessing

Having placed the legend outside the axes often leads to the undesired situation that it is completely or partially outside the figure canvas.

Solutions to this problem are:

One can adjust the subplot parameters such, that the axes take less space inside the figure (and thereby leave more space to the legend) by using `plt.subplots_adjust`. E.g.

``````  plt.subplots_adjust(right=0.7)
``````

leaves 30% space on the right-hand side of the figure, where one could place the legend.

• Tight layout
Using `plt.tight_layout` allows you to automatically adjust the subplot parameters so that the elements in the figure sit tight against the figure edges. Unfortunately, the legend is not taken into account by this automatism, but we can supply a rectangle box that the whole subplot area (including labels) will fit into.

``````  plt.tight_layout(rect=[0,0,0.75,1])
``````
• Saving the figure with `bbox_inches = "tight"`
The argument `bbox_inches="tight"` to `plt.savefig` can be used to save the figure such that all artists on the canvas (including the legend) fit into the saved area. If needed, the figure size is automatically adjusted.

``````  plt.savefig("output.png", bbox_inches="tight")
``````
• Automatically adjusting the subplot params
A way to automatically adjust the subplot position such that the legend fits inside the canvas without changing the figure size can be found in this answer: Creating figure with exact size and no padding (and legend outside the axes)

Comparison between the cases discussed above:

## Alternatives

A figure legend

One may use a figure legend instead of an axes legend: `matplotlib.figure.Figure.legend`. This has become especially useful with matplotlib version >= 2.1, where no special arguments are needed

``````fig.legend(loc=7)
``````

to create a legend for all artists in the different axes of the figure. The legend is placed using the `loc` argument, similar to how it is placed inside an axes, but in reference to the whole figure - hence it will be outside the axes somewhat automatically. What remains is to adjust the subplots such that there is no overlap between the legend and the axes. Here the point "Adjust the subplot parameters" from above will be helpful. An example:

``````import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi)
colors = ["#7aa0c4", "#ca82e1", "#8bcd50", "#e18882"]
fig, axes = plt.subplots(ncols=2)
for i in range(4):
    axes[i//2].plot(x, np.sin(x+i), color=colors[i], label="y=sin(x+{})".format(i))

fig.legend(loc=7)
fig.tight_layout()
plt.show()
``````

Legend inside dedicated subplot axes

An alternative to using `bbox_to_anchor` would be to place the legend in its own dedicated subplot axes (`lax`). Since the legend subplot should be smaller than the plot, we may use `gridspec_kw={"width_ratios":[4,1]}` at axes creation. We can hide the axes with `lax.axis("off")` but still put a legend in. The legend handles and labels need to be obtained from the real plot via `h,l = ax.get_legend_handles_labels()`, and can then be supplied to the legend in the `lax` subplot, `lax.legend(h,l)`. A complete example is below.

``````import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = 6,2

fig, (ax,lax) = plt.subplots(ncols=2, gridspec_kw={"width_ratios":[4,1]})
ax.plot(x,y, label="y=sin(x)")
....

h, l = ax.get_legend_handles_labels()
lax.legend(h, l)
lax.axis("off")

plt.tight_layout()
plt.show()
``````

This produces a plot which is visually pretty similar to the plot from above.

We could also use the first axes to place the legend, but use the `bbox_transform` of the legend axes:

``````ax.legend(bbox_to_anchor=(0,0,1,1), bbox_transform=lax.transAxes)
lax.axis("off")
``````

In this approach, we do not need to obtain the legend handles externally, but we need to specify the `bbox_to_anchor` argument.

• Consider the matplotlib legend guide with some examples of other stuff you want to do with legends.
• Some example code for placing legends for pie charts may directly be found in answer to this question: Python - Legend overlaps with the pie chart
• The `loc` argument can take numbers instead of strings, which makes calls shorter; however, they are not very intuitively mapped to each other.

To somewhat expand on the earlier answers here, there are a number of details which are commonly overlooked.

• Prefer `subprocess.run()` over `subprocess.check_call()` and friends over `subprocess.call()` over `subprocess.Popen()` over `os.system()` over `os.popen()`
• Understand and probably use `text=True`, aka `universal_newlines=True`.
• Understand the meaning of `shell=True` or `shell=False` and how it changes quoting and the availability of shell conveniences.
• Understand differences between `sh` and Bash
• Understand how a subprocess is separate from its parent, and generally cannot change the parent.
• Avoid running the Python interpreter as a subprocess of Python.

These topics are covered in some more detail below.

# Prefer `subprocess.run()` or `subprocess.check_call()`

The `subprocess.Popen()` function is a low-level workhorse but it is tricky to use correctly and you end up copy/pasting multiple lines of code ... which conveniently already exist in the standard library as a set of higher-level wrapper functions for various purposes, which are presented in more detail in the following.

Here's a paragraph from the documentation:

The recommended approach to invoking subprocesses is to use the `run()` function for all use cases it can handle. For more advanced use cases, the underlying `Popen` interface can be used directly.

Unfortunately, the availability of these wrapper functions differs between Python versions.

• `subprocess.run()` was officially introduced in Python 3.5. It is meant to replace all of the following.
• `subprocess.check_output()` was introduced in Python 2.7 / 3.1. It is basically equivalent to `subprocess.run(..., check=True, stdout=subprocess.PIPE).stdout`
• `subprocess.check_call()` was introduced in Python 2.5. It is basically equivalent to `subprocess.run(..., check=True)`
• `subprocess.call()` was introduced in Python 2.4 in the original `subprocess` module (PEP-324). It is basically equivalent to `subprocess.run(...).returncode`
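The equivalences above can be sketched concretely; a minimal example of the `check_output()` form, assuming a POSIX `echo` utility is on the `PATH`:

```python
import subprocess

# subprocess.check_output(...) is roughly
# subprocess.run(..., check=True, stdout=subprocess.PIPE).stdout
out = subprocess.run(["echo", "hello"],
                     check=True, stdout=subprocess.PIPE, text=True).stdout
print(out)  # "hello" followed by a newline
```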

### High-level API vs `subprocess.Popen()`

The refactored and extended `subprocess.run()` is more logical and more versatile than the older legacy functions it replaces. It returns a `CompletedProcess` object which has various methods which allow you to retrieve the exit status, the standard output, and a few other results and status indicators from the finished subprocess.

`subprocess.run()` is the way to go if you simply need a program to run and return control to Python. For more involved scenarios (background processes, perhaps with interactive I/O with the Python parent program) you still need to use `subprocess.Popen()` and take care of all the plumbing yourself. This requires a fairly intricate understanding of all the moving parts and should not be undertaken lightly. The simpler `Popen` object represents the (possibly still-running) process which needs to be managed from your code for the remainder of the lifetime of the subprocess.

It should perhaps be emphasized that `subprocess.Popen()` merely creates a process. If you leave it at that, you have a subprocess running concurrently alongside Python: a "background" process. If it doesn't need to do input or output or otherwise coordinate with you, it can do useful work in parallel with your Python program.
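A sketch of that distinction, assuming a POSIX `sleep` utility:

```python
import subprocess

# Popen() returns immediately; the child keeps running in the background
proc = subprocess.Popen(["sleep", "0.1"])
print(proc.poll())      # None while the child is still running
proc.wait()             # block until it finishes
print(proc.returncode)  # 0
```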

### Avoid `os.system()` and `os.popen()`

Since time eternal (well, since Python 2.5) the `os` module documentation has contained the recommendation to prefer `subprocess` over `os.system()`:

The `subprocess` module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.

The problems with `system()` are that it's obviously system-dependent and doesn't offer ways to interact with the subprocess. It simply runs, with standard output and standard error outside of Python's reach. The only information Python receives back is the exit status of the command (zero means success, though the meaning of non-zero values is also somewhat system-dependent).

PEP-324 (which was already mentioned above) contains a more detailed rationale for why `os.system` is problematic and how `subprocess` attempts to solve those issues.

`os.popen()` used to be even more strongly discouraged:

Deprecated since version 2.6: This function is obsolete. Use the `subprocess` module.

However, since sometime in Python 3, it has been reimplemented to simply use `subprocess`, and redirects to the `subprocess.Popen()` documentation for details.

### Understand and usually use `check=True`

You"ll also notice that `subprocess.call()` has many of the same limitations as `os.system()`. In regular use, you should generally check whether the process finished successfully, which `subprocess.check_call()` and `subprocess.check_output()` do (where the latter also returns the standard output of the finished subprocess). Similarly, you should usually use `check=True` with `subprocess.run()` unless you specifically need to allow the subprocess to return an error status.

In practice, with `check=True` or `subprocess.check_*`, Python will throw a `CalledProcessError` exception if the subprocess returns a nonzero exit status.

A common error with `subprocess.run()` is to omit `check=True` and be surprised when downstream code fails if the subprocess failed.

On the other hand, a common problem with `check_call()` and `check_output()` was that users who blindly used these functions were surprised when the exception was raised e.g. when `grep` did not find a match. (You should probably replace `grep` with native Python code anyway, as outlined below.)

All things counted, you need to understand how shell commands return an exit code, and under what conditions they will return a non-zero (error) exit code, and make a conscious decision how exactly it should be handled.
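A minimal sketch of the `check=True` behaviour, using the POSIX `false` utility, which always exits with a nonzero status:

```python
import subprocess

# check=True turns a nonzero exit status into an exception
try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError as exc:
    code = exc.returncode
    print(code)  # 1
```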

# Understand and probably use `text=True` aka `universal_newlines=True`

Since Python 3, strings internal to Python are Unicode strings. But there is no guarantee that a subprocess generates Unicode output, or strings at all.

(If the differences are not immediately obvious, Ned Batchelder's Pragmatic Unicode is recommended, if not outright obligatory, reading. There is a 36-minute video presentation behind the link if you prefer, though reading the page yourself will probably take significantly less time.)

Deep down, Python has to fetch a `bytes` buffer and interpret it somehow. If it contains a blob of binary data, it shouldn't be decoded into a Unicode string, because that's error-prone and bug-inducing behavior - precisely the sort of pesky behavior which riddled many Python 2 scripts, before there was a way to properly distinguish between encoded text and binary data.

With `text=True`, you tell Python that you, in fact, expect back textual data in the system"s default encoding, and that it should be decoded into a Python (Unicode) string to the best of Python"s ability (usually UTF-8 on any moderately up to date system, except perhaps Windows?)

If that's not what you requested, Python will just give you `bytes` strings in `stdout` and `stderr`. Maybe at some later point you do know that they were text strings after all, and you know their encoding. Then, you can decode them.

``````normal = subprocess.run([external, arg],
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
check=True,
text=True)
print(normal.stdout)

convoluted = subprocess.run([external, arg],
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
check=True)
# You have to know (or guess) the encoding
print(convoluted.stdout.decode("utf-8"))
``````

Python 3.7 introduced the shorter and more descriptive and understandable alias `text` for the keyword argument which was previously somewhat misleadingly called `universal_newlines`.

# Understand `shell=True` vs `shell=False`

With `shell=True` you pass a single string to your shell, and the shell takes it from there.

With `shell=False` you pass a list of arguments to the OS, bypassing the shell.

When you don"t have a shell, you save a process and get rid of a fairly substantial amount of hidden complexity, which may or may not harbor bugs or even security problems.

On the other hand, when you don"t have a shell, you don"t have redirection, wildcard expansion, job control, and a large number of other shell features.

A common mistake is to use `shell=True` and then still pass Python a list of tokens, or vice versa. This happens to work in some cases, but is really ill-defined and could break in interesting ways.

```
# XXX AVOID THIS BUG
buggy = subprocess.run("dig +short stackoverflow.com")

# XXX AVOID THIS BUG TOO
broken = subprocess.run(["dig", "+short", "stackoverflow.com"],
                        shell=True)

# XXX DEFINITELY AVOID THIS
pathological = subprocess.run(["dig +short stackoverflow.com"],
                              shell=True)

correct = subprocess.run(["dig", "+short", "stackoverflow.com"],
                         # Probably don't forget these, too
                         check=True, text=True)

# XXX Probably better avoid shell=True
# but this is nominally correct
fixed_but_fugly = subprocess.run("dig +short stackoverflow.com",
                                 shell=True,
                                 # Probably don't forget these, too
                                 check=True, text=True)
```

The common retort "but it works for me" is not a useful rebuttal unless you understand exactly under what circumstances it could stop working.

### Refactoring Example

Very often, the features of the shell can be replaced with native Python code. Simple Awk or `sed` scripts should probably simply be translated to Python instead.

To partially illustrate this, here is a typical but slightly silly example which involves many shell features.

``````cmd = """while read -r x;
do ping -c 3 "\$x" | grep "round-trip min/avg/max"
done <hosts.txt"""

# Trivial but horrible
results = subprocess.run(
cmd, shell=True, universal_newlines=True, check=True)
print(results.stdout)

# Reimplement with shell=False
with open("hosts.txt") as hosts:
for host in hosts:
host = host.rstrip("
")  # drop newline
ping = subprocess.run(
["ping", "-c", "3", host],
text=True,
stdout=subprocess.PIPE,
check=True)
for line in ping.stdout.split("
"):
if "round-trip min/avg/max" in line:
print("{}: {}".format(host, line))
``````

Some things to note here:

• With `shell=False` you don't need the quoting that the shell requires around strings. Putting quotes anyway is probably an error.
• It often makes sense to run as little code as possible in a subprocess. This gives you more control over execution from within your Python code.
• Having said that, complex shell pipelines are tedious and sometimes challenging to reimplement in Python.

The refactored code also illustrates just how much the shell really does for you with a very terse syntax -- for better or for worse. Python says explicit is better than implicit, but the Python code is rather verbose and arguably looks more complex than it really is. On the other hand, it offers a number of points where you can grab control in the middle of something else, as trivially exemplified by the enhancement that we can easily include the host name along with the shell command output. (This is by no means challenging to do in the shell, either, but at the expense of yet another diversion and perhaps another process.)

### Common Shell Constructs

For completeness, here are brief explanations of some of these shell features, and some notes on how they can perhaps be replaced with native Python facilities.

• Globbing aka wildcard expansion can be replaced with `glob.glob()` or very often with simple Python string comparisons like `for file in os.listdir("."): if not file.endswith(".png"): continue`. Bash has various other expansion facilities like `.{png,jpg}` brace expansion and `{1..100}` as well as tilde expansion (`~` expands to your home directory, and more generally `~account` to the home directory of another user)
• Shell variables like `$SHELL` or `$my_exported_var` can sometimes simply be replaced with Python variables. Exported shell variables are available as e.g. `os.environ["SHELL"]` (the meaning of `export` is to make the variable available to subprocesses -- a variable which is not available to subprocesses will obviously not be available to Python running as a subprocess of the shell, or vice versa. The `env=` keyword argument to `subprocess` methods allows you to define the environment of the subprocess as a dictionary, so that's one way to make a Python variable visible to a subprocess). With `shell=False` you will need to understand how to remove any quotes; for example, `cd "$HOME"` is equivalent to `os.chdir(os.environ["HOME"])` without quotes around the directory name. (Very often `cd` is not useful or necessary anyway, and many beginners omit the double quotes around the variable and get away with it until one day ...)
• Redirection allows you to read from a file as your standard input, and write your standard output to a file. `grep "foo" <inputfile >outputfile` opens `outputfile` for writing and `inputfile` for reading, and passes its contents as standard input to `grep`, whose standard output then lands in `outputfile`. This is not generally hard to replace with native Python code.
• Pipelines are a form of redirection. `echo foo | nl` runs two subprocesses, where the standard output of `echo` is the standard input of `nl` (on the OS level, in Unix-like systems, this is a single file handle). If you cannot replace one or both ends of the pipeline with native Python code, perhaps think about using a shell after all, especially if the pipeline has more than two or three processes (though look at the `pipes` module in the Python standard library or a number of more modern and versatile third-party competitors).
• Job control lets you interrupt jobs, run them in the background, return them to the foreground, etc. The basic Unix signals to stop and continue a process are of course available from Python, too. But jobs are a higher-level abstraction in the shell which involve process groups etc which you have to understand if you want to do something like this from Python.
• Quoting in the shell is potentially confusing until you understand that everything is basically a string. So `ls -l /` is equivalent to `"ls" "-l" "/"` but the quoting around literals is completely optional. Unquoted strings which contain shell metacharacters undergo parameter expansion, whitespace tokenization and wildcard expansion; double quotes prevent whitespace tokenization and wildcard expansion but allow parameter expansions (variable substitution, command substitution, and backslash processing). This is simple in theory but can get bewildering, especially when there are several layers of interpretation (a remote shell command, for example).
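
To make a few of these concrete, here is a minimal sketch of shell constructs replaced with native Python; the file names (`*.png`, `in.txt`, `out.txt`) are hypothetical examples, not part of the original answer:

```python
import glob
import os

# Globbing: shell `ls *.png` becomes a native call
png_files = glob.glob("*.png")  # list of matching names, possibly empty

# Variables: shell `echo "$HOME"` becomes a dictionary lookup
home = os.environ.get("HOME", "")

# Redirection: shell `grep foo <in.txt >out.txt` becomes plain file I/O
# (commented out because in.txt is a hypothetical file)
# with open("in.txt") as src, open("out.txt", "w") as dst:
#     dst.writelines(line for line in src if "foo" in line)

print(type(png_files), type(home))
```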

# Understand differences between `sh` and Bash

`subprocess` runs your shell commands with `/bin/sh` unless you specifically request otherwise (except of course on Windows, where it uses the value of the `COMSPEC` variable). This means that various Bash-only features like arrays, `[[` etc are not available.

If you need to use Bash-only syntax, you can pass in the path to the shell as `executable="/bin/bash"` (where of course if your Bash is installed somewhere else, you need to adjust the path).

``````subprocess.run("""
# This for loop syntax is Bash only
for((i=1;i<=\$#;i++)); do
# Arrays are Bash-only
array[i]+=123
done""",
shell=True, check=True,
executable="/bin/bash")
``````

# A `subprocess` is separate from its parent, and cannot change it

A somewhat common mistake is doing something like

``````subprocess.run("cd /tmp", shell=True)
subprocess.run("pwd", shell=True)  # Oops, doesn"t print /tmp
``````

The same thing will happen if the first subprocess tries to set an environment variable, which of course will have disappeared when you run another subprocess, etc.

A child process runs completely separate from Python, and when it finishes, Python has no idea what it did (apart from the vague indicators that it can infer from the exit status and output from the child process). A child generally cannot change the parent's environment; it cannot set a variable, change the working directory, or, in so many words, communicate with its parent without cooperation from the parent.

The immediate fix in this particular case is to run both commands in a single subprocess:

``````subprocess.run("cd /tmp; pwd", shell=True)
``````

though obviously this particular use case isn't very useful; instead, use the `cwd` keyword argument, or simply `os.chdir()` before running the subprocess. Similarly, for setting a variable, you can manipulate the environment of the current process (and thus also its children) via

``````os.environ["foo"] = "bar"
``````

or pass an environment setting to a child process with

``````subprocess.run("echo "\$foo"", shell=True, env={"foo": "bar"})
``````

(not to mention the obvious refactoring `subprocess.run(["echo", "bar"])`; but `echo` is a poor example of something to run in a subprocess in the first place, of course).

# Don"t run Python from Python

This is slightly dubious advice; there are certainly situations where it does make sense or is even an absolute requirement to run the Python interpreter as a subprocess from a Python script. But very frequently, the correct approach is simply to `import` the other Python module into your calling script and call its functions directly.

If the other Python script is under your control, and it isn"t a module, consider turning it into one. (This answer is too long already so I will not delve into details here.)

If you need parallelism, you can run Python functions in subprocesses with the `multiprocessing` module. There is also `threading`, which runs multiple tasks in a single process (more lightweight, and giving you more control, but also more constrained in that threads within a process are tightly coupled and bound by a single GIL).
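
For instance, here is a minimal `multiprocessing` sketch (the `square` function is a made-up placeholder) that runs a plain Python function in worker processes instead of spawning `python` subprocesses:

```python
from multiprocessing import Pool

def square(n):
    # Ordinary function; multiprocessing runs it in each worker process
    return n * n

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        print(pool.map(square, range(4)))  # [0, 1, 4, 9]
```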

TL;DR

```
def square_list(n):
    the_list = []                         # Replace
    for x in range(n):
        y = x * x
        the_list.append(y)                # these
    return the_list                       # lines
```

do this:

```
def square_yield(n):
    for x in range(n):
        y = x * x
        yield y                           # with this one.
```

Whenever you find yourself building a list from scratch, `yield` each piece instead.

This was my first "aha" moment with yield.

`yield` is a sugary way to say

build a series of stuff

Same behavior:

```
>>> for square in square_list(4):
...     print(square)
...
0
1
4
9
>>> for square in square_yield(4):
...     print(square)
...
0
1
4
9
```

Different behavior:

Yield is single-pass: you can only iterate through it once. When a function has a `yield` in it, we call it a generator function, and what it returns is an iterator. Those terms are revealing. We lose the convenience of a container, but gain the power of a series that's computed as needed and arbitrarily long.

Yield is lazy; it puts off computation. A function with a `yield` in it doesn't actually execute at all when you call it. It returns an iterator object that remembers where it left off. Each time you call `next()` on the iterator (as a for-loop does), execution inches forward to the next `yield`. `return` raises `StopIteration` and ends the series (this is the natural end of a for-loop).
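
You can see the laziness directly: calling the generator function returns immediately without running any of its body. The `countdown` function here is just an illustrative toy:

```python
def countdown(n):
    print("body started")   # only runs once iteration begins
    while n > 0:
        yield n
        n -= 1

it = countdown(3)           # nothing printed yet; we only got an iterator
print(next(it))             # prints "body started", then 3
print(next(it))             # 2
```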

Yield is versatile. Data doesn't have to be stored all together; it can be made available one item at a time. And it can be infinite.

```
>>> def squares_all_of_them():
...     x = 0
...     while True:
...         yield x * x
...         x += 1
...
>>> squares = squares_all_of_them()
>>> for _ in range(4):
...     print(next(squares))
...
0
1
4
9
```

If you need multiple passes and the series isn't too long, just call `list()` on it:

```
>>> list(square_yield(4))
[0, 1, 4, 9]
```

Brilliant choice of the word `yield` because both meanings apply:

yield — produce or provide (as in agriculture)

...provide the next data in the series.

yield — give way or relinquish (as in political power)

...relinquish CPU execution until the iterator advances.

To begin, note that quantile is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking `qcut` for quintiles.

So, when you ask for quintiles with `qcut`, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

```
pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6
```

Conversely, for `cut` you will see something more uneven:

```
pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2
```

That"s because `cut` will choose the bins to be evenly spaced according to the values themselves and not the frequency of those values. Hence, because you drew from a random normal, you"ll see higher frequencies in the inner bins and fewer in the outer. This is essentially going to be a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).

If you want a histogram, you don't need to attach any "names" to the x-values, since the x-axis will show the data bins:

```
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

np.random.seed(42)
x = np.random.normal(size=1000)

plt.hist(x, density=True, bins=30)  # density=False would make counts
plt.ylabel("Probability")
plt.xlabel("Data");
```

Note that the number of `bins=30` was chosen arbitrarily; the Freedman–Diaconis rule offers a more principled choice of bin width: `h = 2 * IQR * n^(-1/3)`, where `IQR` is the interquartile range and `n` is the total number of data points to plot.

So, according to this rule, one may calculate the number of `bins` as:

```
q25, q75 = np.percentile(x, [25, 75])  # quartiles: percentages, not fractions
bin_width = 2 * (q75 - q25) * len(x) ** (-1/3)
bins = round((x.max() - x.min()) / bin_width)
print("Freedman-Diaconis number of bins:", bins)
plt.hist(x, bins=bins);
```

(Note that `np.percentile` expects percentages in the 0-100 range; the original snippet passed `[.25, .75]`, which takes two points in the far left tail instead of the quartiles and badly overestimates the bin count. With the quartiles corrected, the rule yields roughly two dozen bins for this data.)

And finally you can make your histogram a bit fancier with a `PDF` line, titles, and a legend:

```
import scipy.stats as st

plt.hist(x, density=True, bins=bins, label="Data")  # bins from the rule above
mn, mx = plt.xlim()
plt.xlim(mn, mx)
kde_xs = np.linspace(mn, mx, 300)
kde = st.gaussian_kde(x)
plt.plot(kde_xs, kde.pdf(kde_xs), label="PDF")
plt.legend(loc="upper left")
plt.ylabel("Probability")
plt.xlabel("Data")
plt.title("Histogram");
```

However, if you have a limited number of data points, as in the OP, a bar plot makes more sense to represent your data. Then you may attach labels to the x-axis:

```
x = np.arange(3)
plt.bar(x, height=[1, 2, 3])
plt.xticks(x, ["a", "b", "c"])
```

To save some folks some time, here is a list I extracted from a small corpus. I do not know if it is complete, but it should have most (if not all) of the help definitions from `upenn_tagset`...

CC: conjunction, coordinating

```
& 'n and both but either et for less minus neither nor or plus so
therefore times v. versus vs. whether yet
```

CD: numeral, cardinal

```
mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
fifteen 271,124 dozen quintillion DM2,000 ...
```

DT: determiner

```
all an another any both del each either every half la many much nary
neither no some such that the them these this those
```

EX: existential there

```
there
```

IN: preposition or conjunction, subordinating

```
astride among upon whether out inside pro despite on by throughout
below within for towards near behind atop around if like until below
next into if beside ...
```

JJ: adjective or numeral, ordinal

```
third ill-mannered pre-war regrettable oiled calamitous first separable
ectoplasmic battery-powered participatory fourth still-to-be-named
multilingual multi-disciplinary ...
```

JJR: adjective, comparative

```
bleaker braver breezier briefer brighter brisker broader bumper busier
calmer cheaper choosier cleaner clearer closer colder commoner costlier
cozier creamier crunchier cuter ...
```

JJS: adjective, superlative

```
calmest cheapest choicest classiest cleanest clearest closest commonest
corniest costliest crassest creepiest crudest cutest darkest deadliest
dearest deepest densest dinkiest ...
```

LS: list item marker

```
A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
SP-44007 Second Third Three Two * a b c d first five four one six three
two
```

MD: modal auxiliary

```
can cannot could couldn't dare may might must need ought shall should
shouldn't will would
```

NN: noun, common, singular or mass

```
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
```

NNP: noun, proper, singular

```
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool ...
```

NNS: noun, common, plural

```
undergraduates scotches bric-a-brac products bodyguards facets coasts
divestitures storehouses designs clubs fragrances averages
subjectivists apprehensions muses factory-jobs ...
```

PDT: pre-determiner

```
all both half many quite such sure this
```

POS: genitive marker

```
' 's
```

PRP: pronoun, personal

```
hers herself him himself hisself it itself me myself one oneself ours
ourselves ownself self she thee theirs them themselves they thou thy us
```

PRP$: pronoun, possessive

```
her his mine my our ours their thy your
```

RB: adverb

```
occasionally unabatingly maddeningly adventurously professedly
stirringly prominently technologically magisterially predominately
swiftly fiscally pitilessly ...
```

RBR: adverb, comparative

```
further gloomier grander graver greater grimmer harder harsher
healthier heavier higher however larger later leaner lengthier less-
perfectly lesser lonelier longer louder lower more ...
```

RBS: adverb, superlative

```
best biggest bluntest earliest farthest first furthest hardest
heartiest highest largest least less most nearest second tightest worst
```

RP: particle

```
aboard about across along apart around aside at away back before behind
by crop down ever fast for forth from go high i.e. in into just later
low more off on open out over per pie raising start teeth that through
under unto up up-pp upon whole with you
```

TO: "to" as preposition or infinitive marker

```
to
```

UH: interjection

```
Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
man baby diddle hush sonuvabitch ...
```

VB: verb, base form

```
ask assemble assess assign assume atone attention avoid bake balkanize
bank begin behold believe bend benefit bevel beware bless boil bomb
boost brace break bring broil brush build ...
```

VBD: verb, past tense

```
dipped pleaded swiped regummed soaked tidied convened halted registered
cushioned exacted snubbed strode aimed adopted belied figgered
speculated wore appreciated contemplated ...
```

VBG: verb, present participle or gerund

```
telegraphing stirring focusing angering judging stalling lactating
hankerin' alleging veering capping approaching traveling besieging
encrypting interrupting erasing wincing ...
```

VBN: verb, past participle

```
multihulled dilapidated aerosolized chaired languished panelized used
experimented flourished imitated reunifed factored condensed sheared
unsettled primed dubbed desired ...
```

VBP: verb, present tense, not 3rd person singular

```
predominate wrap resort sue twist spill cure lengthen brush terminate
appear tend stray glisten obtain comprise detest tease attract
emphasize mold postpone sever return wag ...
```

VBZ: verb, present tense, 3rd person singular

```
bases reconstructs marks mixes displeases seals carps weaves snatches
slumps stretches authorizes smolders pictures emerges stockpiles
seduces fizzes uses bolsters slaps speaks pleads ...
```

WDT: WH-determiner

```
that what whatever which whichever
```

WP: WH-pronoun

```
that what whatever whatsoever which who whom whosoever
```

WRB: Wh-adverb

```
how however whence whenever where whereby whereever wherein whereof why
```

For each of your dataframe column, you could get quantile with:

``````q = df["col"].quantile(0.99)
``````

and then filter with:

``````df[df["col"] < q]
``````

If one needs to remove lower and upper outliers, combine the conditions with an AND statement:

``````q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
``````
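
Two equivalent spellings may read better; note that `Series.between` is inclusive on both ends, unlike the strict `<`/`>` comparisons above, and the toy DataFrame here is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"col": list(range(100)) + [10_000]})  # one obvious outlier

q_low, q_hi = df["col"].quantile([0.01, 0.99])

# Keep rows inside the quantile band in one readable expression
trimmed = df[df["col"].between(q_low, q_hi)]

# Or cap the outliers at the quantile values instead of dropping rows
capped = df["col"].clip(lower=q_low, upper=q_hi)

print(len(trimmed), capped.max())
```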

Python is a very versatile language. You can print variables in several different ways; I have listed five methods below. Use whichever is most convenient.

Example:

```
a = 1
b = "ball"
```

Method 1:

```
print("I have %d %s" % (a, b))
```

Method 2:

```
print("I have", a, b)
```

Method 3:

```
print("I have {} {}".format(a, b))
```

Method 4:

```
print("I have " + str(a) + " " + b)
```

Method 5:

```
print(f"I have {a} {b}")
```

The output would be:

```
I have 1 ball
```