NLP | Python Methods and Functions | readability | String Variables

This article illustrates the traditional formulas available for estimating a readability score. Natural language processing sometimes requires analyzing words and sentences to determine the complexity of a text. Readability metrics are typically grade levels on specific scales that rate a text according to its complexity. They help the author revise the text so that it is understandable to a wider audience, which makes the content more engaging.

The various methods available for determining a readability score:

1) Dale-Chall formula
2) Gunning fog formula
3) McLaughlin's SMOG formula
4) FORCAST formula
5) Flesch formula

The implementation of these readability formulas is shown below.

Dale-Chall Formula

To apply the formula:

Select several 100-word samples throughout the text.
Calculate the average sentence length in words (divide the number of words by the number of sentences).
Calculate the percentage of words NOT on the Dale-Chall list of 3,000 easy words.
Plug these values into the following equation:

` Raw score = 0.1579 * (PDW) + 0.0496 * (ASL) + 3.6365 `
Here, PDW = percentage of difficult words not on the Dale-Chall word list, and ASL = average sentence length.
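As a quick arithmetic check, the raw score can be computed directly from the two inputs. The percentages below are made-up illustration values, not measurements from any real text:

```python
def dale_chall_raw_score(pdw, asl):
    # Raw score = 0.1579 * PDW + 0.0496 * ASL + 3.6365
    return 0.1579 * pdw + 0.0496 * asl + 3.6365

# Hypothetical sample: 6% difficult words, average sentence length of 14.
print(round(dale_chall_raw_score(6.0, 14.0), 2))  # 5.28
```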

Gunning Fog Formula

` Grade level = 0.4 * ((average sentence length) + (percentage of hard words)) `
Here, hard words = words with more than two syllables.
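The fog grade is a straight weighted sum of the two quantities; a sketch with hypothetical inputs:

```python
def gunning_fog_grade(avg_sentence_length, pct_hard_words):
    # Grade level = 0.4 * (average sentence length + percentage of hard words)
    return 0.4 * (avg_sentence_length + pct_hard_words)

# Hypothetical sample: 15 words per sentence, 10% hard words.
print(gunning_fog_grade(15.0, 10.0))  # 10.0
```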

SMOG Formula

` SMOG grading = 3 + √(polysyllable count) `
Here, polysyllable count = number of words of more than two syllables in a sample of 30 sentences.
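The simple form of the grading above can be sketched directly (the polysyllable count here is an invented example value):

```python
import math

def smog_grade(polysyllable_count):
    # SMOG grading = 3 + sqrt(polysyllable count)
    return 3 + math.sqrt(polysyllable_count)

# Hypothetical 30-sentence sample containing 25 polysyllabic words.
print(smog_grade(25))  # 8.0
```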

Flesch Formula

` Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW) `
Here, ASL = average sentence length (number of words divided by number of sentences), and ASW = average word length in syllables (number of syllables divided by number of words).
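Plugging hypothetical values into the Flesch formula:

```python
def reading_ease(asl, asw):
    # Reading Ease score = 206.835 - (1.015 * ASL) - (84.6 * ASW)
    return 206.835 - (1.015 * asl) - (84.6 * asw)

# Hypothetical sample: ASL = 15 words, ASW = 1.5 syllables per word.
print(round(reading_ease(15.0, 1.5), 2))  # 64.71
```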

Advantages:

1. Readability formulas measure the grade level readers must have in order to read a given text. Thus, the author of the text receives much-needed information to reach his target audience.

2. Know in advance if the target audience can understand your content.

3. Easy to use.

4. Readable text attracts more audience.

Disadvantages:

1. With the many readability formulas available, there is an increasing likelihood of wide variations in the results for the same text.

2. Applies math to literature, which is not always a good idea.

3. They cannot measure the complexity of a word or phrase to determine exactly where it needs fixing.

``````import spacy
from textstat.textstat import textstatistics, easy_word_set, legacy_round

# Splits the text into sentences, using spaCy's sentence segmentation,
# which can be found at https://spacy.io/usage/spacy-101
def break_sentences(text):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)

# Returns the number of words in the text
def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words

# Returns the number of sentences in the text
def sentence_count(text):
    return len(break_sentences(text))

# Returns the average sentence length
def avg_sentence_length(text):
    words = word_count(text)
    sentences = sentence_count(text)
    return float(words / sentences)

# Textstat is a Python package for calculating statistics from text to
# determine readability, complexity and grade level of a particular corpus.
# The package can be found at https://pypi.python.org/pypi/textstat
def syllables_count(word):
    return textstatistics().syllable_count(word)

# Returns the average number of syllables per word in the text
def avg_syllables_per_word(text):
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)

# Returns the total number of difficult words in the text
def difficult_words(text):
    # Find all words in the text
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
    # Difficult words are those with 2 or more syllables;
    # easy_word_set is provided by Textstat as a list of common words
    diff_words_set = set()
    for word in words:
        syllable_count = syllables_count(word)
        if word not in easy_word_set and syllable_count >= 2:
            diff_words_set.add(word)
    return len(diff_words_set)

# A word is polysyllabic if it has 3 or more syllables; this function
# returns the count of all such words present in the text
def poly_syllable_count(text):
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [token for token in sentence]
    for word in words:
        syllable_count = syllables_count(str(word))
        if syllable_count >= 3:
            count += 1
    return count

def flesch_reading_ease(text):
    """
    Implements the Flesch formula:
    Reading Ease score = 206.835 - (1.015 x ASL) - (84.6 x ASW)
    Here,
      ASL = average sentence length (number of words
            divided by the number of sentences)
      ASW = average word length in syllables (number of syllables
            divided by the number of words)
    """
    FRE = 206.835 - float(1.015 * avg_sentence_length(text)) - \
          float(84.6 * avg_syllables_per_word(text))
    return legacy_round(FRE, 2)

def gunning_fog(text):
    per_diff_words = (difficult_words(text) / word_count(text) * 100) + 5
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade

def smog_index(text):
    """
    Implements the SMOG formula/grading:
    SMOG grading = 3 + sqrt(polysyllable count)
    Here, polysyllable count = number of words of more than two
    syllables in a sample of 30 sentences.
    """
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30 * (poly_syllab / sentence_count(text))) ** 0.5) \
               + 3.1291
        return legacy_round(SMOG, 1)
    else:
        return 0

def dale_chall_readability_score(text):
    """
    Implements the Dale-Chall formula:
    Raw score = 0.1579 * (PDW) + 0.0496 * (ASL) + 3.6365
    Here,
      PDW = percentage of difficult words
      ASL = average sentence length
    """
    words = word_count(text)
    # Number of words not considered difficult
    count = words - difficult_words(text)
    if words > 0:
        # Percentage of words not on the difficult-word list
        per = float(count) / float(words) * 100
        # diff_words stores the percentage of difficult words
        diff_words = 100 - per
        raw_score = (0.1579 * diff_words) + \
                    (0.0496 * avg_sentence_length(text))
        # If the percentage of difficult words is above 5%, then:
        # Adjusted score = Raw score + 3.6365;
        # otherwise, Adjusted score = Raw score
        if diff_words > 5:
            raw_score += 3.6365
        return legacy_round(raw_score, 2)
``````
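The functions above require spaCy and Textstat. To get a dependency-free feel for the Flesch formula, here is a rough sketch that approximates syllables by counting vowel groups — a crude heuristic for illustration only, not the algorithm Textstat actually uses:

```python
import re

def naive_syllables(word):
    # Approximate syllables as the number of vowel groups (crude heuristic).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def naive_flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    asl = len(words) / len(sentences)
    asw = sum(naive_syllables(w) for w in words) / len(words)
    return 206.835 - (1.015 * asl) - (84.6 * asw)

# Short, simple sentences score high on the Reading Ease scale.
score = naive_flesch_reading_ease("The cat sat on the mat. It was happy.")
print(round(score, 2))
```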

## Python Readability Index (NLP): StackOverflow Questions

I would suggest reading PEP 483 and PEP 484 and watching this presentation by Guido on type hinting.

In a nutshell: Type hinting is literally what the words mean. You hint the type of the object(s) you're using.

Due to the dynamic nature of Python, inferring or checking the type of an object being used is especially hard. This fact makes it hard for developers to understand what exactly is going on in code they haven't written and, most importantly, for type checking tools found in many IDEs (PyCharm and PyDev come to mind) that are limited due to the fact that they don't have any indicator of what type the objects are. As a result they resort to trying to infer the type with (as mentioned in the presentation) around 50% success rate.

To take two important slides from the type hinting presentation:

### Why type hints?

1. Helps type checkers: By hinting at what type you want the object to be, the type checker can easily detect if, for instance, you're passing an object with a type that isn't expected.
2. Helps with documentation: A third person viewing your code will know what is expected where, ergo, how to use it without getting `TypeError`s.
3. Helps IDEs develop more accurate and robust tools: Development Environments will be better suited to suggesting appropriate methods when they know what type your object is. You have probably experienced this with some IDE at some point, hitting the `.` and having methods/attributes pop up which aren't defined for an object.

### Why use static type checkers?

• Find bugs sooner: This is self-evident, I believe.
• The larger your project the more you need it: Again, makes sense. Static languages offer a robustness and control that dynamic languages lack. The bigger and more complex your application becomes the more control and predictability (from a behavioral aspect) you require.
• Large teams are already running static analysis: I'm guessing this verifies the first two points.

As a closing note for this small introduction: This is an optional feature and, from what I understand, it has been introduced in order to reap some of the benefits of static typing.

You generally do not need to worry about it and definitely don't need to use it (especially in cases where you use Python as an auxiliary scripting language). It should be helpful when developing large projects as it offers much needed robustness, control and additional debugging capabilities.

## Type hinting with mypy:

In order to make this answer more complete, I think a little demonstration would be suitable. I'll be using `mypy`, the library which inspired Type Hints as they are presented in the PEP. This is mainly written for anybody bumping into this question and wondering where to begin.

Before I do that let me reiterate the following: PEP 484 doesn't enforce anything; it is simply setting a direction for function annotations and proposing guidelines for how type checking can/should be performed. You can annotate your functions and hint as many things as you want; your scripts will still run regardless of the presence of annotations because Python itself doesn't use them.

Anyways, as noted in the PEP, hinting types should generally take three forms:

• Function annotations (PEP 3107).
• Stub files for built-in/user modules.
• Special `# type: type` comments that complement the first two forms. (See: What are variable annotations? for a Python 3.6 update for `# type: type` comments)

Additionally, you'll want to use type hints in conjunction with the new `typing` module introduced in `Py3.5`. In it, many (additional) ABCs (abstract base classes) are defined along with helper functions and decorators for use in static checking. Most ABCs in `collections.abc` are included, but in a generic form in order to allow subscription (by defining a `__getitem__()` method).

For anyone interested in a more in-depth explanation of these, the `mypy documentation` is written very nicely and has a lot of code samples demonstrating/describing the functionality of their checker; it is definitely worth a read.

### Function annotations and special comments:

First, it's interesting to observe some of the behavior we can get when using special comments. Special `# type: type` comments can be added during variable assignments to indicate the type of an object if one cannot be directly inferred. Simple assignments are generally easily inferred but others, like lists (with regard to their contents), cannot.

Note: If we want to use any derivative of containers and need to specify the contents for that container we must use the generic types from the `typing` module. These support indexing.

``````# Generic List, supports indexing.
from typing import List

# In this case, the type is easily inferred as type: int.
i = 0

# Even though the type can be inferred as of type list
# there is no way to know the contents of this list.
# By using type: List[str] we indicate we want to use a list of strings.
a = []  # type: List[str]

# Appending an int to our list
# is statically not correct.
a.append(i)

# Appending a string is fine.
a.append("i")

print(a)  # [0, "i"]
``````

If we add these commands to a file and execute them with our interpreter, everything works just fine and `print(a)` just prints the contents of list `a`. The `# type` comments have been discarded, treated as plain comments which have no additional semantic meaning.

By running this with `mypy`, on the other hand, we get the following response:

``````$ mypy typeHintsCode.py
typesInline.py:14: error: Argument 1 to "append" of "list" has incompatible type "int"; expected "str"
``````

Indicating that a list of `str` objects cannot contain an `int`, which, statically speaking, is sound. This can be fixed by either abiding to the type of `a` and only appending `str` objects or by changing the type of the contents of `a` to indicate that any value is acceptable (Intuitively performed with `List[Any]` after `Any` has been imported from `typing`).
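To illustrate the `List[Any]` fix, a minimal sketch (at runtime the `# type` comment is inert; only a checker such as mypy reads it):

```python
from typing import Any, List

a = []  # type: List[Any]

# Both of these now pass a static check, since Any accepts every type.
a.append(0)
a.append("i")

print(a)  # [0, 'i']
```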

Function annotations are added in the form `param_name: type` after each parameter in your function signature and a return type is specified using the `-> type` notation before the ending function colon; all annotations are stored in the `__annotations__` attribute for that function in a handy dictionary form. Using a trivial example (which doesn't require extra types from the `typing` module):

``````def annotated(x: int, y: str) -> bool:
    return x < y
``````

The `annotated.__annotations__` attribute now has the following values:

``````{"y": <class "str">, "return": <class "bool">, "x": <class "int">}
``````

If we're a complete newbie, or we are familiar with Python 2.7 concepts and are consequently unaware of the `TypeError` lurking in the comparison of `annotated`, we can perform another static check, catch the error and save us some trouble:

``````$ mypy typeHintsCode.py
typeFunction.py: note: In function "annotated":
typeFunction.py:2: error: Unsupported operand types for > ("str" and "int")
``````

Among other things, calling the function with invalid arguments will also get caught:

``````annotated(20, 20)

# mypy complains:
typeHintsCode.py:4: error: Argument 2 to "annotated" has incompatible type "int"; expected "str"
``````

These can be extended to basically any use case and the errors caught extend further than basic calls and operations. The types you can check for are really flexible and I have merely given a small sneak peek of its potential. A look in the `typing` module, the PEPs or the `mypy` documentation will give you a more comprehensive idea of the capabilities offered.

### Stub files:

Stub files can be used in two different, non mutually exclusive, cases:

• You need to type-check a module whose function signatures you do not want to alter directly.
• You want to write modules and have type-checking but additionally want to separate annotations from content.

Stub files (with an extension of `.pyi`) are an annotated interface of the module you are making/want to use. They contain the signatures of the functions you want to type-check with the body of the functions discarded. To get a feel for this, given a set of three random functions in a module named `randfunc.py`:

``````def message(s):
    print(s)

def alterContents(myIterable):
    return [i for i in myIterable if i % 2 == 0]

def combine(messageFunc, itFunc):
    messageFunc("Printing the Iterable")
    a = alterContents(range(1, 20))
    return set(a)
``````

We can create a stub file `randfunc.pyi`, in which we can place some restrictions if we wish to do so. The downside is that somebody viewing the source without the stub won't really get that annotation assistance when trying to understand what is supposed to be passed where.

Anyway, the structure of a stub file is pretty simple: Add all function definitions with empty bodies (`pass` filled) and supply the annotations based on your requirements. Here, let's assume we only want to work with `int` types for our Containers.

``````# Stub for randfunc.py
from typing import Any, Callable, Iterable, List, Set

def message(s: str) -> None: pass

def alterContents(myIterable: Iterable[int]) -> List[int]: pass

def combine(
    messageFunc: Callable[[str], Any],
    itFunc: Callable[[Iterable[int]], List[int]],
) -> Set[int]: pass
``````

The `combine` function gives an indication of why you might want to use annotations in a different file: they sometimes clutter up the code and reduce readability (a big no-no for Python). You could of course use type aliases, but that sometimes confuses more than it helps (so use them wisely).

This should get you familiarized with the basic concepts of type hints in Python. Even though the type checker used has been `mypy`, you should gradually start to see more of them pop up, some internally in IDEs (PyCharm) and others as standard Python modules.

I'll try and add additional checkers/related packages in the following list when and if I find them (or if suggested).

Checkers I know of:

• Mypy: as described here.
• PyType: By Google, uses different notation from what I gather, probably worth a look.

Related Packages/Projects:

• typeshed: Official Python repository housing an assortment of stub files for the standard library.

The `typeshed` project is actually one of the best places you can look to see how type hinting might be used in a project of your own. Let's take as an example the `__init__` dunders of the `Counter` class in the corresponding `.pyi` file:

``````class Counter(Dict[_T, int], Generic[_T]):
    @overload
    def __init__(self) -> None: ...
    @overload
    def __init__(self, Mapping: Mapping[_T, int]) -> None: ...
    @overload
    def __init__(self, iterable: Iterable[_T]) -> None: ...
``````

Where `_T = TypeVar("_T")` is used to define generic classes. For the `Counter` class we can see that it can either take no arguments in its initializer, get a single `Mapping` from any type to an `int` or take an `Iterable` of any type.
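As a hedged sketch of how `TypeVar` and `Generic` combine to define a generic class of your own (the `Bag` class here is purely illustrative, not from typeshed):

```python
from typing import Generic, Iterable, TypeVar

_T = TypeVar("_T")

class Bag(Generic[_T]):
    """A trivial generic container: a checker infers _T from the contents."""
    def __init__(self, items: Iterable[_T]) -> None:
        self.items = list(items)

    def first(self) -> _T:
        return self.items[0]

b = Bag([1, 2, 3])  # a static checker treats this as Bag[int]
print(b.first())  # 1
```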

Notice: One thing I forgot to mention was that the `typing` module has been introduced on a provisional basis. From PEP 411:

A provisional package may have its API modified prior to "graduating" into a "stable" state. On one hand, this state provides the package with the benefits of being formally part of the Python distribution. On the other hand, the core development team explicitly states that no promises are made with regards to the stability of the package's API, which may change for the next release. While it is considered an unlikely outcome, such packages may even be removed from the standard library without a deprecation period if the concerns regarding their API or maintenance prove well-founded.

So take things here with a pinch of salt; I'm doubtful it will be removed or altered in significant ways, but one can never know.

Another topic altogether, but valid in the scope of type hints: PEP 526: Syntax for Variable Annotations is an effort to replace `# type` comments by introducing new syntax which allows users to annotate the type of variables in simple `varname: type` statements.

See What are variable annotations?, as previously mentioned, for a small introduction to these.
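A small sketch of the PEP 526 syntax (the names are illustrative); annotations on a class land in its `__annotations__` mapping, just like function annotations do:

```python
from typing import List

class Config:
    retries: int = 3          # varname: type = value
    hosts: List[str] = []     # container contents via the typing module

print(Config.__annotations__)
```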

I think you're almost there, try removing the extra square brackets around the `lst`s (also you don't need to specify the column names when you're creating a dataframe from a dict like this):

``````import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
    {"lst1Title": lst1,
     "lst2Title": lst2,
     "lst3Title": lst3
    })

percentile_list
lst1Title  lst2Title  lst3Title
0          0         0         0
1          1         1         1
2          2         2         2
3          3         3         3
4          4         4         4
5          5         5         5
6          6         6         6
...
``````

If you need a more performant solution you can use `np.column_stack` rather than `zip` as in your first attempt; this has around a 2x speedup on the example here, but comes at a bit of a cost in readability, in my opinion:

``````import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=["lst1Title", "lst2Title", "lst3Title"])
``````

Python 3 - UPDATED 18th November 2015

Found the accepted answer useful, yet wished to expand on several points for the benefit of others based on my own experiences.

Module: A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended.

Module Example: Assume we have a single python script in the current directory, here I am calling it mymodule.py

The file mymodule.py contains the following code:

``````def myfunc():
    print("Hello!")
``````

If we run the python3 interpreter from the current directory, we can import and run the function myfunc in the following different ways (you would typically just choose one of the following):

``````>>> import mymodule
>>> mymodule.myfunc()
Hello!
>>> from mymodule import myfunc
>>> myfunc()
Hello!
>>> from mymodule import *
>>> myfunc()
Hello!
``````

Ok, so that was easy enough.

Now assume you have the need to put this module into its own dedicated folder to provide a module namespace, instead of just running it ad-hoc from the current working directory. This is where it is worth explaining the concept of a package.

Package: Packages are a way of structuring Python's module namespace by using "dotted module names". For example, the module name A.B designates a submodule named B in a package named A. Just like the use of modules saves the authors of different modules from having to worry about each other's global variable names, the use of dotted module names saves the authors of multi-module packages like NumPy or the Python Imaging Library from having to worry about each other's module names.

Package Example: Let's now assume we have the following folder and files. Here, mymodule.py is identical to before, and __init__.py is an empty file:

``````.
└── mypackage
    ├── __init__.py
    └── mymodule.py
``````

The __init__.py files are required to make Python treat the directories as containing packages. For further information, please see the Modules documentation link provided later on.

Our current working directory is one level above the ordinary folder called mypackage

``````$ ls
mypackage
``````

If we run the python3 interpreter now, we can import and run the module mymodule.py containing the required function myfunc in the following different ways (you would typically just choose one of the following):

``````>>> import mypackage
>>> from mypackage import mymodule
>>> mymodule.myfunc()
Hello!
>>> import mypackage.mymodule
>>> mypackage.mymodule.myfunc()
Hello!
>>> from mypackage import mymodule
>>> mymodule.myfunc()
Hello!
>>> from mypackage.mymodule import myfunc
>>> myfunc()
Hello!
>>> from mypackage.mymodule import *
>>> myfunc()
Hello!
``````

Assuming Python 3, there is excellent documentation at: Modules

In terms of naming conventions for packages and modules, the general guidelines are given in PEP-0008 - please see Package and Module Names

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.

# What is the best way of implementing a reverse function for strings?

My own experience with this question is academic. However, if you're a pro looking for the quick answer, use a slice that steps by `-1`:

``````>>> "a string"[::-1]
"gnirts a"
``````

or more readably (but slower due to the method name lookups and the fact that join forms a list when given an iterator), `str.join`:

``````>>> "".join(reversed("a string"))
"gnirts a"
``````

or for readability and reusability, put the slice in a function

``````def reversed_string(a_string):
    return a_string[::-1]
``````

and then:

``````>>> reversed_string("a_string")
"gnirts_a"
``````

## Longer explanation

There is no built-in reverse function in Python's str object.

Here are a couple of things about Python's strings you should know:

1. In Python, strings are immutable. Changing a string does not modify the string. It creates a new one.

2. Strings are sliceable. Slicing a string gives you a new string from one point in the string, backwards or forwards, to another point, by given increments. They take slice notation or a slice object in a subscript:

``````string[subscript]
``````

The subscript creates a slice by including a colon within the brackets:

``````    string[start:stop:step]
``````

To create a slice outside of the brackets, you'll need to create a slice object:

``````    slice_obj = slice(start, stop, step)
    string[slice_obj]
``````

While `"".join(reversed("foo"))` is readable, it requires calling a string method, `str.join`, on another called function, which can be rather relatively slow. Let"s put this in a function - we"ll come back to it:

``````def reverse_string_readable_answer(string):
    return "".join(reversed(string))
``````

## Most performant approach:

Much faster is using a reverse slice:

``````"foo"[::-1]
``````

But how can we make this more readable and understandable to someone less familiar with slices or the intent of the original author? Let's create a slice object outside of the subscript notation, give it a descriptive name, and pass it to the subscript notation.

``````start = stop = None
step = -1
reverse_slice = slice(start, stop, step)
"foo"[reverse_slice]
``````

## Implement as Function

To actually implement this as a function, I think it is semantically clear enough to simply use a descriptive name:

``````def reversed_string(a_string):
    return a_string[::-1]
``````

And usage is simply:

``````reversed_string("foo")
``````

## What your teacher probably wants:

If you have an instructor, they probably want you to start with an empty string, and build up a new string from the old one. You can do this with pure syntax and literals using a while loop:

``````def reverse_a_string_slowly(a_string):
    new_string = ""
    index = len(a_string)
    while index:
        index -= 1                    # index = index - 1
        new_string += a_string[index] # new_string = new_string + character
    return new_string
``````

This is theoretically bad because, remember, strings are immutable - so every time it looks like you're appending a character onto your `new_string`, a new string is theoretically being created! However, CPython knows how to optimize this in certain cases, of which this trivial case is one.

## Best Practice

Theoretically better is to collect your substrings in a list, and join them later:

``````def reverse_a_string_more_slowly(a_string):
    new_strings = []
    index = len(a_string)
    while index:
        index -= 1
        new_strings.append(a_string[index])
    return "".join(new_strings)
``````

However, as we will see in the timings below for CPython, this actually takes longer, because CPython can optimize the string concatenation.

## Timings

Here are the timings:

``````>>> import timeit
>>> a_string = "amanaplanacanalpanama" * 10
>>> min(timeit.repeat(lambda: reverse_string_readable_answer(a_string)))
10.38789987564087
>>> min(timeit.repeat(lambda: reversed_string(a_string)))
0.6622700691223145
>>> min(timeit.repeat(lambda: reverse_a_string_slowly(a_string)))
25.756799936294556
>>> min(timeit.repeat(lambda: reverse_a_string_more_slowly(a_string)))
38.73570013046265
``````

CPython optimizes string concatenation, whereas other implementations may not:

... do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the "".join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
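To tie the approaches together, the sketch below checks that all three techniques produce identical output and times the two idiomatic ones; absolute numbers will differ from the figures above, since they depend on the machine:

```python
import timeit

a_string = "amanaplanacanalpanama" * 10

by_slice = a_string[::-1]
by_join = "".join(reversed(a_string))

# Loop version, as in reverse_a_string_slowly above.
looped = ""
index = len(a_string)
while index:
    index -= 1
    looped += a_string[index]

# All three give the same reversed string.
assert by_slice == by_join == looped

t_slice = timeit.timeit(lambda: a_string[::-1], number=10_000)
t_join = timeit.timeit(lambda: "".join(reversed(a_string)), number=10_000)
print(f"slice: {t_slice:.4f}s  join: {t_join:.4f}s")
```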

I've compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it's useful anyway.

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which'll save a lot of space but cost some load time.

If portability is an issue, binary is better than npy. If human readability is important, then you'll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).

More details and the code are available at the github repo.

# For pandas >= 0.25

The functionality to name returned aggregate columns has been reintroduced in the master branch and is targeted for pandas 0.25. The new syntax is `.agg(new_col_name=("col_name", "agg_func"))`. Detailed example from the PR linked above:

``````In [2]: df = pd.DataFrame({"kind": ["cat", "dog", "cat", "dog"],
...:                    "height": [9.1, 6.0, 9.5, 34.0],
...:                    "weight": [7.9, 7.5, 9.9, 198.0]})
...:

In [3]: df
Out[3]:
kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [4]: df.groupby("kind").agg(min_height=("height", "min"),
max_weight=("weight", "max"))
Out[4]:
min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
``````

It will also be possible to use multiple lambda expressions with this syntax and the two-step rename syntax I suggested earlier (below) as per this PR. Again, copying from the example in the PR:

``````In [2]: df = pd.DataFrame({"A": ["a", "a"], "B": [1, 2], "C": [3, 4]})

In [3]: df.groupby("A").agg({"B": [lambda x: 0, lambda x: 1]})
Out[3]:
B
<lambda> <lambda 1>
A
a        0          1
``````

and then `.rename()`, or in one go:

``````In [4]: df.groupby("A").agg(b=("B", lambda x: 0), c=("B", lambda x: 1))
Out[4]:
b  c
A
a  0  1
``````

# For pandas < 0.25

The currently accepted answer by unutbu describes a great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.

Series:

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version

DataFrames:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.

``````# Create a sample data frame
df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
"B": range(5),
"C": range(5)})

# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby("A").B.agg({"foo": "count"})
# Recommended replacement syntax
df.groupby("A").B.agg(["count"]).rename(columns={"count": "foo"})

# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby("A").agg({"B": {"foo": "sum"}, "C": {"bar": "min"}})
# Recommended replacement syntax
df.groupby("A").agg({"B": "sum", "C": "min"}).rename(columns={"B": "foo", "C": "bar"})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby("A")
.agg({"B": "sum", "C": "min"})
.rename(columns={"B": "foo", "C": "bar"})
)
``````

### Update 2017-01-03 in response to @JunkMechanic"s comment.

With the old style dictionary syntax, it was possible to pass multiple `lambda` functions to `.agg`, since these would be renamed with the key in the passed dictionary:

``````>>> df.groupby("A").agg({"B": {"min": lambda x: x.min(), "max": lambda x: x.max()}})

     B
   max min
A
1    2   0
2    4   3
``````

Multiple functions can also be passed to a single column as a list:

``````>>> df.groupby("A").agg({"B": [np.min, np.max]})

      B
   amin amax
A
1     0    2
2     3    4
``````

However, this does not work with lambda functions, since they are anonymous and all return `<lambda>`, which causes a name collision:

``````>>> df.groupby("A").agg({"B": [lambda x: x.min(), lambda x: x.max()]})
SpecificationError: Function names must be unique, found multiple named <lambda>
``````

To avoid the `SpecificationError`, named functions can be defined a priori instead of using `lambda`. Suitable function names also avoid calling `.rename` on the data frame afterwards. These functions can be passed with the same list syntax as above:

``````>>> def my_min(x):
...     return x.min()

>>> def my_max(x):
...     return x.max()

>>> df.groupby("A").agg({"B": [my_min, my_max]})

        B
   my_min my_max
A
1       0      2
2       3      4
``````

If we look at the Zen of Python, emphasis mine:

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

The most Pythonic solution is the one that is clearest, simplest, and easiest to explain:

``````a + b == c or a + c == b or b + c == a
``````

Even better, you don't even need to know Python to understand this code! It's that easy. This is, without reservation, the best solution. Anything else is intellectual masturbation.

Furthermore, this is likely the best performing solution as well, as it is the only one out of all the proposals that short circuits. If `a + b == c`, only a single addition and comparison is done.

You do not need to call `d.keys()`, so

``````if key not in d:
    d[key] = value
``````

is enough. There is no clearer, more readable method.
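For instance, a minimal self-contained sketch (keys and values here are made up for illustration):

```python
d = {"a": 1}

# Insert only if the key is absent.
key, value = "b", 2
if key not in d:
    d[key] = value

# An existing key is left untouched.
if "a" not in d:
    d["a"] = 99

print(d)  # {'a': 1, 'b': 2}
```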

You could instead use `dict.get()`, which returns the existing value if the key is already present:

``````d[key] = d.get(key, value)
``````

but I strongly recommend against this; this is code golfing, hindering maintenance and readability.

TL;DR: No, `for` loops are not blanket "bad", at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

1. When your data is small (...depending on what you're doing),
2. When dealing with `object`/mixed dtypes
3. When using the `str`/regex accessor functions

Let's examine these situations individually.

### Iteration vs. Vectorization on Small Data

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function to ensure it works correctly:

1. Index/axis alignment
2. Handling mixed datatypes
3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, `Series.add`), while it is more pronounced for string functions (for example, `Series.str.replace`).

`for` loops, on the other hand, are faster than you think. What's even better, list comprehensions (which create lists through `for` loops) are faster still, as they are optimized iterative mechanisms for list creation.

``````[f(x) for x in seq]
``````

Where `seq` is a pandas series or DataFrame column. Or, when operating over multiple columns,

``````[f(x, y) for x, y in zip(seq1, seq2)]
``````

Where `seq1` and `seq2` are columns.
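As a concrete sketch of the two-column form (the column names and the combining function are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"first": ["john", "jane"], "last": ["doe", "smith"]})

# f(x, y) here is simple string concatenation over two columns.
full_names = [f"{x} {y}" for x, y in zip(df["first"], df["last"])]
print(full_names)  # ['john doe', 'jane smith']
```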

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against `Series.ne` (`!=`) and `query`. Here are the functions:

``````# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp
``````

For simplicity, I have used the `perfplot` package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms `query` for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, this approach will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

``````df[df.A.values != df.B.values]
``````

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example - this time, with another vanilla python construct that is faster than a for loop - `collections.Counter`. A common requirement is to compute the value counts and return the result as a dictionary. This is done with `value_counts`, `np.unique`, and `Counter`:

``````# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter
``````

The results are more pronounced here: `Counter` wins out over both vectorized methods for a larger range of small N (~3500).

Note
More trivia (courtesy @user2357112). The `Counter` is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a `for` loop. Python power!

Of course, the takeaway from here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always Cython and Numba. Let's add this test into the mix.

``````import numpy as np
from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):  # illustrative name for the jitted kernel
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]
    return np.array(result)
``````

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.

### Operations with Mixed/`object` dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

``````# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp
``````

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.
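A quick way to see this (a minimal sketch): numeric columns have a fast native dtype, but once cast to string, pandas stores them as generic python objects and elementwise operations loop under the hood.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Numeric columns get a native (fast) dtype.
print(df.dtypes)

# After casting to string, the columns hold generic python objects.
print(df.astype(str).dtypes)
```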

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: `map` and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

``````# Dictionary value extraction.
ser.map(operator.itemgetter("value"))     # map
pd.Series([x.get("value") for x in ser])  # list comprehension
``````

Positional List Indexing
Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions), `map`, the `str.get` accessor method, and the list comprehension:

``````# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan
``````

``````ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe
``````

Note
If the index matters, you would want to do:

``````pd.Series([...], index=ser.index)
``````

when reconstructing the series.

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

``````# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp
``````

Both `itertools.chain.from_iterable` and the nested list comprehension are pure python constructs, and scale much better than the `stack` solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.
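For example, a column of dicts can usually be expanded into separate scalar columns up front (a sketch with made-up data):

```python
import pandas as pd

ser = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])

# Expand the dicts into one scalar column per dict key.
flat = pd.DataFrame(ser.tolist())
# flat now has scalar columns "key" and "value".
```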

Lastly, the applicability of these solutions depends largely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed `apply` on these solutions, because it would skew the graph (yes, it's that slow).

### Regex Operations, and `.str` Accessor Methods

Pandas can apply regex operations such as `str.contains`, `str.extract`, and `str.extractall`, as well as other "vectorized" string operations (such as `str.split`, `str.find`, `str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern with `re.compile` and iterate over your data (also see Is it worth using Python's re.compile?). The list comp equivalent to `str.contains` looks something like this:

``````p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])
``````

Or,

``````ser2 = ser[[bool(p.search(x)) for x in ser]]
``````

If you need to handle NaNs, you can do something like

``````ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
``````

The list comp equivalent to `str.extract` (without groups) will look something like:

``````df["col2"] = [p.search(x).group(0) for x in df["col"]]
``````

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

``````def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df["col2"] = [matcher(x) for x in df["col"]]
``````

The `matcher` function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the `group` or `groups` attribute of the match object.

For `str.extractall`, change `p.search` to `p.findall`.
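A hedged sketch of that swap, with sample data invented for illustration: `p.findall` returns all matches per row, analogous to what `str.extractall` collects (though `extractall` returns one row per match instead of a list).

```python
import re
import pandas as pd

p = re.compile(r"\d+")
ser = pd.Series(["a1 b22", "c333", "no digits"])

# One list of all matches per row.
matches = [p.findall(x) for x in ser]
print(matches)  # [['1', '22'], ['333'], []]
```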

String Extraction
Consider a simple extraction operation. The idea is to extract a 4-digit sequence if it is preceded by an upper case letter.

``````# Extracting strings.
p = re.compile(r"(?<=[A-Z])(\d{4})")
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension
``````

More Examples
Full disclosure - I am the author (in part or whole) of these posts listed below.

### Conclusion

As shown from the examples above, iteration shines when working with small DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Additionally, sometimes just operating on the underlying arrays via `.values` as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example `df[df.A.values != df.B.values]` would show instant performance boosts over `df[df.A != df.B]`. Using `.values` may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.

### Appendix: Code Snippets

``````import perfplot
import operator
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain
``````

``````# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A", "B"]),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N"
)
``````

``````# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=["value_counts", "np.unique", "Counter"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=lambda x, y: dict(x) == dict(y)
)
``````

``````# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A", "B"], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
``````

``````# Dictionary value extraction.
ser1 = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter("value")),
        lambda ser: pd.Series([x.get("value") for x in ser]),
    ],
    labels=["map", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
``````

``````# List positional indexing.
ser2 = pd.Series([["a", "b", "c"], [1, 2], []])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=["map", "str accessor", "list comprehension", "list comp safe"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
``````

``````# Nested list flattening.
ser3 = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=["stack", "itertools.chain", "nested list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
``````

``````# Extracting strings.
ser4 = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser]),
    ],
    labels=["str.extract", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
``````

The answer is no, but you can use `collections.OrderedDict` from the Python standard library with just keys (and values as `None`) for the same purpose.
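A minimal sketch of that approach:

```python
from collections import OrderedDict

keywords = ["foo", "bar", "bar", "foo", "baz", "foo"]

# Keys act as the set members; the values are just None and are ignored.
ordered_unique = list(OrderedDict.fromkeys(keywords))
print(ordered_unique)  # ['foo', 'bar', 'baz']
```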

Update: As of Python 3.7 (and CPython 3.6), standard `dict` is guaranteed to preserve order and is more performant than `OrderedDict`. (For backward compatibility and especially readability, however, you may wish to continue using `OrderedDict`.)

Here's an example of how to use `dict` to filter out duplicate items while preserving order, thereby emulating an ordered set. Use the `dict` class method `fromkeys()` to create a dict, then simply ask for the `keys()` back.

``````>>> keywords = ["foo", "bar", "bar", "foo", "baz", "foo"]

>>> list(dict.fromkeys(keywords))
["foo", "bar", "baz"]
``````