# Python | Find the sublist with the maximum value in a given nested list

find | Python Methods and Functions | sublist

Examples :

`  Input:  [['Paras', 90], [' Jain', 32], ['Geeks', 120], ['for', 338], [' Labs', 532]]  Output:  ['Labs', 532]  Input:  [[' Geek', 90], ['For', 32], [' Geeks', 120]]  Output:  ['Geeks', 120] `

Method # 1: Using Lambda

 ` # Python code to find the maximum value ` ` # in the second column of the list box `   ` # Initialize the input list ` ` Input ` ` = ` ` [[` `' Paras' ` `, ` ` 90 ` `], [` `' Jain' ` `, ` ` 32 ` `], [` ` 'Geeks' ` `, ` ` 120 ` `], ` ` ` ` [ ` ` 'for' ` `, ` ` 338 ` `], [` `' Labs' ` `, ` ` 532 ` `]] ` ` # Using lambda ` ` Output ` ` = ` ` max ` ` (` ` Input ` `, key ` ` = ` ` lambda ` ` x: x [` ` 1 ` `]) `    ` # printout ` ` print ` ` (` ` "Input List is:" ` `, ` ` Input ` `) ` ` print ` ` (` ` "Output list is:" ` `, Output) `

Output:

Input List is: [['Paras ', 90], [' Jain ', 32], [' Geeks', 120], ['for', 338], ['Labs', 532]]
Output list is: [' Labs', 532]

Method # 2: Using itemgetter

` `

 ` # Python code to find the maximum values ​​` ` # in the second column of the list box `    ` # Import ` ` import ` ` operator `   ` # Initialize the input list ` ` Input ` ` = ` ` [[` ` 'Paras' ` `, ` ` 90 ` ` ], [` ` 'Jain' ` `, ` ` 32 ` `], [` ` 'Geeks' ` `, ` ` 120 ` `], ` ` [` ` 'for' ` `, ` ` 338 ` ` ], [` ` 'Labs' ` `, ` ` 532 ` `]] ` ` # Using itemgetter ` ` Output ` ` = ` ` max ` ` (` ` Input ` `, key ` ` = ` ` operator.itemgetter (` ` 1 ` `)) `   ` # Printout ` ` print ` ` (` `" Input List is: "` `, ` ` Input ` `) ` ` print ` ` (` ` "Output list is:" ` `, Output) `
` `

` ` Exit:

Input List is: [['Paras', 90], ['Jain', 32], ['Geeks', 120], ['for', 338], ['Labs', 532]]
Output list is: ['Labs', 532]

## Finding the index of an item in a list

Given a list `["foo", "bar", "baz"]` and an item in the list `"bar"`, how do I get its index (`1`) in Python?

## Find current directory and file"s directory

In Python, what commands can I use to find:

1. the current directory (where I was in the terminal when I ran the Python script), and
2. where the file I am executing is?

## How to find if directory exists in Python

In the `os` module in Python, is there a way to find if a directory exists, something like:

``````>>> os.direxists(os.path.join(os.getcwd()), "new_folder")) # in pseudocode
True/False
``````

## How do I find the location of my Python site-packages directory?

### Question by Daryl Spitzer

How do I find the location of my site-packages directory?

## Find all files in a directory with extension .txt in Python

How can I find all the files in a directory having the extension `.txt` in python?

## Find which version of package is installed with pip

Using pip, is it possible to figure out which version of a package is currently installed?

I know about `pip install XYZ --upgrade` but I am wondering if there is anything like `pip info XYZ`. If not what would be the best way to tell what version I am currently using.

## error: Unable to find vcvarsall.bat

I tried to install the Python package dulwich:

``````pip install dulwich
``````

But I get a cryptic error message:

``````error: Unable to find vcvarsall.bat
``````

The same happens if I try installing the package manually:

``````> python setup.py install
running build_ext
building "dulwich._objects" extension
error: Unable to find vcvarsall.bat
``````

## How to use glob() to find files recursively?

This is what I have:

``````glob(os.path.join("src","*.c"))
``````

but I want to search the subfolders of src. Something like this would work:

``````glob(os.path.join("src","*.c"))
glob(os.path.join("src","*","*.c"))
glob(os.path.join("src","*","*","*.c"))
glob(os.path.join("src","*","*","*","*.c"))
``````

But this is obviously limited and clunky.

## Python: Find in list

I have come across this:

``````item = someSortOfSelection()
if item in myList:
doMySpecialFunction(item)
``````

but sometimes it does not work with all my items, as if they weren"t recognized in the list (when it"s a list of string).

Is this the most "pythonic" way of finding an item in a list: `if x in l:`?

## How to find out the number of CPUs using python

I want to know the number of CPUs on the local machine using Python. The result should be `user/real` as output by `time(1)` when called with an optimally scaling userspace-only program.

## How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "`iter`" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use `DataFrame.to_string()`.

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla `for` loop)
4. `DataFrame.apply()`: i) ¬†Reductions that can be performed in Cython, ii) Iteration in Python space
5. `DataFrame.itertuples()` and `iteritems()`
6. `DataFrame.iterrows()`

`iterrows` and `itertuples` (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". `df.iterrows()` is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.

## Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

## Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

``````# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]
``````

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over `zip(df["A"], df["B"], ...)` instead of `df[["A", "B"]].to_numpy()` as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, `to_numpy()` will cast the entire array to string, which may not be what you want. Fortunately `zip`ping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.

## An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns `A + B`. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer `vec` over `vec_numpy`).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

## Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls `iterrows()` while doing something inside a `for` loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

# In Python, what is the purpose of `__slots__` and what are the cases one should avoid this?

## TLDR:

The special attribute `__slots__` allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

1. faster attribute access.
2. space savings in memory.

The space savings is from

1. Storing value references in slots instead of `__dict__`.
2. Denying `__dict__` and `__weakref__` creation if parent classes deny them and you declare `__slots__`.

### Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

``````class Base:
__slots__ = "foo", "bar"

class Right(Base):
__slots__ = "baz",

class Wrong(Base):
__slots__ = "foo", "bar", "baz"        # redundant foo and bar
``````

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

``````>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)
``````

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

``````>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"
``````

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

### Requirements:

• To have attributes named in `__slots__` to actually be stored in slots instead of a `__dict__`, a class must inherit from `object` (automatic in Python 3, but must be explicit in Python 2).

• To prevent the creation of a `__dict__`, you must inherit from `object` and all classes in the inheritance must declare `__slots__` and none of them can have a `"__dict__"` entry.

There are a lot of details if you wish to keep reading.

## Why use `__slots__`: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created `__slots__` for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

``````import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
def get_set_delete():
obj.foo = "foo"
obj.foo
del obj.foo
return get_set_delete
``````

and

``````>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085
``````

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

``````>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342
``````

In Python 2 on Windows I have measured it about 15% faster.

## Why use `__slots__`: Memory Savings

Another purpose of `__slots__` is to reduce the space in memory that each object instance takes up.

The space saved over using `__dict__` can be significant.

SQLAlchemy attributes a lot of memory savings to `__slots__`.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with `guppy.hpy` (aka heapy) and `sys.getsizeof`, the size of a class instance without `__slots__` declared, and nothing else, is 64 bytes. That does not include the `__dict__`. Thank you Python for lazy evaluation again, the `__dict__` is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the `__dict__` attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with `__slots__` declared to be `()` (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for `__slots__` and `__dict__` (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

``````       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272‚Ä†   16         56 + 112‚Ä† | ‚Ä†if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408
43     384        56 + 3344   384        56 + 752
``````

So, in spite of smaller dicts in Python 3, we see how nicely `__slots__` scale for instances to save us memory, and that is a major reason you would want to use `__slots__`.

Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

``````>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72
``````

## Demonstration of `__slots__`:

To deny the creation of a `__dict__`, you must subclass `object`. Everything subclasses `object` in Python 3, but in Python 2 you had to be explicit:

``````class Base(object):
__slots__ = ()
``````

now:

``````>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
b.a = "a"
AttributeError: "Base" object has no attribute "a"
``````

Or subclass another class that defines `__slots__`

``````class Child(Base):
__slots__ = ("a",)
``````

and now:

``````c = Child()
c.a = "a"
``````

but:

``````>>> c.b = "b"
Traceback (most recent call last):
File "<pyshell#42>", line 1, in <module>
c.b = "b"
AttributeError: "Child" object has no attribute "b"
``````

To allow `__dict__` creation while subclassing slotted objects, just add `"__dict__"` to the `__slots__` (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes):

``````class SlottedWithDict(Child):
__slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"
``````

and

``````>>> swd.__dict__
{"c": "c"}
``````

Or you don"t even need to declare `__slots__` in your subclass, and you will still use slots from the parents, but not restrict the creation of a `__dict__`:

``````class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"
``````

And:

``````>>> ns.__dict__
{"b": "b"}
``````

However, `__slots__` may cause problems for multiple inheritance:

``````class BaseA(object):
__slots__ = ("a",)

class BaseB(object):
__slots__ = ("b",)
``````

Because creating a child class from parents with both non-empty slots fails:

``````>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
File "<pyshell#68>", line 1, in <module>
class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict
``````

If you run into this problem, You could just remove `__slots__` from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

``````from abc import ABC

class AbstractA(ABC):
__slots__ = ()

class BaseA(AbstractA):
__slots__ = ("a",)

class AbstractB(ABC):
__slots__ = ()

class BaseB(AbstractB):
__slots__ = ("b",)

class Child(AbstractA, AbstractB):
__slots__ = ("a", "b")

c = Child() # no problem!
``````

### Add `"__dict__"` to `__slots__` to get dynamic assignment:

``````class Foo(object):
__slots__ = "bar", "baz", "__dict__"
``````

and now:

``````>>> foo = Foo()
>>> foo.boink = "boink"
``````

So with `"__dict__"` in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use `__slots__` - names that are in `__slots__` point to slotted values, while any other values are put in the instance"s `__dict__`.

Avoiding `__slots__` because you want to be able to add attributes on the fly is actually not a good reason - just add `"__dict__"` to your `__slots__` if this is required.

You can similarly add `__weakref__` to `__slots__` explicitly if you need that feature.

### Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

``````from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
"""MyNT is an immutable and lightweight object"""
__slots__ = ()
``````

usage:

``````>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"
``````

And trying to assign an unexpected attribute raises an `AttributeError` because we have prevented the creation of `__dict__`:

``````>>> nt.quux = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"
``````

You can allow `__dict__` creation by leaving off `__slots__ = ()`, but you can"t use non-empty `__slots__` with subtypes of tuple.

## Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

``````class Foo(object):
__slots__ = "foo", "bar"
class Bar(object):
__slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict
``````

Using an empty `__slots__` in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding `"__dict__"` to get dynamic assignment, see section above) the creation of a `__dict__`:

``````class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"
``````

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty `__slots__` in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

``````class AbstractBase:
__slots__ = ()
def __init__(self, a, b):
self.a = a
self.b = b
def __repr__(self):
return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"
``````

We could use the above directly by inheriting and declaring the expected slots:

``````class Foo(AbstractBase):
__slots__ = "a", "b"
``````

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

``````class AbstractBaseC:
__slots__ = ()
@property
def c(self):
print("getting c!")
return self._c
@c.setter
def c(self, arg):
print("setting c!")
self._c = arg
``````

Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given `AbstractBase` nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

``````class Concretion(AbstractBase, AbstractBaseC):
__slots__ = "a b _c".split()
``````

And now we have functionality from both via multiple inheritance, and can still deny `__dict__` and `__weakref__` instantiation:

``````>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"
``````

## Other cases to avoid slots:

• Avoid them when you want to perform `__class__` assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
• Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
• Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the `__slots__` documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

### Do not "only use `__slots__` when instantiating lots of objects"

I quote:

"You would want to use `__slots__` if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the `collections` module, are not instantiated, yet `__slots__` are declared for them.

Why?

If a user wishes to deny `__dict__` or `__weakref__` creation, those things must not be available in the parent classes.

`__slots__` contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

### `__slots__` doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading `TypeError`:

``````>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
``````

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the `-1` argument. In Python 2.7 this would be `2` (which was introduced in 2.3), and in 3.6 it is `4`.

``````>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>
``````

in Python 2.7:

``````>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>
``````

in Python 3.6

``````>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>
``````

So I would keep this in mind, as it is a solved problem.

## Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here"s the only part that actually answers the question

The proper use of `__slots__` is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the `__dict__` when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid `__slots__`. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with `__slots__`.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site.

# Memory usage evidence

Create some normal objects and slotted objects:

``````>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()
``````

Instantiate a million of them:

``````>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]
``````

Inspect with `guppy.hpy().heap()`:

``````>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
0 1000000  49 64000000  64  64000000  64 __main__.Foo
1     169   0 16281480  16  80281480  80 list
2 1000000  49 16000000  16  96281480  97 __main__.Bar
3   12284   1   987472   1  97268952  97 str
...
``````

Access the regular objects and their `__dict__` and inspect again:

``````>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
1 1000000  33  64000000  17 344000000  91 __main__.Foo
2     169   0  16281480   4 360281480  95 list
3 1000000  33  16000000   4 376281480  99 __main__.Bar
4   12284   0    987472   0 377268952  99 str
...
``````

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accomodate `__dict__` and `__weakrefs__`. (The `__dict__` is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "`__slots__ = []`" to your class.

# `os.listdir()` - list in the current directory

With listdir in os module you get the files and the folders in the current dir

`````` import os
arr = os.listdir()
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

## Looking in a directory

``````arr = os.listdir("c:\files")
``````

# `glob` from glob

with glob you can specify a type of file to list like this

``````import glob

txtfiles = []
for file in glob.glob("*.txt"):
txtfiles.append(file)
``````

## `glob` in a list comprehension

``````mylist = [f for f in glob.glob("*.txt")]
``````

## get the full path of only files in the current directory

``````import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles)

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]
``````

## Getting the full path name with `os.path.abspath`

You get the full path in return

`````` import os
files_path = [os.path.abspath(x) for x in os.listdir()]
print(files_path)

["F:\documentiapplications.txt", "F:\documenticollections.txt"]
``````

## Walk: going through sub directories

os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

``````import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
for file in f:
if file.endswith(".docx"):
print(os.path.join(r, file))
``````

### `os.listdir()`: get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

`````` import os
arr = os.listdir(".")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### To go up in the directory tree

``````# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")
``````

### Get files: `os.listdir()` in a particular directory (Python 2 and 3)

`````` import os
arr = os.listdir("F:\python")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### Get files of a particular subdirectory with `os.listdir()`

``````import os

x = os.listdir("./content")
``````

### `os.walk(".")` - current directory

`````` import os
arr = next(os.walk("."))[2]
print(arr)

>>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]
``````

### `next(os.walk("."))` and `os.path.join("dir", "file")`

`````` import os
arr = []
for d,r,f in next(os.walk("F:\_python")):
for file in f:
arr.append(os.path.join(r,file))

for f in arr:
print(files)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt
``````

### `next(os.walk("F:\")` - get the full path - list comprehension

`````` [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]

>>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]
``````

### `os.walk` - get full path - all files in sub dirs**

``````x = [os.path.join(r,file) for r,d,f in os.walk("F:\_python") for file in f]
print(x)

``````

### `os.listdir()` - get only txt files

`````` arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
print(arr_txt)

>>> ["work.txt", "3ebooks.txt"]
``````

## Using `glob` to get the full path of the files

If I should need the absolute path of the files:

``````from path import path
from glob import glob
x = [path(f).abspath() for f in glob("F:\*.txt")]
for f in x:
print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt
``````

## Using `os.path.isfile` to avoid directories in the list

``````import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]
``````

## Using `pathlib` from Python 3.4

``````import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
if p.is_file():
print(p)
flist.append(p)

>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speak_gui2.py
>>> thumb.PNG
``````

With `list comprehension`:

``````flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]
``````

Alternatively, use `pathlib.Path()` instead of `pathlib.Path(".")`

## Use glob method in pathlib.Path()

``````import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py
``````

## Get all and only files with os.walk

``````import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
for f in t:
y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]
``````

## Get only files with next and walk in a directory

`````` import os
x = next(os.walk("F://python"))[2]
print(x)

>>> ["calculator.bat","calculator.py"]
``````

## Get only directories with next and walk in a directory

`````` import os
next(os.walk("F://python"))[1] # for the current dir use (".")

>>> ["python3","others"]
``````

## Get all the subdir names with `walk`

``````for r,d,f in os.walk("F:\_python"):
for dirs in d:
print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints
``````

## `os.scandir()` from Python 3.5 and greater

``````import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
for entry in i:
if entry.is_file():
print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG
``````

# Examples:

## Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in all the directory and its subdirectories.

``````import os

def count(dir, counter=0):
"returns number of files in dir and subdirs"
for pack in os.walk(dir):
for f in pack[2]:
counter += 1
return dir + " : " + str(counter) + "files"

print(count("F:\python"))

>>> "F:\python" : 12057 files"
``````

## Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

``````import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
"Searches for pptx (or other - pptx is the default) files and copies them"
for pack in os.walk(dir):
for f in pack[2]:
if f.endswith(filetype):
fullpath = pack[0] + "\" + f
print(fullpath)
shutil.copy(fullpath, destination)
counter += 1
if counter > 0:
print("-" * 30)
print("	==> Found in: `" + dir + "` : " + str(counter) + " files
")

for dir in os.listdir():
"searches for folders that starts with `_`"
if dir[0] == "_":
# copyfile(dir, filetype="pdf")
copyfile(dir, filetype="txt")

>>> _compiti18Compito Contabilit√† 1conti.txt
>>> _compiti18Compito Contabilit√† 1modula4.txt
>>> _compiti18Compito Contabilit√† 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files
``````

## Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

``````import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
for eachfile in os.listdir():
mylist += eachfile + "
"
file.write(mylist)
``````

## Example: txt with all the files of an hard drive

``````"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
for root, dirs, files in os.walk("D:\"):
for file in files:
listafile.append(file)
percorso.append(root + "\" + file)
testo.write(file + "
")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
for file in listafile:
testo_ordinato.write(file + "
")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
for file in percorso:
file_percorso.write(file + "
")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")
``````

## All the file of C: in one text file

This is a shorter version of the previous code. Change the folder where to start finding the files if you need to start from another position. This code generate a 50 mb on text file on my computer with something less then 500.000 lines with files with the complete path.

``````import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk("C:\"):
for file in f:
filewrite.write(f"{r + file}
")
``````

## How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

``````import os

def searchfiles(extension=".ttf", folder="H:\"):
"Create a txt file with all the file of a type"
with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png
``````

## (New) Find all files and open them with tkinter GUI

I just wanted to add in this 2019 a little app to search for all files in a dir and be able to open them by doubleclicking on the name of the file in the list.

``````import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
"insert all files in the listbox"
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
lb.insert(0, r + "\" + file)

def open_file():
os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()
``````

I just used the following which was quite simple. First open a console then cd to where you"ve downloaded your file like some-package.whl and use

``````pip install some-package.whl
``````

Note: if pip.exe is not recognized, you may find it in the "Scripts" directory from where python has been installed. If pip is not installed, this page can help: How do I install pip on Windows?

Note: for clarification
If you copy the `*.whl` file to your local drive (ex. C:some-dirsome-file.whl) use the following command line parameters --

``````pip install C:/some-dir/some-file.whl
``````

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

``````df.groupby(["col1","col2"]).size()
``````

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

``````df.groupby(["col1", "col2"]).size().reset_index(name="counts")
``````

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.

## Detailed example:

Consider the following example dataframe:

``````In [2]: df
Out[2]:
col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17
``````

First let"s use `.size()` to get the row counts:

``````In [3]: df.groupby(["col1", "col2"]).size()
Out[3]:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64
``````

Then let"s use `.size().reset_index(name="counts")` to get the row counts:

``````In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]:
col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1
``````

### Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

``````In [5]: (df
...: .groupby(["col1", "col2"])
...: .agg({
...:     "col3": ["mean", "count"],
...:     "col4": ["median", "min", "count"]
...: }))
Out[5]:
col4                  col3
median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1
``````

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

``````In [6]: gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...:  .reset_index()
...: )
...:
Out[6]:
col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
``````

### Footnotes

The code used to generate the test data is shown below:

``````In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["E", "F"],
...:         ["E", "F"],
...:         ["G", "H"]
...:         ])
...:
...: df = pd.DataFrame(
...:     np.hstack([keys,np.random.randn(10,4).round(2)]),
...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] =
...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:
``````

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop `NaN` entries in the mean calculation without telling you about it.

# Using a for loop, how do I access the loop index, from 1 to 5 in this case?

Use `enumerate` to get the index with the element as you iterate:

``````for index, item in enumerate(items):
print(index, item)
``````

And note that Python"s indexes start at zero, so you would get 0 to 4 with the above. If you want the count, 1 to 5, do this:

``````count = 0 # in case items is empty and you need it after the loop
for count, item in enumerate(items, start=1):
print(count, item)
``````

# Unidiomatic control flow

What you are asking for is the Pythonic equivalent of the following, which is the algorithm most programmers of lower-level languages would use:

``````index = 0            # Python"s indexing starts at zero
for item in items:   # Python"s for loops are a "for each" loop
print(index, item)
index += 1
``````

Or in languages that do not have a for-each loop:

``````index = 0
while index < len(items):
print(index, items[index])
index += 1
``````

or sometimes more commonly (but unidiomatically) found in Python:

``````for index in range(len(items)):
print(index, items[index])
``````

# Use the Enumerate Function

Python"s `enumerate` function reduces the visual clutter by hiding the accounting for the indexes, and encapsulating the iterable into another iterable (an `enumerate` object) that yields a two-item tuple of the index and the item that the original iterable would provide. That looks like this:

``````for index, item in enumerate(items, start=0):   # default is zero
print(index, item)
``````

This code sample is fairly well the canonical example of the difference between code that is idiomatic of Python and code that is not. Idiomatic code is sophisticated (but not complicated) Python, written in the way that it was intended to be used. Idiomatic code is expected by the designers of the language, which means that usually this code is not just more readable, but also more efficient.

## Getting a count

Even if you don"t need indexes as you go, but you need a count of the iterations (sometimes desirable) you can start with `1` and the final number will be your count.

``````count = 0 # in case items is empty
for count, item in enumerate(items, start=1):   # default is zero
print(item)

print("there were {0} items printed".format(count))
``````

The count seems to be more what you intend to ask for (as opposed to index) when you said you wanted from 1 to 5.

## Breaking it down - a step by step explanation

To break these examples down, say we have a list of items that we want to iterate over with an index:

``````items = ["a", "b", "c", "d", "e"]
``````

Now we pass this iterable to enumerate, creating an enumerate object:

``````enumerate_object = enumerate(items) # the enumerate object
``````

We can pull the first item out of this iterable that we would get in a loop with the `next` function:

``````iteration = next(enumerate_object) # first iteration from enumerate
print(iteration)
``````

And we see we get a tuple of `0`, the first index, and `"a"`, the first item:

``````(0, "a")
``````

we can use what is referred to as "sequence unpacking" to extract the elements from this two-tuple:

``````index, item = iteration
#   0,  "a" = (0, "a") # essentially this.
``````

and when we inspect `index`, we find it refers to the first index, 0, and `item` refers to the first item, `"a"`.

``````>>> print(index)
0
>>> print(item)
a
``````

# Conclusion

• Python indexes start at zero
• To get these indexes from an iterable as you iterate over it, use the enumerate function
• Using enumerate in the idiomatic way (along with tuple unpacking) creates code that is more readable and maintainable:

So do this:

``````for index, item in enumerate(items, start=0):   # Python indexes start at zero
print(index, item)
``````

Getting some sort of modification date in a cross-platform way is easy - just call `os.path.getmtime(path)` and you"ll get the Unix timestamp of when the file at `path` was last modified.

Getting file creation dates, on the other hand, is fiddly and platform-dependent, differing even between the three big OSes:

Putting this all together, cross-platform code should look something like this...

``````import os
import platform

def creation_date(path_to_file):
"""
Try to get the date that a file was created, falling back to when it was
See http://stackoverflow.com/a/39501288/1709587 for explanation.
"""
if platform.system() == "Windows":
return os.path.getctime(path_to_file)
else:
stat = os.stat(path_to_file)
try:
return stat.st_birthtime
except AttributeError:
# We"re probably on Linux. No easy way to get creation dates here,
return stat.st_mtime
``````

I noticed that every now and then I need to Google fopen all over again, just to build a mental image of what the primary differences between the modes are. So, I thought a diagram will be faster to read next time. Maybe someone else will find that helpful too.

I would suggest using the duplicated method on the Pandas Index itself:

``````df3 = df3[~df3.index.duplicated(keep="first")]
``````

While all the other methods work, `.drop_duplicates` is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

``````>>> %timeit df3.reset_index().drop_duplicates(subset="index", keep="first").set_index("index")
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 ¬µs per loop

>>> %timeit df3[~df3.index.duplicated(keep="first")]
1000 loops, best of 3: 307 ¬µs per loop
``````

Note that you can keep the last element by changing the keep argument to `"last"`.

It should also be noted that this method works with `MultiIndex` as well (using df1 as specified in Paul"s example):

``````>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 ¬µs per loop

>>> %timeit df1[~df1.index.duplicated(keep="last")]
1000 loops, best of 3: 365 ¬µs per loop
``````

Here"s a concise solution which avoids regular expressions and slow in-Python loops:

``````def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]
``````

See the Community Wiki answer started by @davidism for benchmark results. In summary,

David Zhang"s solution is the clear winner, outperforming all others by at least 5x for the large example set.

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of `s` in `(s+s)[1:-1]`, and for informing me of the optional `start` and `end` arguments of Python"s `string.find`.

## List of lists changes reflected across sublists unexpectedly

### Question by Charles Anderson

I needed to create a list of lists in Python, so I typed the following:

``````my_list = [[1] * 4] * 3
``````

The list looked like this:

``````[[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
``````

Then I changed one of the innermost values:

``````my_list[0][0] = 5
``````

Now my list looks like this:

``````[[5, 1, 1, 1], [5, 1, 1, 1], [5, 1, 1, 1]]
``````

which is not what I wanted or expected. Can someone please explain what"s going on, and how to get around it?

## Split a python list into other "sublists" i.e smaller lists

I have a python list which runs into 1000"s. Something like:

``````data=["I";"am";"a";"python";"programmer".....]
``````

where, len(data)= say 1003

I would now like to create a subset of this list (data) by splitting the orginal list into chunks of 100. So, at the end, Id like to have something like:

``````data_chunk1=[.....] #first 100 items of list data
data_chunk2=[.....] #second 100 items of list data
.
.
.
data_chunk11=[.....] # remainder of the entries,& its len <=100, len(data_chunk_11)=3
``````

Is there a pythonic way to achieve this task? Obviously I can use data[0:100] and so on, but I am assuming that is terribly non-pythonic and very inefficient.

Many thanks.

## Extract first item of each sublist

I am wondering what is the best way to extract the first item of each sublist in a list of lists and append it to a new list. So if I have:

``````lst = [[a,b,c], [1,2,3], [x,y,z]]
``````

and I want to pull out `a`, `1` and `x` and create a separate list from those.

I tried:

``````lst2.append(x[0] for x in lst)
``````

I tested most suggested solutions with perfplot (a pet project of mine, essentially a wrapper around `timeit`), and found

``````import functools
import operator
functools.reduce(operator.iconcat, a, [])
``````

to be the fastest solution, both when many small lists and few long lists are concatenated. (`operator.iadd` is equally fast.)

Code to reproduce the plot:

``````import functools
import itertools
import numpy
import operator
import perfplot

def forfor(a):
return [item for sublist in a for item in sublist]

def sum_brackets(a):
return sum(a, [])

def functools_reduce(a):
return functools.reduce(operator.concat, a)

def functools_reduce_iconcat(a):
return functools.reduce(operator.iconcat, a, [])

def itertools_chain(a):
return list(itertools.chain.from_iterable(a))

def numpy_flat(a):
return list(numpy.array(a).flat)

def numpy_concatenate(a):
return list(numpy.concatenate(a))

perfplot.show(
setup=lambda n: [list(range(10))] * n,
# setup=lambda n: [list(range(n))] * 10,
kernels=[
forfor,
sum_brackets,
functools_reduce,
functools_reduce_iconcat,
itertools_chain,
numpy_flat,
numpy_concatenate,
],
n_range=[2 ** k for k in range(16)],
xlabel="num lists (of length 10)",
# xlabel="len lists (10 lists total)"
)
``````

There is a clean, one-line way of doing this in Pandas:

``````df["col_3"] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)
``````

This allows `f` to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.

Example with data (based on original question):

``````import pandas as pd

df = pd.DataFrame({"ID":["1", "2", "3"], "col_1": [0, 2, 3], "col_2":[1, 4, 5]})
mylist = ["a", "b", "c", "d", "e", "f"]

def get_sublist(sta,end):
return mylist[sta:end+1]

df["col_3"] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)
``````

Output of `print(df)`:

``````  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
``````

If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:

``````df["col_3"] = df.apply(lambda x: f(x["col 1"], x["col 2"]), axis=1)
``````

I think all of the answers here cover the core of what the lambda function does in the context of sorted() quite nicely, however I still feel like a description that leads to an intuitive understanding is lacking, so here is my two cents.

For the sake of completeness, I"ll state the obvious up front: sorted() returns a list of sorted elements and if we want to sort in a particular way or if we want to sort a complex list of elements (e.g. nested lists or a list of tuples) we can invoke the key argument.

For me, the intuitive understanding of the key argument, why it has to be callable, and the use of lambda as the (anonymous) callable function to accomplish this comes in two parts.

1. Using lamba ultimately means you don"t have to write (define) an entire function, like the one sblom provided an example of. Lambda functions are created, used, and immediately destroyed - so they don"t funk up your code with more code that will only ever be used once. This, as I understand it, is the core utility of the lambda function and its application for such a role is broad. Its syntax is purely a convention, which is in essence the nature of programmatic syntax in general. Learn the syntax and be done with it.

Lambda syntax is as follows:

``````lambda input_variable(s): tasty one liner
``````

where `lambda` is a python keyword.

e.g.

``````In [1]: f00 = lambda x: x/2

In [2]: f00(10)
Out[2]: 5.0

In [3]: (lambda x: x/2)(10)
Out[3]: 5.0

In [4]: (lambda x, y: x / y)(10, 2)
Out[4]: 5.0

In [5]: (lambda: "amazing lambda")() # func with no args!
Out[5]: "amazing lambda"
``````
1. The idea behind the `key` argument is that it should take in a set of instructions that will essentially point the "sorted()" function at those list elements which should be used to sort by. When it says `key=`, what it really means is: As I iterate through the list, one element at a time (i.e. `for e in some_list`), I"m going to pass the current element to the function specifed by the key argument and use that to create a transformed list which will inform me on the order of the final sorted list.

Check it out:

``````In [6]: mylist = [3, 6, 3, 2, 4, 8, 23]  # an example list
# sorted(mylist, key=HowToSort)  # what we will be doing
``````

Base example:

``````# mylist = [3, 6, 3, 2, 4, 8, 23]
In [7]: sorted(mylist)
Out[7]: [2, 3, 3, 4, 6, 8, 23]
# all numbers are in ascending order (i.e.from low to high).
``````

Example 1:

``````# mylist = [3, 6, 3, 2, 4, 8, 23]
In [8]: sorted(mylist, key=lambda x: x % 2 == 0)

# Quick Tip: The % operator returns the *remainder* of a division
# operation. So the key lambda function here is saying "return True
# if x divided by 2 leaves a remainer of 0, else False". This is a
# typical way to check if a number is even or odd.

Out[8]: [3, 3, 23, 6, 2, 4, 8]
# Does this sorted result make intuitive sense to you?
``````

Notice that my lambda function told `sorted` to check if each element `e` was even or odd before sorting.

BUT WAIT! You may (or perhaps should) be wondering two things.

First, why are the odd numbers coming before the even numbers? After all, the key value seems to be telling the `sorted` function to prioritize evens by using the `mod` operator in `x % 2 == 0`.

Second, why are the even numbers still out of order? 2 comes before 6, right?

By analyzing this result, we"ll learn something deeper about how the "key" argument really works, especially in conjunction with the anonymous lambda function.

Firstly, you"ll notice that while the odds come before the evens, the evens themselves are not sorted. Why is this?? Lets read the docs:

Key Functions Starting with Python 2.4, both list.sort() and sorted() added a key parameter to specify a function to be called on each list element prior to making comparisons.

We have to do a little bit of reading between the lines here, but what this tells us is that the sort function is only called once, and if we specify the key argument, then we sort by the value that key function points us to.

So what does the example using a modulo return? A boolean value: `True == 1`, `False == 0`. So how does sorted deal with this key? It basically transforms the original list to a sequence of 1s and 0s.

`[3, 6, 3, 2, 4, 8, 23]` becomes `[0, 1, 0, 1, 1, 1, 0]`

Now we"re getting somewhere. What do you get when you sort the transformed list?

`[0, 0, 0, 1, 1, 1, 1]`

Okay, so now we know why the odds come before the evens. But the next question is: Why does the 6 still come before the 2 in my final list? Well that"s easy - it is because sorting only happens once! Those 1s still represent the original list values, which are in their original positions relative to each other. Since sorting only happens once, and we don"t call any kind of sort function to order the original even numbers from low to high, those values remain in their original order relative to one another.

The final question is then this: How do I think conceptually about how the order of my boolean values get transformed back in to the original values when I print out the final sorted list?

Sorted() is a built-in method that (fun fact) uses a hybrid sorting algorithm called Timsort that combines aspects of merge sort and insertion sort. It seems clear to me that when you call it, there is a mechanic that holds these values in memory and bundles them with their boolean identity (mask) determined by (...!) the lambda function. The order is determined by their boolean identity calculated from the lambda function, but keep in mind that these sublists (of one"s and zeros) are not themselves sorted by their original values. Hence, the final list, while organized by Odds and Evens, is not sorted by sublist (the evens in this case are out of order). The fact that the odds are ordered is because they were already in order by coincidence in the original list. The takeaway from all this is that when lambda does that transformation, the original order of the sublists are retained.

So how does this all relate back to the original question, and more importantly, our intuition on how we should implement sorted() with its key argument and lambda?

That lambda function can be thought of as a pointer that points to the values we need to sort by, whether its a pointer mapping a value to its boolean transformed by the lambda function, or if its a particular element in a nested list, tuple, dict, etc., again determined by the lambda function.

Lets try and predict what happens when I run the following code.

``````In [9]: mylist = [(3, 5, 8), (6, 2, 8), (2, 9, 4), (6, 8, 5)]
In[10]: sorted(mylist, key=lambda x: x[1])
``````

My `sorted` call obviously says, "Please sort this list". The key argument makes that a little more specific by saying, "for each element `x` in `mylist`, return the second index of that element, then sort all of the elements of the original list `mylist` by the sorted order of the list calculated by the lambda function. Since we have a list of tuples, we can return an indexed element from that tuple using the lambda function.

The pointer that will be used to sort would be:

``````[5, 2, 9, 8] # the second element of each tuple
``````

Sorting this pointer list returns:

``````[2, 5, 8, 9]
``````

Applying this to `mylist`, we get:

``````Out[10]: [(6, 2, 8), (3, 5, 8), (6, 8, 5), (2, 9, 4)]
# Notice the sorted pointer list is the same as the second index of each tuple in this final list
``````

Run that code, and you"ll find that this is the order. Try sorting a list of integers using this key function and you"ll find that the code breaks (why? Because you cannot index an integer of course).

This was a long winded explanation, but I hope this helps to `sort` your intuition on the use of `lambda` functions - as the key argument in sorted(), and beyond.

If you want to flatten a data-structure where you don"t know how deep it"s nested you could use `iteration_utilities.deepflatten`1

``````>>> from iteration_utilities import deepflatten

>>> l = [[1, 2, 3], [4, 5, 6], [7], [8, 9]]
>>> list(deepflatten(l, depth=1))
[1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> l = [[1, 2, 3], [4, [5, 6]], 7, [8, 9]]
>>> list(deepflatten(l))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
``````

It"s a generator so you need to cast the result to a `list` or explicitly iterate over it.

To flatten only one level and if each of the items is itself iterable you can also use `iteration_utilities.flatten` which itself is just a thin wrapper around `itertools.chain.from_iterable`:

``````>>> from iteration_utilities import flatten
>>> l = [[1, 2, 3], [4, 5, 6], [7], [8, 9]]
>>> list(flatten(l))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
``````

Just to add some timings (based on Nico Schl√∂mer"s answer that didn"t include the function presented in this answer):

It"s a log-log plot to accommodate for the huge range of values spanned. For qualitative reasoning: Lower is better.

The results show that if the iterable contains only a few inner iterables then `sum` will be fastest, however for long iterables only the `itertools.chain.from_iterable`, `iteration_utilities.deepflatten` or the nested comprehension have reasonable performance with `itertools.chain.from_iterable` being the fastest (as already noticed by Nico Schl√∂mer).

``````from itertools import chain
from functools import reduce
from collections import Iterable  # or from collections.abc import Iterable
import operator
from iteration_utilities import deepflatten

def nested_list_comprehension(lsts):
return [item for sublist in lsts for item in sublist]

def itertools_chain_from_iterable(lsts):
return list(chain.from_iterable(lsts))

def pythons_sum(lsts):
return sum(lsts, [])

return reduce(lambda x, y: x + y, lsts)

def pylangs_flatten(lsts):
return list(flatten(lsts))

def flatten(items):
"""Yield items from any nested iterable; see REF."""
for x in items:
if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
yield from flatten(x)
else:
yield x

def reduce_concat(lsts):
return reduce(operator.concat, lsts)

def iteration_utilities_deepflatten(lsts):
return list(deepflatten(lsts, depth=1))

from simple_benchmark import benchmark

b = benchmark(
pylangs_flatten, reduce_concat, iteration_utilities_deepflatten],
arguments={2**i: [[0]*5]*(2**i) for i in range(1, 13)},
argument_name="number of inner lists"
)

b.plot()
``````

1 Disclaimer: I"m the author of that library

Given a list of lists `t`,

``````flat_list = [item for sublist in t for item in sublist]
``````

which means:

``````flat_list = []
for sublist in t:
for item in sublist:
flat_list.append(item)
``````

is faster than the shortcuts posted so far. (`t` is the list to flatten.)

Here is the corresponding function:

``````def flatten(t):
return [item for sublist in t for item in sublist]
``````

As evidence, you can use the `timeit` module in the standard library:

``````\$ python -mtimeit -s"t=[[1,2,3],[4,5,6], [7], [8,9]]*99" "[item for sublist in t for item in sublist]"
10000 loops, best of 3: 143 usec per loop
\$ python -mtimeit -s"t=[[1,2,3],[4,5,6], [7], [8,9]]*99" "sum(t, [])"
1000 loops, best of 3: 969 usec per loop
\$ python -mtimeit -s"t=[[1,2,3],[4,5,6], [7], [8,9]]*99" "reduce(lambda x,y: x+y,t)"
1000 loops, best of 3: 1.1 msec per loop
``````

Explanation: the shortcuts based on `+` (including the implied use in `sum`) are, of necessity, `O(T**2)` when there are T sublists -- as the intermediate result list keeps getting longer, at each step a new intermediate result list object gets allocated, and all the items in the previous intermediate result must be copied over (as well as a few new ones added at the end). So, for simplicity and without actual loss of generality, say you have T sublists of k items each: the first k items are copied back and forth T-1 times, the second k items T-2 times, and so on; total number of copies is k times the sum of x for x from 1 to T excluded, i.e., `k * (T**2)/2`.

The list comprehension just generates one list, once, and copies each item over (from its original place of residence to the result list) also exactly once.

When you write `[x]*3` you get, essentially, the list `[x, x, x]`. That is, a list with 3 references to the same `x`. When you then modify this single `x` it is visible via all three references to it:

``````x = [1] * 4
l = [x] * 3
print(f"id(x): {id(x)}")
# id(x): 140560897920048
print(
f"id(l[0]): {id(l[0])}
"
f"id(l[1]): {id(l[1])}
"
f"id(l[2]): {id(l[2])}"
)
# id(l[0]): 140560897920048
# id(l[1]): 140560897920048
# id(l[2]): 140560897920048

x[0] = 42
print(f"x: {x}")
# x: [42, 1, 1, 1]
print(f"l: {l}")
# l: [[42, 1, 1, 1], [42, 1, 1, 1], [42, 1, 1, 1]]
``````

To fix it, you need to make sure that you create a new list at each position. One way to do it is

``````[[1]*4 for _ in range(3)]
``````

which will reevaluate `[1]*4` each time instead of evaluating it once and making 3 references to 1 list.

You might wonder why `*` can"t make independent objects the way the list comprehension does. That"s because the multiplication operator `*` operates on objects, without seeing expressions. When you use `*` to multiply `[[1] * 4]` by 3, `*` only sees the 1-element list `[[1] * 4]` evaluates to, not the `[[1] * 4` expression text. `*` has no idea how to make copies of that element, no idea how to reevaluate `[[1] * 4]`, and no idea you even want copies, and in general, there might not even be a way to copy the element.

The only option `*` has is to make new references to the existing sublist instead of trying to make new sublists. Anything else would be inconsistent or require major redesigning of fundamental language design decisions.

In contrast, a list comprehension reevaluates the element expression on every iteration. `[[1] * 4 for n in range(3)]` reevaluates `[1] * 4` every time for the same reason `[x**2 for x in range(3)]` reevaluates `x**2` every time. Every evaluation of `[1] * 4` generates a new list, so the list comprehension does what you wanted.

Incidentally, `[1] * 4` also doesn"t copy the elements of `[1]`, but that doesn"t matter, since integers are immutable. You can"t do something like `1.value = 2` and turn a 1 into a 2.

``````buckets = [0] * 100
``````

Careful - this technique doesn"t generalize to multidimensional arrays or lists of lists. Which leads to the List of lists changes reflected across sublists unexpectedly problem

If I understood the question correctly, you can use the slicing notation to keep everything except the last item:

``````record = record[:-1]
``````

But a better way is to delete the item directly:

``````del record[-1]
``````

Note 1: Note that using record = record[:-1] does not really remove the last element, but assign the sublist to record. This makes a difference if you run it inside a function and record is a parameter. With record = record[:-1] the original list (outside the function) is unchanged, with del record[-1] or record.pop() the list is changed. (as stated by @pltrdy in the comments)

Note 2: The code could use some Python idioms. I highly recommend reading this:
Code Like a Pythonista: Idiomatic Python (via wayback machine archive).

Flatten the list to "remove the brackets" using a nested list comprehension. This will un-nest each list stored in your list of lists!

``````list_of_lists = [[180.0], [173.8], [164.2], [156.5], [147.2], [138.2]]
flattened = [val for sublist in list_of_lists for val in sublist]
``````

Nested list comprehensions evaluate in the same manner that they unwrap (i.e. add newline and tab for each new loop. So in this case:

``````flattened = [val for sublist in list_of_lists for val in sublist]
``````

is equivalent to:

``````flattened = []
for sublist in list_of_lists:
for val in sublist:
flattened.append(val)
``````

The big difference is that the list comp evaluates MUCH faster than the unraveled loop and eliminates the append calls!

If you have multiple items in a sublist the list comp will even flatten that. ie

``````>>> list_of_lists = [[180.0, 1, 2, 3], [173.8], [164.2], [156.5], [147.2], [138.2]]
>>> flattened  = [val for sublist in list_of_lists for val in sublist]
>>> flattened
[180.0, 1, 2, 3, 173.8, 164.2, 156.5, 147.2,138.2]
``````

If you want:

``````c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
c3 = [[13, 32], [7, 13, 28], [1,6]]
``````

Then here is your solution for Python 2:

``````c3 = [filter(lambda x: x in c1, sublist) for sublist in c2]
``````

In Python 3 `filter` returns an iterable instead of `list`, so you need to wrap `filter` calls with `list()`:

``````c3 = [list(filter(lambda x: x in c1, sublist)) for sublist in c2]
``````

Explanation:

The filter part takes each sublist"s item and checks to see if it is in the source list c1. The list comprehension is executed for each sublist in c2.