PHP program for finding the standard deviation of an array

Examples:
`Input: array(2, 3, 5, 6, 7) Output: 1.8547236990991`

`Input: array(1, 2, 3, 4, 5) Output: 1.4142135623731`
This problem can be solved with PHP's built-in functions. The functions used are as follows:
• array_sum() : This function returns the sum of all the elements in the array.
• count() : This function determines the number of elements currently present in the given array.
• sqrt() : The function returns the square root of the specified number.
To calculate the standard deviation, we first compute the sum of the squared differences between each element and the mean. Dividing that sum by the number of elements gives the (population) variance, and the square root of the variance is the standard deviation: √(sum_of_squared_differences / no_of_elements).

Below is the PHP implementation to calculate the standard deviation:

```
<?php
// Function to calculate the (population) standard deviation
// of the array elements
function Stand_Deviation($arr)
{
    $num_of_elements = count($arr);

    // Accumulates the sum of squared differences;
    // it becomes the variance once divided by the element count
    $variance = 0.0;

    // Calculate the average using the array_sum() function
    $average = array_sum($arr) / $num_of_elements;

    foreach ($arr as $i) {
        // Sum of the squared differences between
        // each number and the mean
        $variance += pow(($i - $average), 2);
    }

    return (float) sqrt($variance / $num_of_elements);
}

// Input array
$arr = array(2, 3, 5, 6, 7);

print_r(Stand_Deviation($arr));
?>
```

Output: `1.8547236990991`
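As a check on the printed output, the calculation for the input array (2, 3, 5, 6, 7) can be worked by hand. The symbols below (x̄ for the mean, N for the number of elements, σ for the standard deviation) are notation introduced here for illustration and do not appear in the code:

$$
\bar{x} = \frac{2 + 3 + 5 + 6 + 7}{5} = \frac{23}{5} = 4.6
$$

$$
\sum_{i=1}^{N} (x_i - \bar{x})^2 = 6.76 + 2.56 + 0.16 + 1.96 + 5.76 = 17.2
$$

$$
\sigma = \sqrt{\frac{17.2}{5}} = \sqrt{3.44} \approx 1.8547
$$

Note that dividing by N gives the population standard deviation, which is what the function above computes. If the array is a sample drawn from a larger population, the usual convention is to divide by N − 1 instead, which for this input would give √(17.2 / 4) ≈ 2.0736.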
```
import os def searchfiles(extension=".ttf", folder="H:\"): "Create a txt file with all the file of a type" with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite: for r, d, f in os.walk(folder): for file in f: if file.endswith(extension): filewrite.write(f"{r + file} ") # looking for png file (fonts) in the hard disk H: searchfiles(".png", "H:\") >>> H:4bs_18Dolphins5.png >>> H:4bs_18Dolphins6.png >>> H:4bs_18Dolphins7.png >>> H:5_18marketing htmlassetsimageslogo2.png >>> H:7z001.png >>> H:7z002.png (New) Find all files and open them with tkinter GUI I just wanted to add in this 2019 a little app to search for all files in a dir and be able to open them by doubleclicking on the name of the file in the list. import tkinter as tk import os def searchfiles(extension=".txt", folder="H:\"): "insert all files in the listbox" for r, d, f in os.walk(folder): for file in f: if file.endswith(extension): lb.insert(0, r + "\" + file) def open_file(): os.startfile(lb.get(lb.curselection()[0])) root = tk.Tk() root.geometry("400x400") bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\")) bt.pack() lb = tk.Listbox(root) lb.pack(fill="both", expand=1) lb.bind("<Double-Button>", lambda x: open_file()) root.mainloop() Answer #4 I just used the following which was quite simple. First open a console then cd to where you"ve downloaded your file like some-package.whl and use pip install some-package.whl Note: if pip.exe is not recognized, you may find it in the "Scripts" directory from where python has been installed. If pip is not installed, this page can help: How do I install pip on Windows? Note: for clarification If you copy the *.whl file to your local drive (ex. 
C:some-dirsome-file.whl) use the following command line parameters -- pip install C:/some-dir/some-file.whl Answer #5 Quick Answer: The simplest way to get row counts per group is by calling .size(), which returns a Series: df.groupby(["col1","col2"]).size() Usually you want this result as a DataFrame (instead of a Series) so you can do: df.groupby(["col1", "col2"]).size().reset_index(name="counts") If you want to find out how to calculate the row counts and other statistics for each group continue reading below. Detailed example: Consider the following example dataframe: In [2]: df Out[2]: col1 col2 col3 col4 col5 col6 0 A B 0.20 -0.61 -0.49 1.49 1 A B -1.53 -1.01 -0.39 1.82 2 A B -0.44 0.27 0.72 0.11 3 A B 0.28 -1.32 0.38 0.18 4 C D 0.12 0.59 0.81 0.66 5 C D -0.13 -1.65 -1.64 0.50 6 C D -1.42 -0.11 -0.18 -0.44 7 E F -0.00 1.42 -0.26 1.17 8 E F 0.91 -0.47 1.35 -0.34 9 G H 1.48 -0.63 -1.14 0.17 First let"s use .size() to get the row counts: In [3]: df.groupby(["col1", "col2"]).size() Out[3]: col1 col2 A B 4 C D 3 E F 2 G H 1 dtype: int64 Then let"s use .size().reset_index(name="counts") to get the row counts: In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts") Out[4]: col1 col2 counts 0 A B 4 1 C D 3 2 E F 2 3 G H 1 Including results for more statistics When you want to calculate statistics on grouped data, it usually looks like this: In [5]: (df ...: .groupby(["col1", "col2"]) ...: .agg({ ...: "col3": ["mean", "count"], ...: "col4": ["median", "min", "count"] ...: })) Out[5]: col4 col3 median min count mean count col1 col2 A B -0.810 -1.32 4 -0.372500 4 C D -0.110 -1.65 3 -0.476667 3 E F 0.475 -0.47 2 0.455000 2 G H -0.630 -0.63 1 1.480000 1 The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis. To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. 
It looks like this: In [6]: gb = df.groupby(["col1", "col2"]) ...: counts = gb.size().to_frame(name="counts") ...: (counts ...: .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"})) ...: .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"})) ...: .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"})) ...: .reset_index() ...: ) ...: Out[6]: col1 col2 counts col3_mean col4_median col4_min 0 A B 4 -0.372500 -0.810 -1.32 1 C D 3 -0.476667 -0.110 -1.65 2 E F 2 0.455000 0.475 -0.47 3 G H 1 1.480000 -0.630 -0.63 Footnotes The code used to generate the test data is shown below: In [1]: import numpy as np ...: import pandas as pd ...: ...: keys = np.array([ ...: ["A", "B"], ...: ["A", "B"], ...: ["A", "B"], ...: ["A", "B"], ...: ["C", "D"], ...: ["C", "D"], ...: ["C", "D"], ...: ["E", "F"], ...: ["E", "F"], ...: ["G", "H"] ...: ]) ...: ...: df = pd.DataFrame( ...: np.hstack([keys,np.random.randn(10,4).round(2)]), ...: columns = ["col1", "col2", "col3", "col4", "col5", "col6"] ...: ) ...: ...: df[["col3", "col4", "col5", "col6"]] = ...: df[["col3", "col4", "col5", "col6"]].astype(float) ...: Disclaimer: If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it. Answer #6 Using a for loop, how do I access the loop index, from 1 to 5 in this case? Use enumerate to get the index with the element as you iterate: for index, item in enumerate(items): print(index, item) And note that Python"s indexes start at zero, so you would get 0 to 4 with the above. 
If you want the count, 1 to 5, do this:

```
count = 0  # in case items is empty and you need it after the loop
for count, item in enumerate(items, start=1):
    print(count, item)
```

Unidiomatic control flow

What you are asking for is the Pythonic equivalent of the following, which is the algorithm most programmers of lower-level languages would use:

```
index = 0  # Python's indexing starts at zero
for item in items:  # Python's for loops are a "for each" loop
    print(index, item)
    index += 1
```

Or in languages that do not have a for-each loop:

```
index = 0
while index < len(items):
    print(index, items[index])
    index += 1
```

or sometimes more commonly (but unidiomatically) found in Python:

```
for index in range(len(items)):
    print(index, items[index])
```

Use the Enumerate Function

Python's enumerate function reduces the visual clutter by hiding the accounting for the indexes, and encapsulating the iterable into another iterable (an enumerate object) that yields a two-item tuple of the index and the item that the original iterable would provide. That looks like this:

```
for index, item in enumerate(items, start=0):  # default is zero
    print(index, item)
```

This code sample is fairly well the canonical example of the difference between code that is idiomatic of Python and code that is not. Idiomatic code is sophisticated (but not complicated) Python, written in the way that it was intended to be used. Idiomatic code is expected by the designers of the language, which means that usually this code is not just more readable, but also more efficient.

Getting a count

Even if you don't need indexes as you go, but you need a count of the iterations (sometimes desirable) you can start with 1 and the final number will be your count.

```
count = 0  # in case items is empty
for count, item in enumerate(items, start=1):  # the default start is zero
    print(item)

print("there were {0} items printed".format(count))
```

The count seems to be more what you intend to ask for (as opposed to index) when you said you wanted from 1 to 5.
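The counting idiom above can be wrapped up as a small self-contained sketch. The helper name `number_items` and the sample list are made up for illustration; only the built-in `enumerate` and its `start` argument come from the answer.

```python
def number_items(items, start=1):
    """Return (count, item) pairs, counting from `start` instead of 0."""
    return list(enumerate(items, start=start))


items = ["a", "b", "c", "d", "e"]
pairs = number_items(items)

for count, item in pairs:
    print(count, item)

# With an empty input there are no pairs at all, which is why the answer
# initialises `count = 0` before the loop when the count is needed afterwards.
empty_pairs = number_items([])
```

Collecting the pairs into a list is only done here so the result can be inspected; in a plain loop, iterating over `enumerate(items, start=1)` directly is enough.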
Breaking it down - a step by step explanation

To break these examples down, say we have a list of items that we want to iterate over with an index:

```
items = ["a", "b", "c", "d", "e"]
```

Now we pass this iterable to enumerate, creating an enumerate object:

```
enumerate_object = enumerate(items)  # the enumerate object
```

We can pull the first item out of this iterable that we would get in a loop with the next function:

```
iteration = next(enumerate_object)  # first iteration from enumerate
print(iteration)
```

And we see we get a tuple of 0, the first index, and "a", the first item:

```
(0, 'a')
```

We can use what is referred to as "sequence unpacking" to extract the elements from this two-tuple:

```
index, item = iteration
#   0,  "a" = (0, "a")  # essentially this.
```

and when we inspect index, we find it refers to the first index, 0, and item refers to the first item, "a".

```
>>> print(index)
0
>>> print(item)
a
```

Conclusion

• Python indexes start at zero.
• To get these indexes from an iterable as you iterate over it, use the enumerate function.
• Using enumerate in the idiomatic way (along with tuple unpacking) creates code that is more readable and maintainable.

So do this:

```
for index, item in enumerate(items, start=0):  # Python indexes start at zero
    print(index, item)
```

Answer #7

Getting some sort of modification date in a cross-platform way is easy - just call os.path.getmtime(path) and you'll get the Unix timestamp of when the file at path was last modified.

Getting file creation dates, on the other hand, is fiddly and platform-dependent, differing even between the three big OSes:

On Windows, a file's ctime (documented at https://msdn.microsoft.com/en-us/library/14h5k7ff.aspx) stores its creation date. You can access this in Python through os.path.getctime() or the .st_ctime attribute of the result of a call to os.stat(). This won't work on Unix, where the ctime is the last time that the file's attributes or content were changed.
On Mac, as well as some other Unix-based OSes, you can use the .st_birthtime attribute of the result of a call to os.stat().

On Linux, this is currently impossible, at least without writing a C extension for Python. Although some file systems commonly used with Linux do store creation dates (for example, ext4 stores them in st_crtime), the Linux kernel offers no way of accessing them; in particular, the structs it returns from stat() calls in C, as of the latest kernel version, don't contain any creation date fields. You can also see that the identifier st_crtime doesn't currently feature anywhere in the Python source. At least if you're on ext4, the data is attached to the inodes in the file system, but there's no convenient way of accessing it.

The next-best thing on Linux is to access the file's mtime, through either os.path.getmtime() or the .st_mtime attribute of an os.stat() result. This will give you the last time the file's content was modified, which may be adequate for some use cases.

Putting this all together, cross-platform code should look something like this...

```
import os
import platform

def creation_date(path_to_file):
    """
    Try to get the date that a file was created, falling back to when it was
    last modified if that isn't possible.
    See http://stackoverflow.com/a/39501288/1709587 for explanation.
    """
    if platform.system() == "Windows":
        return os.path.getctime(path_to_file)
    else:
        stat = os.stat(path_to_file)
        try:
            return stat.st_birthtime
        except AttributeError:
            # We're probably on Linux. No easy way to get creation dates here,
            # so we'll settle for when its content was last modified.
            return stat.st_mtime
```

Answer #8

I noticed that every now and then I need to Google fopen all over again, just to build a mental image of what the primary differences between the modes are. So, I thought a diagram will be faster to read next time. Maybe someone else will find that helpful too.
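As a rough companion to that comparison, the sketch below exercises a few of the modes through Python's built-in open(), which uses the same mode letters as C's fopen. The file name and the strings written are arbitrary choices for illustration, and only four modes are covered.

```python
import os
import tempfile


def read_text(path):
    """Return the whole file as a string."""
    with open(path, "r") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # "w": create the file (or truncate an existing one), write only.
    with open(path, "w") as f:
        f.write("one")
    after_w = read_text(path)

    # "a": create if missing, every write goes to the end of the file.
    with open(path, "a") as f:
        f.write("two")
    after_a = read_text(path)

    # "r+": the file must already exist; read and write, no truncation,
    # position starts at the beginning, so this overwrites in place.
    with open(path, "r+") as f:
        f.write("ONE")
    after_r_plus = read_text(path)

    # "w+": read and write, but the file is truncated on open.
    with open(path, "w+") as f:
        after_w_plus = f.read()

print(repr(after_w), repr(after_a), repr(after_r_plus), repr(after_w_plus))
```

The practical split is whether the mode truncates ("w", "w+"), whether it requires the file to exist ("r", "r+"), and whether writes are forced to the end ("a", "a+").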
Answer #9

I would suggest using the duplicated method on the Pandas Index itself:

```
df3 = df3[~df3.index.duplicated(keep="first")]
```

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

```
>>> %timeit df3.reset_index().drop_duplicates(subset="index", keep="first").set_index("index")
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep="first")]
1000 loops, best of 3: 307 µs per loop
```

Note that you can keep the last element by changing the keep argument to "last".

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

```
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep="last")]
1000 loops, best of 3: 365 µs per loop
```

Answer #10

Here's a concise solution which avoids regular expressions and slow in-Python loops:

```
def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
```

See the Community Wiki answer started by @davidism for benchmark results. In summary, David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set. (That answer's words, not mine.)

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python's string.find.

StackOverflow Questions

InsecurePlatformWarning: A true SSLContext object is not available.
This prevents urllib3 from configuring SSL appropriately Tried to perform REST GET through python requests with the following code and I got error. Code snip: import requests header = {"Authorization": "Bearer..."} url = az_base_url + az_subscription_id + "/resourcegroups/Default-Networking/resources?" + az_api_version r = requests.get(url, headers=header) Error: /usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/util/ssl_.py:79: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning. InsecurePlatformWarning My python version is 2.7.3. I tried to install urllib3 and requests[security] as some other thread suggests, I still got the same error. Wonder if anyone can provide some tips? Dynamic instantiation from string name of a class in dynamically imported module? In python, I have to instantiate certain class, knowing its name in a string, but this class "lives" in a dynamically imported module. An example follows: loader-class script: import sys class loader: def __init__(self, module_name, class_name): # both args are strings try: __import__(module_name) modul = sys.modules[module_name] instance = modul.class_name() # obviously this doesn"t works, here is my main problem! except ImportError: # manage import error some-dynamically-loaded-module script: class myName: # etc... I use this arrangement to make any dynamically-loaded-module to be used by the loader-class following certain predefined behaviours in the dyn-loaded-modules... pandas loc vs. iloc vs. at vs. iat? Recently began branching out from my safe place (R) into Python and and am a bit confused by the cell localization/selection in Pandas. 
I"ve read the documentation but I"m struggling to understand the practical implications of the various localization/selection options. Is there a reason why I should ever use .loc or .iloc over at, and iat or vice versa? In what situations should I use which method? Note: future readers be aware that this question is old and was written before pandas v0.20 when there used to exist a function called .ix. This method was later split into two - loc and iloc - to make the explicit distinction between positional and label based indexing. Please beware that ix was discontinued due to inconsistent behavior and being hard to grok, and no longer exists in current versions of pandas (>= 1.0). How to get all of the immediate subdirectories in Python I"m trying to write a simple Python script that will copy a index.tpl to index.html in all of the subdirectories (with a few exceptions). I"m getting bogged down by trying to get the list of subdirectories. Standard deviation of a list I want to find mean and standard deviation of 1st, 2nd,... digits of several (Z) lists. For example, I have A_rank=[0.8,0.4,1.2,3.7,2.6,5.8] B_rank=[0.1,2.8,3.7,2.6,5,3.4] C_Rank=[1.2,3.4,0.5,0.1,2.5,6.1] # etc (up to Z_rank )... Now I want to take the mean and std of *_Rank[0], the mean and std of *_Rank[1], etc. (ie: mean and std of the 1st digit from all the (A..Z)_rank lists; the mean and std of the 2nd digit from all the (A..Z)_rank lists; the mean and std of the 3rd digit...; etc). sort eigenvalues and associated eigenvectors after using numpy.linalg.eig in python I"m using numpy.linalg.eig to obtain a list of eigenvalues and eigenvectors: A = someMatrixArray from numpy.linalg import eig as eigenValuesAndVectors solution = eigenValuesAndVectors(A) eigenValues = solution[0] eigenVectors = solution[1] I would like to sort my eigenvalues (e.g. from lowest to highest), in a way I know what is the associated eigenvector after the sorting. I"m not finding any way of doing that with python functions. 
Is there any simple way or do I have to code my sort version?

Associativity of "in" in Python?

I'm making a Python parser, and this is really confusing me:

```
>>> 1 in [] in "a"
False

>>> (1 in []) in "a"
TypeError: 'in <string>' requires string as left operand, not bool

>>> 1 in ([] in "a")
TypeError: 'in <string>' requires string as left operand, not list
```

How exactly does in work in Python, with regards to associativity, etc.? Why do no two of these expressions behave the same way?

How to calculate probability in a normal distribution given mean & standard deviation?

How to calculate probability in normal distribution given mean, std in Python? I can always explicitly code my own function according to the definition like the OP in this question did: Calculating Probability of a Random Variable in a Distribution in Python

Just wondering if there is a library function call that will allow you to do this. I imagine it would look like this:

```
nd = NormalDistribution(mu=100, std=12)
p = nd.prob(98)
```

There is a similar question in Perl: How can I compute the probability at a point given a normal distribution in Perl?. But I didn't see one in Python. Numpy has a random.normal function, but it's like sampling, not exactly what I want.

Answer #1

The Python 3 range() object doesn't produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.

The object also implements the object.__contains__ hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.
From the range() object documentation:

The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).

So at a minimum, your range() object would do:

```
class my_range:
    def __init__(self, start, stop=None, step=1, /):
        if stop is None:
            start, stop = 0, start
        self.start, self.stop, self.step = start, stop, step
        if step < 0:
            lo, hi, step = stop, start, -step
        else:
            lo, hi = start, stop
        self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1

    def __iter__(self):
        current = self.start
        if self.step < 0:
            while current > self.stop:
                yield current
                current += self.step
        else:
            while current < self.stop:
                yield current
                current += self.step

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if i < 0:
            i += self.length
        if 0 <= i < self.length:
            return self.start + i * self.step
        raise IndexError("my_range object index out of range")

    def __contains__(self, num):
        if self.step < 0:
            if not (self.stop < num <= self.start):
                return False
        else:
            if not (self.start <= num < self.stop):
                return False
        return (num - self.start) % self.step == 0
```

This is still missing several things that a real range() supports (such as the .index() or .count() methods, hashing, equality testing, or slicing), but should give you an idea.

I also simplified the __contains__ implementation to only focus on integer tests; if you give a real range() object a non-integer value (including subclasses of int), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.
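To see the same arithmetic at work on the built-in type, here is a small sketch. The bounds and step are arbitrary values picked for illustration, and it assumes a 64-bit CPython build so that len() of the range fits in a machine-sized integer.

```python
big = range(0, 10**18, 7)

# Membership is decided from start, stop and step alone, so both tests
# return immediately even though the range describes ~1.4e17 values.
hit = 7 * 10**15 in big        # a multiple of the step, inside the bounds
miss = 7 * 10**15 + 1 in big   # inside the bounds, but not on the step grid

# Length and indexing are computed the same way, without creating the items.
size = len(big)
last = big[-1]

print(hit, miss, size, last)
```

Nothing here iterates over the range; each result follows from a bounds check plus a modulo or multiplication, which is what the __contains__ and __getitem__ methods of the sketch class above do as well.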
* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this a O(log N) operation. Since it's all executed in optimised C code and Python stores integer values in 30-bit chunks, you'd run out of memory before you saw any performance impact due to the size of the integers involved here.

Answer #2

You have four main options for converting types in pandas:

• to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
• astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
• infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
• convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.

1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric(). This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

```
>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"])  # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s)  # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64
```

As you can see, a new Series is returned.
Remember to assign this output to a variable or column name to continue using it:

```
# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
```

You can also use it to convert multiple columns of a DataFrame via the apply() method:

```
# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
```

As long as your values can all be converted, that's probably all you need.

Error handling

But what if some values can't be converted to a numeric type? to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

```
>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
```

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string "pandas":

```
>>> pd.to_numeric(s)  # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string
```

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

```
>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
```

The third option for errors is just to ignore the operation if an invalid value is encountered:

```
>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched
```

This last option is particularly useful when you want to convert your entire DataFrame, but don't know which of your columns can be converted reliably to a numeric type. In that case just write:

```
df.apply(pd.to_numeric, errors="ignore")
```

The function will be applied to each column of the DataFrame.
Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.

Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to either "integer", "signed", "unsigned", or "float". Here's an example for a simple series s of integer type:

```
>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
```

Downcasting to "integer" uses the smallest possible integer that can hold the values:

```
>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8
```

Downcasting to "float" similarly picks a smaller than normal floating type:

```
>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32
```

2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

```
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")
```

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error.
For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors="ignore". Your original object will be returned untouched.

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

```
>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
```

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

```
>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8
```

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!

Trying to downcast using pd.to_numeric(s, downcast="unsigned") instead could help prevent this error.

3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

```
>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object
```

Using infer_objects(), you can change the type of column "a" to int64:

```
>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object
```

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.

4. convert_dtypes()

Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.

Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, a column of NumPy int32 values will become the pandas dtype Int32.
With our object DataFrame df, we get the following result:

```
>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object
```

Since column "a" held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).

Column "b" contained string objects, so was changed to pandas' string dtype.

By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:

```
>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object
```

Now column "a" remained an object column: pandas knows it can be described as an "integer" column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have so did not convert it. Column "b" was again converted to "string" dtype as it was recognised as holding "string" values.

Answer #3

Since this question was asked in 2010, there has been real simplification in how to do simple multithreading with Python with map and pool.

The code below comes from an article/blog post that you should definitely check out (no affiliation) - Parallelism in one line: A Better Model for Day to Day Threading Tasks. I'll summarize below - it ends up being just a few lines of code:

```
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)
```

Which is the multithreaded version of:

```
results = []
for item in my_array:
    results.append(my_function(item))
```

Description

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence.

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.
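A runnable sketch of that equivalence, assuming Python 3: `square` and `numbers` are hypothetical stand-ins for `my_function` and `my_array`, and the pool size of 4 simply mirrors the snippet above.

```python
from multiprocessing.dummy import Pool as ThreadPool


def square(n):
    # Stand-in for real work (hypothetical; any one-argument function works).
    return n * n


numbers = [1, 2, 3, 4, 5]

# Sequential version: iterate, apply the function, collect the results.
sequential = [square(n) for n in numbers]

# Threaded version: pool.map does the same across worker threads and
# returns the results in input order.
with ThreadPool(4) as pool:
    threaded = pool.map(square, numbers)

print(threaded)
```

For a CPU-bound function like this one, threads bring no speed-up; the point is only that the two forms produce the same list, so swapping one for the other is mechanical when the work is I/O-bound.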
Implementation Parallel versions of the map function are provided by two libraries:multiprocessing, and also its little known, but equally fantastic step child:multiprocessing.dummy. multiprocessing.dummy is exactly the same as multiprocessing module, but uses threads instead (an important distinction - use multiple processes for CPU-intensive tasks; threads for (and during) I/O): multiprocessing.dummy replicates the API of multiprocessing, but is no more than a wrapper around the threading module. import urllib2 from multiprocessing.dummy import Pool as ThreadPool urls = [ "http://www.python.org", "http://www.python.org/about/", "http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html", "http://www.python.org/doc/", "http://www.python.org/download/", "http://www.python.org/getit/", "http://www.python.org/community/", "https://wiki.python.org/moin/", ] # Make the Pool of workers pool = ThreadPool(4) # Open the URLs in their own threads # and return the results results = pool.map(urllib2.urlopen, urls) # Close the pool and wait for the work to finish pool.close() pool.join() And the timing results: Single thread: 14.4 seconds 4 Pool: 3.1 seconds 8 Pool: 1.4 seconds 13 Pool: 1.3 seconds Passing multiple arguments (works like this only in Python 3.3 and later): To pass multiple arrays: results = pool.starmap(function, zip(list_a, list_b)) Or to pass a constant and an array: results = pool.starmap(function, zip(itertools.repeat(constant), list_a)) If you are using an earlier version of Python, you can pass multiple arguments via this workaround). (Thanks to user136036 for the helpful comment.) Answer #4 How to iterate over rows in a DataFrame in Pandas? Answer: DON"T*! Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting. Do you want to print a DataFrame? 
Use DataFrame.to_string(). Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla for loop)
4. DataFrame.apply(): i) Reductions that can be performed in Cython, ii) Iteration in Python space
5. DataFrame.itertuples() and iteritems()
6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.

Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem. If none exists, feel free to write your own using custom Cython extensions.
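As a small illustration of what "vectorised" means here, the sketch below computes the same column sum twice: once row by row with iterrows(), once as a single column expression. The DataFrame, its column names A and B, and the values are made up for the example, and pandas is assumed to be installed.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Row-by-row: every iteration builds a Series for the row in Python space.
looped = []
for _, row in df.iterrows():
    looped.append(row["A"] + row["B"])

# Vectorised: one expression, the loop runs inside NumPy.
vectorised = df["A"] + df["B"]

print(looped)
print(vectorised.tolist())
```

Both approaches give the same numbers; the difference is where the loop runs, which is what starts to matter once the frame has more than a few thousand rows.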
Next Best Thing: List Comprehensions* List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks. The formula is simple, # Iterating over one column - `f` is some function that processes your data result = [f(x) for x in df["col"]] # Iterating over two columns, use `zip` result = [f(x, y) for x, y in zip(df["col1"], df["col2"])] # Iterating over multiple columns - same data type result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()] # Iterating over multiple columns - differing data type result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])] If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code. Caveats List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic. When dealing with mixed data types you should iterate over zip(df["A"], df["B"], ...) instead of df[["A", "B"]].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this. 
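The upcasting caveat can be checked directly. This sketch uses an integer column and a float column (a simpler pairing than the numeric/string case mentioned above, chosen because the effect is easy to verify); the column names and values are invented, and pandas is assumed to be installed.

```python
import pandas as pd

# One integer column, one float column (toy data for illustration).
df = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})

# to_numpy() on several columns builds one 2-D array, so both columns
# must share a dtype: the integers in A come back as floats.
from_numpy = [(row[0], row[1]) for row in df[["A", "B"]].to_numpy()]

# zip walks each column separately, so A keeps its integer values.
from_zip = [(a, b) for a, b in zip(df["A"], df["B"])]

print(df[["A", "B"]].to_numpy().dtype)
print(from_numpy)
print(from_zip)
```

If the function you call in the comprehension cares about types (for example, it uses a value as an index), the zip form is the safer default.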
*Your mileage may vary for the reasons outlined in the Caveats section above.

An Obvious Example

Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn't always this cut and dried. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

Further Reading

• 10 Minutes to pandas, and Essential Basic Functionality - Useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.
• Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations
• Are for-loops in pandas really bad? When should I care? - a detailed writeup by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data)
• When should I (not) want to use pandas apply() in my code? - apply is slow (but not as slow as the iter* family). There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?".
Showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do. The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library. Answer #5 In Python, what is the purpose of __slots__ and what are the cases one should avoid this? TLDR: The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results: faster attribute access. space savings in memory. The space savings is from Storing value references in slots instead of __dict__. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__. Quick Caveats Small caveat, you should only declare a particular slot one time in an inheritance tree. For example: class Base: __slots__ = "foo", "bar" class Right(Base): __slots__ = "baz", class Wrong(Base): __slots__ = "foo", "bar", "baz" # redundant foo and bar Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. 
Python 3.8: >>> from sys import getsizeof >>> getsizeof(Right()), getsizeof(Wrong()) (56, 72) This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could: >>> w = Wrong() >>> w.foo = "foo" >>> Base.foo.__get__(w) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: foo >>> Wrong.foo.__get__(w) "foo" The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined. To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library). See section on multiple inheritance below for an example. Requirements: To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2). To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry. There are a lot of details if you wish to keep reading. Why use __slots__: Faster attribute access. The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access. It is trivial to demonstrate measurably significant faster access: import timeit class Foo(object): __slots__ = "foo", class Bar(object): pass slotted = Foo() not_slotted = Bar() def get_set_delete_fn(obj): def get_set_delete(): obj.foo = "foo" obj.foo del obj.foo return get_set_delete and >>> min(timeit.repeat(get_set_delete_fn(slotted))) 0.2846834529991611 >>> min(timeit.repeat(get_set_delete_fn(not_slotted))) 0.3664822799983085 The slotted access is almost 30% faster in Python 3.5 on Ubuntu. 
>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up. My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

         Python 2.7                Python 3.6
attrs    __slots__   __dict__*     __slots__   __dict__*
none     16          56 + 272†     16          56 + 112†
one      48          56 + 272      48          56 + 112
two      56          56 + 272      56          56 + 112
six      88          56 + 1040     88          56 + 152
11       128         56 + 1040     128         56 + 240
22       216         56 + 3344     216         56 + 408
43       384         56 + 3344     384         56 + 752

* no slots defined
† if __dict__ referenced

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.
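The per-instance comparison can be reproduced on any current interpreter with sys.getsizeof alone. This is a rough sketch: the class names `Plain` and `Slotted` are invented for the example, the exact byte counts vary between Python versions, and getsizeof does not count the attribute values themselves.

```python
import sys


class Plain:
    """Regular class: attributes live in a per-instance __dict__."""

    def __init__(self):
        self.a = 1
        self.b = 2


class Slotted:
    """Same attributes, stored in slots instead of a __dict__."""

    __slots__ = ("a", "b")

    def __init__(self):
        self.a = 1
        self.b = 2


plain, slotted = Plain(), Slotted()

# For the plain instance, add the size of its attribute dictionary,
# since getsizeof(plain) does not include it.
plain_total = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_total = sys.getsizeof(slotted)

print("plain:  ", plain_total, "bytes")
print("slotted:", slotted_total, "bytes")
```

The absolute numbers will differ from the table above on newer interpreters, but the slotted instance should come out smaller.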
Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members". >>> Foo.foo <member "foo" of "Foo" objects> >>> type(Foo.foo) <class "member_descriptor"> >>> getsizeof(Foo.foo) 72 Demonstration of __slots__: To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit: class Base(object): __slots__ = () now: >>> b = Base() >>> b.a = "a" Traceback (most recent call last): File "<pyshell#38>", line 1, in <module> b.a = "a" AttributeError: "Base" object has no attribute "a" Or subclass another class that defines __slots__ class Child(Base): __slots__ = ("a",) and now: c = Child() c.a = "a" but: >>> c.b = "b" Traceback (most recent call last): File "<pyshell#42>", line 1, in <module> c.b = "b" AttributeError: "Child" object has no attribute "b" To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes): class SlottedWithDict(Child): __slots__ = ("__dict__", "b") swd = SlottedWithDict() swd.a = "a" swd.b = "b" swd.c = "c" and >>> swd.__dict__ {"c": "c"} Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__: class NoSlots(Child): pass ns = NoSlots() ns.a = "a" ns.b = "b" And: >>> ns.__dict__ {"b": "b"} However, __slots__ may cause problems for multiple inheritance: class BaseA(object): __slots__ = ("a",) class BaseB(object): __slots__ = ("b",) Because creating a child class from parents with both non-empty slots fails: >>> class Child(BaseA, BaseB): __slots__ = () Traceback (most recent call last): File "<pyshell#68>", line 1, in <module> class Child(BaseA, BaseB): __slots__ = () TypeError: Error when 
calling the metaclass bases multiple bases have instance lay-out conflict If you run into this problem, You could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions: from abc import ABC class AbstractA(ABC): __slots__ = () class BaseA(AbstractA): __slots__ = ("a",) class AbstractB(ABC): __slots__ = () class BaseB(AbstractB): __slots__ = ("b",) class Child(AbstractA, AbstractB): __slots__ = ("a", "b") c = Child() # no problem! Add "__dict__" to __slots__ to get dynamic assignment: class Foo(object): __slots__ = "bar", "baz", "__dict__" and now: >>> foo = Foo() >>> foo.boink = "boink" So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect. When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__. Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required. You can similarly add __weakref__ to __slots__ explicitly if you need that feature. 
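The __weakref__ point has no example above, so here is a minimal sketch of it. The class names `NoWeakref` and `WithWeakref` are made up for illustration; the behaviour shown (a TypeError when a slotted class has no __weakref__ slot) is standard CPython behaviour.

```python
import weakref


class NoWeakref:
    # No "__weakref__" slot, so instances cannot be weakly referenced.
    __slots__ = ("value",)


class WithWeakref:
    # Listing "__weakref__" explicitly restores weak-reference support.
    __slots__ = ("value", "__weakref__")


try:
    weakref.ref(NoWeakref())
    no_weakref_ok = True
except TypeError:
    no_weakref_ok = False

target = WithWeakref()
ref = weakref.ref(target)
with_weakref_ok = ref() is target

print(no_weakref_ok, with_weakref_ok)
```

This matters mostly when instances are stored in a WeakValueDictionary or WeakSet, or when a library you use takes weak references to your objects.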
Set to empty tuple when subclassing a namedtuple: The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them: from collections import namedtuple class MyNT(namedtuple("MyNT", "bar baz")): """MyNT is an immutable and lightweight object""" __slots__ = () usage: >>> nt = MyNT("bar", "baz") >>> nt.bar "bar" >>> nt.baz "baz" And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__: >>> nt.quux = "quux" Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: "MyNT" object has no attribute "quux" You can allow __dict__ creation by leaving off __slots__ = (), but you can"t use non-empty __slots__ with subtypes of tuple. Biggest Caveat: Multiple inheritance Even when non-empty slots are the same for multiple parents, they cannot be used together: class Foo(object): __slots__ = "foo", "bar" class Bar(object): __slots__ = "foo", "bar" # alas, would work if empty, i.e. () >>> class Baz(Foo, Bar): pass Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: Error when calling the metaclass bases multiple bases have instance lay-out conflict Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__: class Foo(object): __slots__ = () class Bar(object): __slots__ = () class Baz(Foo, Bar): __slots__ = ("foo", "bar") b = Baz() b.foo, b.bar = "foo", "bar" You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems. 
Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers. To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance class AbstractBase: __slots__ = () def __init__(self, a, b): self.a = a self.b = b def __repr__(self): return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})" We could use the above directly by inheriting and declaring the expected slots: class Foo(AbstractBase): __slots__ = "a", "b" But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute: class AbstractBaseC: __slots__ = () @property def c(self): print("getting c!") return self._c @c.setter def c(self, arg): print("setting c!") self._c = arg Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong): class Concretion(AbstractBase, AbstractBaseC): __slots__ = "a b _c".split() And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation: >>> c = Concretion("a", "b") >>> c.c = c setting c! >>> c.c getting c! Concretion("a", "b") >>> c.d = "d" Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: "Concretion" object has no attribute "d" Other cases to avoid slots: Avoid them when you want to perform __class__ assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.) Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them. 
Avoid them if you insist on providing default values via class attributes for instance variables. You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to. Critiques of other answers The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways. Do not "only use __slots__ when instantiating lots of objects" I quote: "You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class." Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them. Why? If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes. __slots__ contributes to reusability when creating interfaces or mixins. It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable. __slots__ doesn"t break pickling When pickling a slotted object, you may find it complains with a misleading TypeError: >>> pickle.loads(pickle.dumps(f)) TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4. >>> pickle.loads(pickle.dumps(f, -1)) <__main__.Foo object at 0x1129C770> in Python 2.7: >>> pickle.loads(pickle.dumps(f, 2)) <__main__.Foo object at 0x1129C770> in Python 3.6 >>> pickle.loads(pickle.dumps(f, 4)) <__main__.Foo object at 0x1129C770> So I would keep this in mind, as it is a solved problem. Critique of the (until Oct 2, 2016) accepted answer The first paragraph is half short explanation, half predictive. 
Here"s the only part that actually answers the question The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots The second half is wishful thinking, and off the mark: While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object. Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous. The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.): They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies. It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__. The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site. Memory usage evidence Create some normal objects and slotted objects: >>> class Foo(object): pass >>> class Bar(object): __slots__ = () Instantiate a million of them: >>> foos = [Foo() for f in xrange(1000000)] >>> bars = [Bar() for b in xrange(1000000)] Inspect with guppy.hpy().heap(): >>> guppy.hpy().heap() Partition of a set of 2028259 objects. Total size = 99763360 bytes. 
Index Count % Size % Cumulative % Kind (class / dict of class) 0 1000000 49 64000000 64 64000000 64 __main__.Foo 1 169 0 16281480 16 80281480 80 list 2 1000000 49 16000000 16 96281480 97 __main__.Bar 3 12284 1 987472 1 97268952 97 str ... Access the regular objects and their __dict__ and inspect again: >>> for f in foos: ... f.__dict__ >>> guppy.hpy().heap() Partition of a set of 3028258 objects. Total size = 379763480 bytes. Index Count % Size % Cumulative % Kind (class / dict of class) 0 1000000 33 280000000 74 280000000 74 dict of __main__.Foo 1 1000000 33 64000000 17 344000000 91 __main__.Foo 2 169 0 16281480 4 360281480 95 list 3 1000000 33 16000000 4 376281480 99 __main__.Bar 4 12284 0 987472 0 377268952 99 str ... This is consistent with the history of Python, from Unifying types and classes in Python 2.2 If you subclass a built-in type, extra space is automatically added to the instances to accomodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "__slots__ = []" to your class. 
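The measurements above use guppy on Python 2. A comparable check on Python 3 can be sketched with the standard-library tracemalloc module instead, which is a substitution on my part rather than the tool used above. The class names, the helper `allocated_bytes`, and the instance count are all invented for this example, and the absolute numbers depend on the interpreter version.

```python
import tracemalloc


class Plain:
    def __init__(self, x):
        self.x = x


class Slotted:
    __slots__ = ("x",)

    def __init__(self, x):
        self.x = x


def allocated_bytes(cls, count=100_000):
    """Return the bytes still allocated after creating `count` instances."""
    tracemalloc.start()
    instances = [cls(i) for i in range(count)]
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del instances
    return current


plain_bytes = allocated_bytes(Plain)
slotted_bytes = allocated_bytes(Slotted)

print("plain:  ", plain_bytes)
print("slotted:", slotted_bytes)
```

Both runs allocate the same list and the same integers, so the difference between the two totals is the per-instance overhead that __slots__ removes.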
Answer #6

os.listdir() - list in the current directory

With listdir in the os module you get the files and the folders in the current dir:

import os
arr = os.listdir()
print(arr)
>>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Looking in a directory

# raw string, so that "\f" is not read as a form-feed escape
arr = os.listdir(r"c:\files")

glob from glob

With glob you can specify a type of file to list, like this:

import glob
txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)

glob in a list comprehension

mylist = [f for f in glob.glob("*.txt")]

Get the full path of only files in the current directory

import os
cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles)
>>> ["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]

Getting the full path name with os.path.abspath

You get the full path in return:

import os
files_path = [os.path.abspath(x) for x in os.listdir()]
print(files_path)
>>> ["F:\documentiapplications.txt", "F:\documenticollections.txt"]

Walk: going through sub directories

os.walk returns the root, the directories list and the files list; that is why I unpacked them into r, d, f in the for loop. It then looks for other files and directories in the subfolders of the root, and so on until there are no subfolders.

import os
# Getting the current work directory (cwd)
thisdir = os.getcwd()
# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if file.endswith(".docx"):
            print(os.path.join(r, file))

os.listdir(): get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.
import os
arr = os.listdir(".")
print(arr)
>>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

To go up in the directory tree

# Method 1: the parent directory
x = os.listdir("..")
# Method 2: the root of the file system
x = os.listdir("/")

Get files: os.listdir() in a particular directory (Python 2 and 3)

import os
arr = os.listdir(r"F:\python")
print(arr)
>>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Get files of a particular subdirectory with os.listdir()

import os
x = os.listdir("./content")

os.walk(".") - current directory

import os
arr = next(os.walk("."))[2]
print(arr)
>>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]

next(os.walk(".")) and os.path.join("dir", "file")

import os
arr = []
# next() returns one (root, dirs, files) tuple for the top directory
r, d, f = next(os.walk(r"F:\_python"))
for file in f:
    arr.append(os.path.join(r, file))
for f in arr:
    print(f)
>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt

next(os.walk()) - get the full path - list comprehension

# wrap the single tuple in a list so it can be unpacked in the comprehension
[os.path.join(r, file) for r, d, f in [next(os.walk(r"F:\_python"))] for file in f]
>>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]

os.walk - get full path - all files in sub dirs

x = [os.path.join(r, file) for r, d, f in os.walk(r"F:\_python") for file in f]
print(x)
>>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]

os.listdir() - get only txt files

arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
print(arr_txt)
>>> ["work.txt", "3ebooks.txt"]

Using glob to get the full path of the files

If I should need the absolute path of the files:

import os
from glob import glob
x = [os.path.abspath(f) for f in glob(r"F:\*.txt")]
for f in x:
    print(f)
>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt

Using os.path.isfile to avoid directories in the list

import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)
>>> ["a simple game.py", "data.txt", "decorator.py"]

Using pathlib from Python 3.4

import pathlib
flist = []
for p in pathlib.Path(".").iterdir():
    if p.is_file():
        print(p)
        flist.append(p)
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speak_gui2.py
>>> thumb.PNG

With list comprehension:

flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]

Alternatively, use pathlib.Path() instead of pathlib.Path(".")

Use glob method in pathlib.Path()

import pathlib
py = pathlib.Path().glob("*.py")
for file in py:
    print(file)
>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py

Get all and only files with os.walk

import os
x = [i[2] for i in os.walk(".")]
y = []
for t in x:
    for f in t:
        y.append(f)
print(y)
>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]

Get only files with next and walk in a directory

import os
x = next(os.walk("F://python"))[2]
print(x)
>>> ["calculator.bat","calculator.py"]

Get only directories with next and walk in a directory

import os
next(os.walk("F://python"))[1] # for the current dir use (".")
>>> ["python3","others"]

Get all the subdir names with walk

for r, d, f in os.walk(r"F:\_python"):
    for dirs in d:
        print(dirs)
>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints

os.scandir() from Python 3.5 and greater

import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)
>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.
import os
with os.scandir() as i:
    for entry in i:
        if entry.is_file():
            print(entry.name)
>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG

Examples:

Ex. 1: How many files are there in the subdirectories?
In this example, we look for the number of files that are included in all the directory and its subdirectories. import os def count(dir, counter=0): "returns number of files in dir and subdirs" for pack in os.walk(dir): for f in pack[2]: counter += 1 return dir + " : " + str(counter) + "files" print(count("F:\python")) >>> "F:\python" : 12057 files" Ex.2: How to copy all files from a directory to another? A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder. import os import shutil from path import path destination = "F:\file_copied" # os.makedirs(destination) def copyfile(dir, filetype="pptx", counter=0): "Searches for pptx (or other - pptx is the default) files and copies them" for pack in os.walk(dir): for f in pack[2]: if f.endswith(filetype): fullpath = pack[0] + "\" + f print(fullpath) shutil.copy(fullpath, destination) counter += 1 if counter > 0: print("-" * 30) print(" ==> Found in: `" + dir + "` : " + str(counter) + " files ") for dir in os.listdir(): "searches for folders that starts with `_`" if dir[0] == "_": # copyfile(dir, filetype="pdf") copyfile(dir, filetype="txt") >>> _compiti18Compito Contabilit√† 1conti.txt >>> _compiti18Compito Contabilit√† 1modula4.txt >>> _compiti18Compito Contabilit√† 1moduloa4.txt >>> ------------------------ >>> ==> Found in: `_compiti18` : 3 files Ex. 3: How to get all the files in a txt file In case you want to create a txt file with all the file names: import os mylist = "" with open("filelist.txt", "w", encoding="utf-8") as file: for eachfile in os.listdir(): mylist += eachfile + " " file.write(mylist) Example: txt with all the files of an hard drive """ We are going to save a txt file with all the files in your directory. 
We will use the function walk() """

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
    for root, dirs, files in os.walk("D:\\"):
        for file in files:
            listafile.append(file)
            percorso.append(root + "\\" + file)
            testo.write(file + "\n")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
    for file in listafile:
        testo_ordinato.write(file + "\n")
with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
    for file in percorso:
        file_percorso.write(file + "\n")
os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")

All the files of C: in one text file

This is a shorter version of the previous code. Change the starting folder if you need to begin the search somewhere else. On my computer this code generates a text file of about 50 MB with just under 500,000 lines, each holding a file name with its complete path.

import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
    for r, d, f in os.walk("C:\\"):
        for file in f:
            # os.path.join inserts the separator that plain r + file would omit
            filewrite.write(os.path.join(r, file) + "\n")

How to write a file with all the paths of one file type in a folder

With this function you can create a txt file, named after the file type you are looking for (e.g. pngfile.txt), that lists the full path of every file of that type. It can be useful sometimes, I think.
import os

def searchfiles(extension=".ttf", folder="H:\\"):
    "Create a txt file with all the files of a type"
    with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    # os.path.join inserts the separator that plain r + file would omit
                    filewrite.write(os.path.join(r, file) + "\n")

# looking for png files in the hard disk H:
searchfiles(".png", "H:\\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png

(New) Find all files and open them with a tkinter GUI

In 2019 I just wanted to add a little app that searches for all files in a dir and lets you open them by double-clicking the file name in the list.

import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\\"):
    "insert all files in the listbox"
    for r, d, f in os.walk(folder):
        for file in f:
            if file.endswith(extension):
                lb.insert(0, r + "\\" + file)

def open_file():
    # os.startfile is available on Windows only
    os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda: searchfiles(".png", "H:\\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()

Answer #7

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
  merging with different column names
  merging with multiple columns
  avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note: Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified. Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard. Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

Enough talk - just show me how to use merge!

Setup & Basics

import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by [figure: INNER JOIN]

Note: This, along with the forthcoming figures, follows this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on="key")
# Or, if you want to be explicit
# left.merge(right, on="key", how="inner")

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D").

A LEFT OUTER JOIN, or LEFT JOIN, is represented by [figure: LEFT JOIN]

This can be performed by specifying how="left".

left.merge(right, on="key", how="left")

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how="left", then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, which is [figure: RIGHT JOIN], specify how="right":

left.merge(right, on="key", how="right")

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by [figure: FULL OUTER JOIN], specify how="outer".

left.merge(right, on="key", how="outer")

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely: [figure: summary of merge types]

Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.

For a LEFT-Excluding JOIN, represented as [figure: LEFT-Excluding JOIN], start by performing a LEFT OUTER JOIN and then keep only the rows that come from left alone (excluding every row that has a match in right):

(left.merge(right, on="key", how="left", indicator=True)
     .query('_merge == "left_only"')
     .drop(columns="_merge"))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

left.merge(right, on="key", how="left", indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on="key", how="right", indicator=True)
     .query('_merge == "right_only"')
     .drop(columns="_merge"))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion:

(left.merge(right, on="key", how="outer", indicator=True)
     .query('_merge != "both"')
     .drop(columns="_merge"))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

Different names for key columns

If the key columns are named differently, for
example, left has keyLeft, and right has keyRight instead of key, then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({"key": "keyLeft"}, axis=1)
right2 = right.rename({"key": "keyRight"}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index("keyLeft")
left3.merge(right2, left_index=True, right_on="keyRight")

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")), and you'll notice keyLeft is missing. You can figure out which column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3

  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[["key", "newcol"]], on="key")

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left["newcol"] = left["key"].map(right3.set_index("key")["newcol"])
left.assign(newcol=left["key"].map(right3.set_index("key")["newcol"]))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[["key", "newcol"]], on="key", how="left")

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=["key1", "key2"] ...)

Or, in the event the names are different,

left.merge(right, left_on=["lkey1", "lkey2"], right_on=["rkey1", "rkey2"])

Other useful merge* operations and functions

Merging a DataFrame with a Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge_asOf) is useful for approximate joins.

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
* You are here.

Answer #8

tl;dr / quick fix

Don't decode/encode willy-nilly
Don't assume your strings are UTF-8 encoded
Try to convert strings to Unicode strings as soon as possible in your code
Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
Don't be tempted to use quick reload hacks

Unicode Zen in Python 2.x - The Long Version

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u"my ünicôdé strįng"
>>> type(my_u)
<type 'unicode'>

Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.
Gotchas

Conversion from str to Unicode can happen even when you don't explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode("€")

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("€")

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "€"

# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "€"

Examples

In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value (it's no coincidence)). The correct decode() is invoked and conversion to a Python Unicode is successful: [diagram]

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception: [diagram]

The Unicode Sandwich

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u"Zürich"

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.
Files

Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

import io

with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

CSV Files

The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

from backports import csv
import io

with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    for row in csv.reader(my_file):
        print(row)  # process each row here (inside a generator function you could yield it)

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add:

charset="utf8", use_unicode=True

E.g.

>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")

PostgreSQL

Add:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

The meat of the sandwich

Work with Unicodes as you would normal strs.
Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.

Files

Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode capable than Python 2.x is; however, it is slightly less confused on the topic. E.g. the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, open() operates in text mode by default, so it returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

Why you shouldn't use sys.setdefaultencoding("utf8")

It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details.

Answer #9

Clear the cache directory where appropriate for your system

Linux and Unix

~/.cache/pip  # and it respects the XDG_CACHE_HOME directory.
OS X

~/Library/Caches/pip

Windows

%LocalAppData%\pip\Cache

UPDATE

With pip 20.1 or later, you can find the full path for your operating system easily by typing this in the command line:

pip cache dir

Example output on my Ubuntu installation:

➜ pip3 cache dir
/home/tawanda/.cache/pip

Answer #10

Calculate timestamps within your DB, not your client

For sanity, you probably want to have all datetimes calculated by your DB server, rather than the application server. Calculating the timestamp in the application can lead to problems because network latency is variable, clients experience slightly different clock drift, and different programming languages occasionally calculate time slightly differently.

SQLAlchemy allows you to do this by passing func.now() or func.current_timestamp() (they are aliases of each other), which tells the DB to calculate the timestamp itself.

Use SQLAlchemy's server_default

Additionally, for a default where you're already telling the DB to calculate the value, it's generally better to use server_default instead of default. This tells SQLAlchemy to pass the default value as part of the CREATE TABLE statement.

For example, if you write an ad hoc script against this table, using server_default means you won't need to worry about manually adding a timestamp call to your script - the database will set it automatically.

Understanding SQLAlchemy's onupdate/server_onupdate

SQLAlchemy also supports onupdate so that any time the row is updated it inserts a new timestamp. Again, it is best to tell the DB to calculate the timestamp itself:

from sqlalchemy import Column, DateTime
from sqlalchemy.sql import func

# column definitions inside a declarative model class
time_created = Column(DateTime(timezone=True), server_default=func.now())
time_updated = Column(DateTime(timezone=True), onupdate=func.now())

There is a server_onupdate parameter, but unlike server_default, it doesn't actually set anything server-side.
It just tells SQLAlchemy that your database will change the column when an update happens (perhaps you created a trigger on the column), so SQLAlchemy will ask for the return value so it can update the corresponding object.

One other potential gotcha:

You might be surprised to notice that if you make a bunch of changes within a single transaction, they all have the same timestamp. That's because the SQL standard specifies that CURRENT_TIMESTAMP returns values based on the start of the transaction.

PostgreSQL provides the non-SQL-standard statement_timestamp() and clock_timestamp(), which do change within a transaction. Docs here: https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-CURRENT

UTC timestamp

If you want to use UTC timestamps, a stub of an implementation for func.utcnow() is provided in the SQLAlchemy documentation. You need to provide the appropriate driver-specific functions on your own, though. ```
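To make the idea of a server-side default from Answer #10 concrete, here is a minimal sketch that uses the standard-library sqlite3 module instead of SQLAlchemy. It only illustrates the SQL-level effect that server_default relies on: the default lives in the CREATE TABLE statement, so an insert that never mentions the timestamp still receives one from the database. The table name events, its columns, and the helper insert_without_timestamp are hypothetical, and SQLite is assumed here only because it needs no server; your real database and driver will differ.

```python
import sqlite3


def insert_without_timestamp(name):
    """Insert a row without supplying a timestamp and return what the DB stored.

    Hypothetical helper for illustration; the schema is an assumption.
    """
    conn = sqlite3.connect(":memory:")
    # The default is declared in the DDL, which is the role server_default
    # plays when SQLAlchemy emits CREATE TABLE.
    conn.execute(
        "CREATE TABLE events ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " time_created TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    # The client sends only the name; the database fills in time_created.
    conn.execute("INSERT INTO events (name) VALUES (?)", (name,))
    conn.commit()
    row = conn.execute("SELECT name, time_created FROM events").fetchone()
    conn.close()
    return row


name, time_created = insert_without_timestamp("ad hoc insert")
print(name, time_created)  # the timestamp is a UTC string chosen by SQLite
```

Because the value is produced by the database, any other script or tool that inserts into the same table gets a consistent timestamp without extra code, which is the motivation given above for preferring server_default over a client-side default.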
``` © 2022 Python.Engineering ```