# Python sympy | Matrix.diagonalize() method

Using the `sympy.Matrix.diagonalize()` method we can diagonalize a matrix. `diagonalize()` returns a tuple `(P, D)`, where `D` is a diagonal matrix and `M = P * D * P**-1`.

Syntax: `Matrix().diagonalize()`

Returns: A tuple of two matrices, where the second element is the diagonal matrix.

Example #1:

    # import sympy
    from sympy import *

    M = Matrix([[3, -2, 4, -2],
                [5, 3, -3, -2],
                [5, -2, 2, -2],
                [5, -2, -3, 3]])

    print("Matrix: {}".format(M))

    # Use the sympy.Matrix.diagonalize() method
    P, D = M.diagonalize()

    print("Diagonal of a matrix: {}".format(D))

Output:

    Matrix: Matrix([[3, -2, 4, -2], [5, 3, -3, -2], [5, -2, 2, -2], [5, -2, -3, 3]])
    Diagonal of a matrix: Matrix([[-2, 0, 0, 0], [0, 3, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]])

Example #2:

    # import sympy
    from sympy import *

    M = Matrix([[1, -3, 3],
                [3, -5, 3],
                [6, -6, 4]])

    print("Matrix: {}".format(M))

    # Use the sympy.Matrix.diagonalize() method
    P, D = M.diagonalize()

    print("Diagonal of a matrix: {}".format(D))

Output:

    Matrix: Matrix([[1, -3, 3], [3, -5, 3], [6, -6, 4]])
    Diagonal of a matrix: Matrix([[-2, 0, 0], [0, -2, 0], [0, 0, 4]])
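A quick way to check a diagonalization is to confirm that `P * D * P**-1` rebuilds the original matrix. The sketch below does this for the matrix from the second example. The matrix `J` is an assumption added purely for illustration (it is not part of the examples above) and shows `is_diagonalizable()` reporting a matrix that cannot be diagonalized.

```python
from sympy import Matrix

M = Matrix([[1, -3, 3], [3, -5, 3], [6, -6, 4]])
P, D = M.diagonalize()

# D holds the eigenvalues on its diagonal; P holds matching eigenvectors as columns.
eigenvalues = sorted(D[i, i] for i in range(D.rows))
reconstructed = P * D * P.inv()

print(eigenvalues)
print(reconstructed == M)

# Illustrative assumption: a Jordan block has too few eigenvectors to diagonalize.
J = Matrix([[1, 1], [0, 1]])
print(J.is_diagonalizable())
```

Checking `is_diagonalizable()` first is a cheap way to avoid the error that `diagonalize()` raises for matrices such as `J`.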

## What's the best way to generate a UML diagram from Python source code?

### Question by Mike Pirnat

A colleague is looking to generate UML class diagrams from heaps of Python source code. He's primarily interested in the inheritance relationships, and mildly interested in compositional relationships, and doesn't care much about class attributes that are just Python primitives.

The source code is pretty straightforward and not tremendously evil--it doesn't do any fancy metaclass magic, for example. (It's mostly from the days of Python 1.5.2, with some sprinklings of "modern" 2.3ish stuff.)

What's the best existing solution to recommend?

I noticed that every now and then I need to Google fopen all over again, just to build a mental image of what the primary differences between the modes are. So, I thought a diagram will be faster to read next time. Maybe someone else will find that helpful too.

# tl;dr / quick fix

• Don't decode/encode willy nilly
• Don't assume your strings are UTF-8 encoded
• Try to convert strings to Unicode strings as soon as possible in your code
• Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
• Don't be tempted to use quick `reload` hacks

# Unicode Zen in Python 2.x - The Long Version

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

`UnicodeDecodeError: 'ascii' codec can't decode byte` generally happens when you try to convert a Python 2.x `str` that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use `unicode()` (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.

Unicode strings can be declared in your code using the `u` prefix to strings. E.g.

``````>>> my_u = u"my ünicôdé strįng"
>>> type(my_u)
<type 'unicode'>
``````

Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

# Gotchas

Conversion from `str` to Unicode can happen even when you don't explicitly call `unicode()`.

The following scenarios cause `UnicodeDecodeError` exceptions:

``````# Explicit conversion without encoding
unicode("€")

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("€")

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "€"

# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "€"
``````

## Examples

In the following diagram, you can see how the word `café` has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, `caf` is just regular ASCII. In UTF-8, `é` is encoded using two bytes. In "Cp1252", `é` is 0xE9 (which also happens to be its Unicode code point value; this is no coincidence). The correct `decode()` is invoked and conversion to a Python Unicode string is successful:

In this diagram, `decode()` is called with `ascii` (which is the same as calling `unicode()` without an encoding given). As ASCII can't contain bytes greater than `0x7F`, this will throw a `UnicodeDecodeError` exception:
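The two cases described above can be reproduced in a few lines. This is a minimal sketch that assumes Python 3 (where `bytes` plays the role of the Python 2 `str`) and a UTF-8 encoded source file. The variable names are illustrative only.

```python
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9': é takes two bytes
cp1252_bytes = "café".encode("cp1252")   # b'caf\xe9': é is the single byte 0xE9

# Decoding with the encoding that produced the bytes round-trips cleanly.
assert utf8_bytes.decode("utf-8") == "café"
assert cp1252_bytes.decode("cp1252") == "café"

# Decoding as ASCII fails on the first byte above 0x7F.
try:
    utf8_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The exception's `start` attribute gives the index of the offending byte, which helps when hunting for the non-ASCII character in a long input.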

# The Unicode Sandwich

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to `str`s on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
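As a rough sketch of the sandwich shape, the hypothetical helper below (`normalise_city` is a made-up name, and the sketch assumes Python 3 syntax) decodes at the boundary, works only with text in the middle, and encodes again on the way out:

```python
def normalise_city(raw, input_encoding, output_encoding="utf-8"):
    """Hypothetical helper: decode on the way in, encode on the way out."""
    text = raw.decode(input_encoding)      # bottom slice: bytes -> text
    text = text.strip().title()            # the meat: work only with text
    return text.encode(output_encoding)    # top slice: text -> bytes

latin1_input = b"  z\xfcrich \n"           # "zürich" encoded as ISO-8859-1
result = normalise_city(latin1_input, "iso-8859-1")
print(result)
```

Because the encodings only appear at the two edges, changing the input or output encoding never touches the logic in the middle.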

## Input / Decode

### Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a `u`. E.g.

``````u"Zürich"
``````

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

``````# encoding: utf-8
``````

This is only necessary when you have non-ASCII in your source code.

### Files

Usually non-ASCII data is received from a file. The `io` module provides a `TextIOWrapper` that decodes your file on the fly, using a given `encoding`. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

``````import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()
``````

`my_unicode_string` would then be suitable for passing to Markdown. If you get a `UnicodeDecodeError` from the `read()` line, then you've probably used the wrong encoding value.

### CSV Files

The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

``````from backports import csv
import io

def read_rows():
    # Wrapped in a generator function so that `yield` is valid
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row
``````

### Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add:

``````charset="utf8",
use_unicode=True
``````

E.g.

``````>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
``````
PostgreSQL

``````psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
``````

### HTTP

Web pages can be encoded in just about any encoding. The `Content-type` header should contain a `charset` field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in `response.text`.

### Manually

If you must decode strings manually, you can simply do `my_string.decode(encoding)`, where `encoding` is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get `UnicodeDecodeError` then you've probably got the wrong encoding.

## The meat of the sandwich

Work with Unicodes as you would normal strs.

## Output

### stdout / printing

`print` writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's `locale` is `en_GB.UTF-8`, the output will be encoded to `UTF-8`. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as one with a corrupt locale, can lead to unexpected print errors. The `PYTHONIOENCODING` environment variable can force the encoding for stdout.

### Files

Just like input, `io.open` can be used to transparently convert Unicodes to encoded byte strings.

### Database

The same configuration for reading will allow Unicodes to be written directly.

# Python 3

Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g. the regular `str` is now a Unicode string and the old `str` is now `bytes`.

The default encoding is UTF-8, so if you `.decode()` a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, `open()` operates in text mode by default, so returns decoded `str` (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
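A short sketch of these Python 3 defaults, using a temporary file whose name (`euro.txt`) is made up for the example. Passing `encoding=` explicitly is a choice made here so the result does not depend on the machine's locale.

```python
import os
import tempfile

data = "€".encode("utf-8")
assert isinstance(data, bytes)
default_decoded = data.decode()            # no argument: UTF-8 is assumed

# Passing encoding= explicitly avoids depending on the locale default.
path = os.path.join(tempfile.mkdtemp(), "euro.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write("price: €")
with open(path, encoding="utf-8") as fh:
    text = fh.read()                       # a str, already decoded
with open(path, "rb") as fh:
    raw = fh.read()                        # bytes, still encoded

print(default_decoded, type(text).__name__, raw)
```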

# Why you shouldn't use `sys.setdefaultencoding("utf8")`

It's a nasty hack (there's a reason you have to use `reload`) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details.

(Note: this answer is based on a short blog post about `einsum` I wrote a while ago.)

## What does `einsum` do?

Imagine that we have two multi-dimensional arrays, `A` and `B`. Now let's suppose we want to...

• multiply `A` with `B` in a particular way to create new array of products; and then maybe
• sum this new array along particular axes; and then maybe
• transpose the axes of the new array in a particular order.

There's a good chance that `einsum` will help us do this faster and more memory-efficiently than combinations of the NumPy functions like `multiply`, `sum` and `transpose` will allow.

## How does `einsum` work?

Here's a simple (but not completely trivial) example. Take the following two arrays:

``````import numpy as np

A = np.array([0, 1, 2])

B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
``````

We will multiply `A` and `B` element-wise and then sum along the rows of the new array. In "normal" NumPy we'd write:

``````>>> (A[:, np.newaxis] * B).sum(axis=1)
array([ 0, 22, 76])
``````

So here, the indexing operation on `A` lines up the first axes of the two arrays so that the multiplication can be broadcast. The rows of the array of products are then summed to return the answer.

Now if we wanted to use `einsum` instead, we could write:

``````>>> np.einsum("i,ij->i", A, B)
array([ 0, 22, 76])
``````

The signature string `"i,ij->i"` is the key here and needs a little bit of explaining. You can think of it in two halves. On the left-hand side (left of the `->`) we've labelled the two input arrays. To the right of `->`, we've labelled the array we want to end up with.

Here is what happens next:

• `A` has one axis; we've labelled it `i`. And `B` has two axes; we've labelled axis 0 as `i` and axis 1 as `j`.

• By repeating the label `i` in both input arrays, we are telling `einsum` that these two axes should be multiplied together. In other words, we're multiplying array `A` with each column of array `B`, just like `A[:, np.newaxis] * B` does.

• Notice that `j` does not appear as a label in our desired output; we've just used `i` (we want to end up with a 1D array). By omitting the label, we're telling `einsum` to sum along this axis. In other words, we're summing the rows of the products, just like `.sum(axis=1)` does.

That's basically all you need to know to use `einsum`. It helps to play about a little; if we leave both labels in the output, `"i,ij->ij"`, we get back a 2D array of products (same as `A[:, np.newaxis] * B`). If we give no output labels, `"i,ij->"`, we get back a single number (same as doing `(A[:, np.newaxis] * B).sum()`).

The great thing about `einsum` however, is that it does not build a temporary array of products first; it just sums the products as it goes. This can lead to big savings in memory use.
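The three variants of the signature string can be checked side by side against the explicit array of products. A minimal sketch, reusing the `A` and `B` from above (here `B` is built with `np.arange`, which gives the same values):

```python
import numpy as np

A = np.array([0, 1, 2])
B = np.arange(12).reshape(3, 4)

products = A[:, np.newaxis] * B            # explicit temporary array of products

row_sums = np.einsum("i,ij->i", A, B)      # j omitted from output: summed over
kept = np.einsum("i,ij->ij", A, B)         # both labels kept: no summation
total = np.einsum("i,ij->", A, B)          # no output labels: sum everything

print(row_sums)
```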

## A slightly bigger example

To explain the dot product, here are two new arrays:

``````A = np.array([[1, 1, 1],
              [2, 2, 2],
              [5, 5, 5]])

B = np.array([[0, 1, 0],
              [1, 1, 0],
              [1, 1, 1]])
``````

We will compute the dot product using `np.einsum("ij,jk->ik", A, B)`. Here's a picture showing the labelling of `A` and `B` and the output array that we get from the function:

You can see that label `j` is repeated - this means we're multiplying the rows of `A` with the columns of `B`. Furthermore, the label `j` is not included in the output - we're summing these products. Labels `i` and `k` are kept for the output, so we get back a 2D array.

It might be even clearer to compare this result with the array where the label `j` is not summed. Below, on the left you can see the 3D array that results from writing `np.einsum("ij,jk->ijk", A, B)` (i.e. we've kept label `j`):

Summing axis `j` gives the expected dot product, shown on the right.
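Since the pictures are not reproduced here, the same comparison can be sketched in code: the 3D array that keeps `j`, summed over that axis, should equal the ordinary matrix product.

```python
import numpy as np

A = np.array([[1, 1, 1], [2, 2, 2], [5, 5, 5]])
B = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1]])

dot = np.einsum("ij,jk->ik", A, B)         # j repeated and dropped: multiply, then sum
unsummed = np.einsum("ij,jk->ijk", A, B)   # j kept: 3D array of individual products

print(dot)
print(unsummed.shape)
```

Here axis 1 of `unsummed` corresponds to the label `j`, so `unsummed.sum(axis=1)` gives back `dot`.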

## Some exercises

To get more of a feel for `einsum`, it can be useful to implement familiar NumPy array operations using the subscript notation. Anything that involves combinations of multiplying and summing axes can be written using `einsum`.

Let A and B be two 1D arrays with the same length. For example, `A = np.arange(10)` and `B = np.arange(5, 15)`.

• The sum of `A` can be written:

``````np.einsum("i->", A)
``````
• Element-wise multiplication, `A * B`, can be written:

``````np.einsum("i,i->i", A, B)
``````
• The inner product or dot product, `np.inner(A, B)` or `np.dot(A, B)`, can be written:

``````np.einsum("i,i->", A, B) # or just use "i,i"
``````
• The outer product, `np.outer(A, B)`, can be written:

``````np.einsum("i,j->ij", A, B)
``````

For 2D arrays, `C` and `D`, provided that the axes are compatible lengths (both the same length or one of them has length 1), here are a few examples:

• The trace of `C` (sum of main diagonal), `np.trace(C)`, can be written:

``````np.einsum("ii", C)
``````
• Element-wise multiplication of `C` and the transpose of `D`, `C * D.T`, can be written:

``````np.einsum("ij,ji->ij", C, D)
``````
• Multiplying each element of `C` by the array `D` (to make a 4D array), `C[:, :, None, None] * D`, can be written:

``````np.einsum("ij,kl->ijkl", C, D)
``````
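One way to work through exercises like these is to compare each `einsum` call with the NumPy expression it is meant to replace. The sketch below does that for the three 2D examples. The random 3x3 arrays and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, chosen for illustration
C = rng.integers(0, 10, size=(3, 3))
D = rng.integers(0, 10, size=(3, 3))

trace_ok = np.einsum("ii", C) == np.trace(C)
transpose_ok = np.array_equal(np.einsum("ij,ji->ij", C, D), C * D.T)
outer_ok = np.array_equal(np.einsum("ij,kl->ijkl", C, D),
                          C[:, :, None, None] * D)

print(trace_ok, transpose_ok, outer_ok)
```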

# How does asyncio work?

Before answering this question we need to understand a few basic terms; skip any that you already know.

## Generators

Generators are objects that allow us to suspend the execution of a Python function. User-defined generators are implemented using the keyword `yield`. By creating a normal function containing the `yield` keyword, we turn that function into a generator:

``````>>> def test():
...     yield 1
...     yield 2
...
>>> gen = test()
>>> next(gen)
1
>>> next(gen)
2
>>> next(gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
``````

As you can see, calling `next()` on the generator causes the interpreter to load `test`'s frame and return the `yield`ed value. Calling `next()` again causes the frame to be loaded into the interpreter stack again, and execution continues until another value is `yield`ed.

The third time `next()` is called, our generator has finished, so `StopIteration` is raised.

### Communicating with a generator

A lesser-known feature of generators is that you can communicate with them using two methods: `send()` and `throw()`.

``````>>> def test():
...     val = yield 1
...     print(val)
...     yield 2
...     yield 3
...
>>> gen = test()
>>> next(gen)
1
>>> gen.send("abc")
abc
2
>>> gen.throw(Exception())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in test
Exception
``````

Upon calling `gen.send()`, the value is passed as a return value from the `yield` keyword.

`gen.throw()` on the other hand, allows throwing Exceptions inside generators, with the exception raised at the same spot `yield` was called.

### Returning values from generators

Returning a value from a generator results in the value being put inside the `StopIteration` exception. We can later recover the value from the exception and use it as needed.

``````>>> def test():
...     yield 1
...     return "abc"
...
>>> gen = test()
>>> next(gen)
1
>>> try:
...     next(gen)
... except StopIteration as exc:
...     print(exc.value)
...
abc
``````

## Behold, a new keyword: `yield from`

Python 3.3 added a new piece of syntax: `yield from` (PEP 380). It lets us pass any `next()`, `send()` and `throw()` through to the inner-most nested generator. If the inner generator returns a value, it is also the value of the `yield from` expression:

``````>>> def inner():
...     inner_result = yield 2
...     print("inner", inner_result)
...     return 3
...
>>> def outer():
...     yield 1
...     val = yield from inner()
...     print("outer", val)
...     yield 4
...
>>> gen = outer()
>>> next(gen)
1
>>> next(gen) # Goes inside inner() automatically
2
>>> gen.send("abc")
inner abc
outer 3
4
``````

I've written an article to further elaborate on this topic.

## Putting it all together

With the introduction of `yield from` in Python 3.3, we became able to create generators inside generators that, just like a tunnel, pass data back and forth between the inner-most and outer-most generators. This spawned a new meaning for generators - coroutines.

Coroutines are functions that can be stopped and resumed while being run. In Python, they are defined using the `async def` keyword. Much like generators, they too use their own form of `yield from`, which is `await`. Before `async` and `await` were introduced in Python 3.5, we created coroutines in the exact same way generators were created (with `yield from` instead of `await`).

``````async def inner():
    return 1

async def outer():
    await inner()
``````

Like every iterator or generator that implements the `__iter__()` method, coroutines implement `__await__()`, which allows them to continue every time `await coro` is called.
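To make the generator lineage concrete, the sketch below drives a coroutine by hand with `send()`, just as one would drive a generator, and then runs the same coroutine through the event loop. It is an adaptation of the snippet above: `outer()` is changed to return a value so there is something to observe, and it assumes Python 3.7+ for `asyncio.run()`.

```python
import asyncio

async def inner():
    return 1

async def outer():
    return await inner() + 1

# Drive the coroutine by hand, exactly like a generator.
coro = outer()
try:
    coro.send(None)
    manual_result = None
except StopIteration as exc:
    manual_result = exc.value              # the return value rides on StopIteration

# The usual way: let the event loop do the driving.
loop_result = asyncio.run(outer())
print(manual_result, loop_result)
```

Because nothing in `inner()` ever suspends, a single `send(None)` runs the whole chain to completion.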

There's a nice sequence diagram inside the Python docs that you should check out.

In asyncio, apart from coroutine functions, we have 2 important objects: tasks and futures.

### Futures

Futures are objects that have the `__await__()` method implemented, and their job is to hold a certain state and result. The state can be one of the following:

1. PENDING - future does not have any result or exception set.
2. CANCELLED - future was cancelled using `fut.cancel()`
3. FINISHED - future was finished, either by a result set using `fut.set_result()` or by an exception set using `fut.set_exception()`

The result, just like you have guessed, can either be a Python object, that will be returned, or an exception which may be raised.

Another important feature of `future` objects is that they contain a method called `add_done_callback()`. This method allows functions to be called as soon as the future is done - whether it raised an exception or finished.

### Tasks

Task objects are special futures, which wrap around coroutines, and communicate with the inner-most and outer-most coroutines. Every time a coroutine `await`s a future, the future is passed all the way back to the task (just like in `yield from`), and the task receives it.

Next, the task binds itself to the future. It does so by calling `add_done_callback()` on the future. From now on, whenever the future is done, by either being cancelled, passed an exception or passed a Python object as a result, the task's callback will be called, and it will rise back up to existence.
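The interplay between a future, a done-callback and a task can be sketched with the public asyncio API. The names `waiter` and `events` are made up for this example, and it assumes Python 3.7+ for `asyncio.run()` and `asyncio.get_running_loop()`.

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()              # starts out PENDING
    events = []
    fut.add_done_callback(lambda f: events.append(("callback", f.result())))

    async def waiter():
        value = await fut                   # suspends until the future is done
        events.append(("waiter", value))
        return value

    task = asyncio.create_task(waiter())    # the task wraps the coroutine
    await asyncio.sleep(0)                  # let waiter() run up to `await fut`
    pending_before = not fut.done()
    fut.set_result("data")                  # FINISHED: callbacks get scheduled
    result = await task
    return pending_before, result, events

pending_before, result, events = asyncio.run(main())
print(pending_before, result, events)
```

Setting the result does not run the waiting coroutine immediately. It schedules the callbacks, and the task resumes on a later iteration of the event loop.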

# Asyncio

The final burning question we must answer is - how is the IO implemented?

Deep inside asyncio, we have an event loop. An event loop of tasks. The event loop's job is to call tasks every time they are ready and coordinate all that effort into one single working machine.

The IO part of the event loop is built upon a single crucial function called `select`. Select is a blocking function, implemented by the operating system underneath, that allows waiting on sockets for incoming or outgoing data. Upon receiving data it wakes up, and returns the sockets which received data, or the sockets which are ready for writing.

When you try to receive or send data over a socket through asyncio, what actually happens below is that the socket is first checked if it has any data that can be immediately read or sent. If its `.send()` buffer is full, or the `.recv()` buffer is empty, the socket is registered to the `select` function (by simply adding it to one of the lists, `rlist` for `recv` and `wlist` for `send`) and the appropriate function `await`s a newly created `future` object, tied to that socket.

When all available tasks are waiting for futures, the event loop calls `select` and waits. When one of the sockets has incoming data, or its `send` buffer has drained, asyncio checks for the future object tied to that socket, and sets it to done.

Now all the magic happens. The future is set to done, the task that added itself before with `add_done_callback()` rises back to life, and calls `.send()` on the coroutine, which resumes the inner-most coroutine (because of the `await` chain), and you read the newly received data from a nearby buffer it was spilled into.

Method chain again, in case of `recv()`:

1. `select.select` waits.
2. A ready socket, with data is returned.
3. Data from the socket is moved into a buffer.
4. `future.set_result()` is called.
5. Task that added itself with `add_done_callback()` is now woken up.
6. Task calls `.send()` on the coroutine which goes all the way into the inner-most coroutine and wakes it up.
7. Data is being read from the buffer and returned to our humble user.
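The chain above can be imitated with a toy loop built from a plain generator and `select.select`. This is a deliberately simplified sketch and not how asyncio is actually written: it handles a single generator, has no futures or tasks, and the names `recv_when_ready` and `run_until_complete` are made up for the example.

```python
import select
import socket

def recv_when_ready(sock):
    """Generator standing in for a coroutine that waits for incoming data."""
    yield sock                      # "I cannot continue until this socket is readable"
    return sock.recv(1024)          # resumed: data is waiting, so recv() will not block

def run_until_complete(gen):
    """Toy event loop for a single generator."""
    waiting_on = next(gen)          # run the generator up to its first yield
    while True:
        # Block in select() until the socket we are waiting on has data.
        select.select([waiting_on], [], [])
        try:
            # Wake the generator up with send(), as a task would.
            waiting_on = gen.send(None)
        except StopIteration as exc:
            return exc.value        # hand the received data back to the caller

left, right = socket.socketpair()
right.sendall(b"ping")
received = run_until_complete(recv_when_ready(left))
left.close()
right.close()
print(received)
```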

In summary, asyncio uses generator capabilities, that allow pausing and resuming functions. It uses `yield from` capabilities that allow passing data back and forth from the inner-most generator to the outer-most. It uses all of those in order to halt function execution while it's waiting for IO to complete (by using the OS `select` function).

And the best of all? While one function is paused, another may run and interleave with the delicate fabric, which is asyncio.

If your main goal is to visualize the correlation matrix, rather than creating a plot per se, the convenient `pandas` styling options are a viable built-in solution:

``````import pandas as pd
import numpy as np

rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
corr = df.corr()
corr.style.background_gradient(cmap="coolwarm")
# "RdBu_r", "BrBG_r", & "PuOr_r" are other good diverging colormaps
``````

Note that this needs to be in a backend that supports rendering HTML, such as the JupyterLab Notebook.

# Styling

You can easily limit the digit precision (in pandas 1.3 and later, `.format(precision=2)` replaces the deprecated `set_precision(2)`):

``````corr.style.background_gradient(cmap="coolwarm").set_precision(2)
``````

Or get rid of the digits altogether if you prefer the matrix without annotations:

``````corr.style.background_gradient(cmap="coolwarm").set_properties(**{"font-size": "0pt"})
``````

The styling documentation also includes instructions of more advanced styles, such as how to change the display of the cell the mouse pointer is hovering over.

# Time comparison

In my testing, `style.background_gradient()` was 4x faster than `plt.matshow()` and 120x faster than `sns.heatmap()` with a 10x10 matrix. Unfortunately it doesn't scale as well as `plt.matshow()`: the two take about the same time for a 100x100 matrix, and `plt.matshow()` is 10x faster for a 1000x1000 matrix.

# Saving

There are a few possible ways to save the stylized dataframe:

• Return the HTML by appending the `render()` method and then write the output to a file.
• Save as an `.xlsx` file with conditional formatting by appending the `to_excel()` method.
• Combine with imgkit to save a bitmap.
• Take a screenshot (like I have done here).

# Normalize colors across the entire matrix (pandas >= 0.24)

By setting `axis=None`, it is now possible to compute the colors based on the entire matrix rather than per column or per row:

``````corr.style.background_gradient(cmap="coolwarm", axis=None)
``````

# Single corner heatmap

Since many people are reading this answer I thought I would add a tip for how to only show one corner of the correlation matrix. I find this easier to read myself, since it removes the redundant information.

``````# Fill diagonal and upper half with NaNs
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan
(corr
 .style
 .background_gradient(cmap="coolwarm", axis=None, vmin=-1, vmax=1)
 .highlight_null(null_color="#f1f1f1")  # Color NaNs grey
 .set_precision(2))
``````

Without actual data it is hard to answer the question but I guess you are looking for something like this:

``````Top15["Citable docs per Capita"].corr(Top15["Energy Supply per Capita"])
``````

That calculates the correlation between your two columns `"Citable docs per Capita"` and `"Energy Supply per Capita"`.

To give an example:

``````import pandas as pd

df = pd.DataFrame({"A": range(4), "B": [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6
``````

Then

``````df["A"].corr(df["B"])
``````

gives `1` as expected.

Now, if you change a value, e.g.

``````df.loc[2, "B"] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0
``````

the command

``````df["A"].corr(df["B"])
``````

returns

``````0.99586
``````

which is still close to 1, as expected.
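By default `Series.corr` computes the Pearson correlation coefficient, so both numbers above can be reproduced by hand. The helper `pearson` below is a hypothetical name for a plain-Python version of the formula, written only to show where the values come from.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

a = [0, 1, 2, 3]
r_linear = pearson(a, [0, 2, 4, 6])        # B is exactly 2 * A
r_changed = pearson(a, [0, 2, 4.5, 6])     # one value nudged off the line
print(round(r_linear, 5), round(r_changed, 5))
```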

If you apply `.corr` directly to your dataframe, it will return all pairwise correlations between your columns; that's why you then observe `1s` at the diagonal of your matrix (each column is perfectly correlated with itself).

``````df.corr()
``````

will therefore return

``````          A         B
A  1.000000  0.995862
B  0.995862  1.000000
``````

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get `NaN`s in your solution - check this post for an example.

If you want to filter entries above/below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer and if you then run into the issue with overlapping axis-labels check the following post.

When objects are instantiated, the object itself is passed into the self parameter.

Because of this, the object's data is bound to the object. Below is an example of how you might like to visualize what each object's data might look like. Notice how 'self' is replaced with the object's name. I'm not saying this example diagram below is wholly accurate, but it will hopefully serve a purpose in visualizing the use of self.

The object is passed into the self parameter so that the object can keep hold of its own data.

Although this may not be wholly accurate, think of the process of instantiating an object like this: when an object is made, it uses the class as a template for its own data and methods. Without passing its own name into the self parameter, the attributes and methods in the class would remain a general template and would not be referenced to (belong to) the object. So passing the object's name into the self parameter means that if 100 objects are instantiated from the one class, they can all keep track of their own data and methods.

See the illustration below:
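The same idea can be sketched in code. The `Account` class here is a hypothetical example, not taken from the question. It shows that `obj.method(...)` is the same call as `Class.method(obj, ...)`, with the object filling the `self` slot.

```python
class Account:
    def __init__(self, owner):
        self.owner = owner          # stored on this particular object
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount
        return self.balance

alice = Account("Alice")
bob = Account("Bob")

alice.deposit(50)                   # Python passes `alice` as `self`
Account.deposit(bob, 20)            # the same kind of call with `self` written out

print(alice.balance, bob.balance)
```

Each object ends up with its own `balance`, even though both share the single `deposit` method defined on the class.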

This is kind of overkill but let's give it a go. First let's use statsmodels to find out what the p-values should be:

``````import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
``````

and we get

``````                         OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
``````

Ok, let's reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using matrix algebra. But what the heck.

``````lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

# Degrees of freedom = rows - columns; .shape works for both DataFrame and ndarray
p_values = [2*(1-stats.t.cdf(np.abs(i), (len(newX)-newX.shape[1]))) for i in ts_b]

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)
``````

And this gives us.

``````    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306
``````

So we can reproduce the values from statsmodels.
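The matrix-algebra steps above can also be condensed into a single NumPy-only helper. This is a minimal sketch on made-up data: the function name `ols_standard_errors` and the toy arrays are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return (coefficients, standard errors) for OLS with an intercept.

    Hypothetical helper: the name and signature are illustrative only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    # Least-squares solution of A @ beta = y.
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    # Residual degrees of freedom: observations minus estimated parameters.
    dof = A.shape[0] - A.shape[1]
    mse = residuals @ residuals / dof
    # Var(beta) = MSE * diag((A^T A)^-1)
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    return beta, se

# Toy data (made up for illustration): one predictor, five observations.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta, se = ols_standard_errors(x, y)
print(beta, se)
```

For a single predictor the slope's standard error should agree with the textbook formula sqrt(MSE / Sxx), which is a quick way to sanity-check the helper.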

# The golden spiral method

You said you couldn’t get the golden spiral method to work and that’s a shame because it’s really, really good. I would like to give you a complete understanding of it so that maybe you can understand how to keep this away from being “bunched up.”

So here’s a fast, non-random way to create a lattice that is approximately correct; as discussed above, no lattice will be perfect, but this may be good enough. It is compared to other methods e.g. at BendWavy.org but it just has a nice and pretty look as well as a guarantee about even spacing in the limit.

## Primer: sunflower spirals on the unit disk

To understand this algorithm, I first invite you to look at the 2D sunflower spiral algorithm. This is based on the fact that the most irrational number is the golden ratio `(1 + sqrt(5))/2` and if one emits points by the approach “stand at the center, turn a golden ratio of whole turns, then emit another point in that direction,” one naturally constructs a spiral which, as you get to higher and higher numbers of points, nevertheless refuses to have well-defined ‘bars’ that the points line up on. (Note 1.)

The algorithm for even spacing on a disk is,

``````from numpy import pi, cos, sin, sqrt, arange
import matplotlib.pyplot as pp

num_pts = 100
indices = arange(0, num_pts, dtype=float) + 0.5

r = sqrt(indices/num_pts)
theta = pi * (1 + 5**0.5) * indices

pp.scatter(r*cos(theta), r*sin(theta))
pp.show()
``````

and it produces results that look like (n=100 and n=1000):

## Spacing the points radially

The key strange thing is the formula `r = sqrt(indices / num_pts)`; how did I come to that one? (Note 2.)

Well, I am using the square root here because I want these to have even-area spacing around the disk. That is the same as saying that in the limit of large N I want a little region R ∈ (r, r + dr), Θ ∈ (θ, θ + dθ) to contain a number of points proportional to its area, which is r dr dθ. Now if we pretend that we are talking about a random variable here, this has a straightforward interpretation as saying that the joint probability density for (R, Θ) is just c r for some constant c. Normalization on the unit disk would then force c = 1/π.

Now let me introduce a trick. It comes from probability theory where it’s known as sampling the inverse CDF: suppose you wanted to generate a random variable with a probability density f(z) and you have a random variable U ~ Uniform(0, 1), just like comes out of `random()` in most programming languages. How do you do this?

1. First, turn your density into a cumulative distribution function or CDF, which we will call F(z). A CDF, remember, increases monotonically from 0 to 1 with derivative f(z).
2. Then calculate the CDF’s inverse function F⁻¹(z).
3. You will find that Z = F⁻¹(U) is distributed according to the target density. (Note 3.)
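The three steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper name `sample_inverse_cdf` is made up, and the exponential distribution is chosen as the example target because its CDF, F(z) = 1 − exp(−z), inverts in closed form to F⁻¹(u) = −ln(1 − u).

```python
import math
import random

def sample_inverse_cdf(inverse_cdf, n, rng):
    """Draw n samples by pushing Uniform(0, 1) draws through an inverse CDF."""
    return [inverse_cdf(rng.random()) for _ in range(n)]

def exponential_inverse_cdf(u):
    # Inverse of F(z) = 1 - exp(-z); u lies in [0, 1), so 1 - u is never 0.
    return -math.log(1.0 - u)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = sample_inverse_cdf(exponential_inverse_cdf, 100_000, rng)

# The exponential density exp(-z) has mean 1, so the sample mean should be near 1.
mean = sum(samples) / len(samples)
print(mean)
```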

Now the golden-ratio spiral trick spaces the points out in a nicely even pattern for θ so let’s integrate that out; for the unit disk we are left with F(r) = r². So the inverse function is F⁻¹(u) = u^(1/2), and therefore we would generate random points on the disk in polar coordinates with `r = sqrt(random()); theta = 2 * pi * random()`.

Now instead of randomly sampling this inverse function we’re uniformly sampling it, and the nice thing about uniform sampling is that our results about how points are spread out in the limit of large N will behave as if we had randomly sampled it. This combination is the trick. Instead of `random()` we use `(arange(0, num_pts, dtype=float) + 0.5)/num_pts`, so that, say, if we want to sample 10 points the inputs are `u = 0.05, 0.15, 0.25, ... 0.95` and the radii are `r = sqrt(u)`. We uniformly sample the inverse CDF to get equal-area spacing, and we use the sunflower increment to avoid awful “bars” of points in the output.
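As a quick check of the equal-area claim, here is a small standard-library sketch (the helper name `disk_radii` is made up for illustration). It evaluates the inverse CDF at the midpoints and counts how many points fall inside radius 0.5, which covers a quarter of the unit disk's area.

```python
import math

def disk_radii(num_pts):
    """Radii from evaluating F^-1(u) = sqrt(u) at the midpoints u = (k + 0.5) / num_pts."""
    return [math.sqrt((k + 0.5) / num_pts) for k in range(num_pts)]

radii = disk_radii(10)   # sqrt(0.05), sqrt(0.15), ..., sqrt(0.95)

# Equal-area check: the fraction of points inside radius 0.5 should match
# the fraction of the disk's area inside that radius, 0.5**2 = 0.25.
many = disk_radii(1000)
inside = sum(r < 0.5 for r in many) / len(many)
print(inside)  # prints 0.25
```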

## Now doing the sunflower on a sphere

The changes that we need to make to dot the sphere with points merely involve switching out the polar coordinates for spherical coordinates. The radial coordinate of course doesn’t enter into this because we’re on a unit sphere. To keep things a little more consistent here, even though I was trained as a physicist I’ll use mathematicians’ coordinates where 0 ≤ φ ≤ π is latitude coming down from the pole and 0 ≤ θ ≤ 2π is longitude. So the difference from above is that we are basically replacing the variable r with φ.

Our area element, which was r dr dθ, now becomes the not-much-more-complicated sin(φ) dφ dθ. So our joint density for uniform spacing is sin(φ)/(4π). Integrating out θ, we find f(φ) = sin(φ)/2, thus F(φ) = (1 − cos(φ))/2. Inverting this we can see that a uniform random variable would look like acos(1 − 2 u), but we sample uniformly instead of randomly, so we instead use φ_k = acos(1 − 2 (k + 0.5)/N). And the rest of the algorithm is just projecting this onto the x, y, and z coordinates:

``````from numpy import pi, cos, sin, arccos, arange
import mpl_toolkits.mplot3d
import matplotlib.pyplot as pp

num_pts = 1000
indices = arange(0, num_pts, dtype=float) + 0.5

phi = arccos(1 - 2*indices/num_pts)
theta = pi * (1 + 5**0.5) * indices

x, y, z = cos(theta) * sin(phi), sin(theta) * sin(phi), cos(phi)

pp.figure().add_subplot(111, projection="3d").scatter(x, y, z)
pp.show()
``````
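If you want the points without plotting them, the same formulas can be wrapped in a plain function. The sketch below uses only the standard library; the name `fibonacci_sphere` is an assumption chosen for illustration. It also checks two properties that follow from the derivation: every point has unit length, and the z values are evenly spaced, so exactly half of the points land in the upper hemisphere.

```python
import math

def fibonacci_sphere(num_pts):
    """Return num_pts (x, y, z) tuples on the unit sphere (golden-spiral lattice)."""
    golden_turn = math.pi * (1 + 5 ** 0.5)  # same angular increment as above
    points = []
    for k in range(num_pts):
        i = k + 0.5
        phi = math.acos(1 - 2 * i / num_pts)  # polar angle measured from the pole
        theta = golden_turn * i               # longitude
        points.append((math.cos(theta) * math.sin(phi),
                       math.sin(theta) * math.sin(phi),
                       math.cos(phi)))
    return points

pts = fibonacci_sphere(1000)
# Largest deviation of any point's length from 1.
max_norm_error = max(abs(math.sqrt(x * x + y * y + z * z) - 1) for x, y, z in pts)
# Equal-area bands on a sphere have equal height, so z is evenly spaced.
upper_half = sum(z > 0 for _, _, z in pts)
print(max_norm_error < 1e-12, upper_half)
```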

Again for n=100 and n=1000 the results look like:

# Further research

I wanted to give a shout out to Martin Roberts’s blog. Note that above I created an offset of my indices by adding 0.5 to each index. This was just visually appealing to me, but it turns out that the choice of offset matters a lot and is not constant over the interval and can mean getting as much as 8% better accuracy in packing if chosen correctly. There should also be a way to get his R2 sequence to cover a sphere and it would be interesting to see if this also produced a nice even covering, perhaps as-is but perhaps needing to be, say, taken from only a half of the unit square cut diagonally or so and stretched around to get a circle.

# Notes

1. Those “bars” are formed by rational approximations to a number, and the best rational approximations to a number come from its continued fraction expression, `z + 1/(n_1 + 1/(n_2 + 1/(n_3 + ...)))` where `z` is an integer and `n_1, n_2, n_3, ...` is either a finite or infinite sequence of positive integers:

``````from math import floor

def continued_fraction(r):
    while True:
        n = floor(r)
        yield n
        if r == n:  # nothing left after the integer part: the expansion ends
            return
        r = 1/(r - n)
``````

Since the fraction part `1/(...)` is always between zero and one, a large integer in the continued fraction allows for a particularly good rational approximation: “one divided by something between 100 and 101” is better than “one divided by something between 1 and 2.” The most irrational number is therefore the one which is `1 + 1/(1 + 1/(1 + ...))` and has no particularly good rational approximations; one can solve φ = 1 + 1/φ by multiplying through by φ to get the formula for the golden ratio.

2. For folks who are not so familiar with NumPy -- all of the functions are “vectorized,” so that `sqrt(array)` is the same as what other languages might write `map(sqrt, array)`. So this is a component-by-component `sqrt` application. The same also holds for division by a scalar or addition with scalars -- those apply to all components in parallel.

3. The proof is simple once you know that this is the result. If you ask what’s the probability that z < Z < z + dz, this is the same as asking what’s the probability that z < F⁻¹(U) < z + dz, apply F to all three expressions noting that it is a monotonically increasing function, hence F(z) < U < F(z + dz), expand the right hand side out to find F(z) + f(z) dz, and since U is uniform this probability is just f(z) dz as promised.
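Note 1's claim about rational approximations can be checked numerically. The sketch below is illustrative only: the helper names `convergents` and `scaled_errors` are made up, and it uses the standard recurrence for continued-fraction convergents. It compares π, whose expansion contains the large term 15, with the golden ratio, whose terms are all 1. Multiplying each error by the squared denominator makes approximations with different denominators comparable.

```python
from fractions import Fraction
from math import floor, pi, sqrt

def convergents(r, count):
    """Yield up to `count` continued-fraction convergents of r as Fractions."""
    h_prev2, h_prev1 = 0, 1   # numerators   h[-2], h[-1]
    k_prev2, k_prev1 = 1, 0   # denominators k[-2], k[-1]
    for _ in range(count):
        n = floor(r)
        h = n * h_prev1 + h_prev2
        k = n * k_prev1 + k_prev2
        yield Fraction(h, k)
        h_prev2, h_prev1 = h_prev1, h
        k_prev2, k_prev1 = k_prev1, k
        if r == n:            # r was an integer: the expansion terminates
            return
        r = 1 / (r - n)

def scaled_errors(r, count):
    """|r - h/k| * k**2 for each convergent h/k; small values mean unusually good."""
    return [abs(r - float(c)) * c.denominator ** 2 for c in convergents(r, count)]

golden = (1 + sqrt(5)) / 2
print(list(convergents(pi, 4)))      # 3, 22/7, 333/106, 355/113
print(list(convergents(golden, 6)))  # ratios of consecutive Fibonacci numbers
print(min(scaled_errors(pi, 4)), min(scaled_errors(golden, 6)))
```

For these inputs the scaled error of 355/113 is about 0.003, while the golden ratio's scaled errors stay between roughly 0.38 and 0.62, which is the sense in which it has no particularly good rational approximations.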

# TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
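The linked set-based answer is not reproduced here. As a rough sketch of the idea only, a set-based filter might look like the following: split the sentence into words and test each one against a `set`, which is a constant-time lookup on average. The function name `remove_banned` and the sample words are assumptions for illustration, and a simple `\w+` pattern is still used to find word boundaries.

```python
import re

# Example banned words (made up for illustration), stored in lowercase.
banned_words = {"foobar", "foobah", "fooxar", "foozap", "fooza"}

WORD = re.compile(r"\w+")

def remove_banned(sentence, banned):
    """Delete every word whose lowercase form is in `banned`; keep the rest as-is."""
    def replace(match):
        word = match.group()
        return "" if word.lower() in banned else word
    return WORD.sub(replace, sentence)

cleaned = remove_banned("A foobar walked into a Fooza bar", banned_words)
print(repr(cleaned))
```

Only the banned words are removed, so the whitespace around them stays in place.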

# Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.

## Example

``````["foobar", "foobah", "fooxar", "foozap", "fooza"]
``````

The list is converted to a trie:

``````{
    "f": {
        "o": {
            "o": {
                "x": {
                    "a": {
                        "r": {
                            "": 1
                        }
                    }
                },
                "b": {
                    "a": {
                        "r": {
                            "": 1
                        },
                        "h": {
                            "": 1
                        }
                    }
                },
                "z": {
                    "a": {
                        "": 1,
                        "p": {
                            "": 1
                        }
                    }
                }
            }
        }
    }
}
``````

And then to this regex pattern:

``````r"foo(?:ba[hr]|xar|zap?)"
``````

The huge advantage is that to test if `zoo` matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's a preprocess overkill for 5 words, but it shows promising results for many thousand words.
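As a quick sanity check, the pattern above can be tested against the original word list with the standard `re` module. This is only a small illustration; the list of near misses is made up.

```python
import re

words = ["foobar", "foobah", "fooxar", "foozap", "fooza"]
pattern = re.compile(r"foo(?:ba[hr]|xar|zap?)")

# Every listed word should match in full...
all_match = all(pattern.fullmatch(w) for w in words)
# ...while near misses and unrelated words should not.
near_misses = ["zoo", "foo", "fooba", "foozapp"]
none_match = not any(pattern.fullmatch(w) for w in near_misses)
print(all_match, none_match)  # prints: True True
```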

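Note that `(?:)` non-capturing groups are used because a bare alternation such as `foobar|baz` would match `foobar` or `baz` but not `foobaz`, and a capturing group such as `foo(bar|baz)` would store match information that is never needed.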

## Code

Here's a slightly modified gist, which we can use as a `trie.py` library:

``````import re

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        # Walk down the trie, creating nested dicts as needed.
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        # An empty-string key marks the end of a complete word.
        ref[""] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        # A node holding only the end marker contributes no further pattern.
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []  # alternatives that continue with a sub-pattern
        cc = []   # single characters that end a word (character class)
        q = 0     # set to 1 if a word may end at this node
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # recurse was None: this character is a leaf.
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append("[" + "".join(cc) + "]")

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
``````

## Test

Here's a small test (the same as this one):

``````# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open("/usr/share/dict/american-english") as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    # Word boundaries keep the pattern from matching inside longer words.
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
``````

It outputs:

``````TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms

TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms

TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms

TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms

TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
``````

For info, the regex begins like this:

Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz `twopi`: