diag | NumPy | Python Methods and Functions

**numpy.diag(v, k=0):** extracts a diagonal or constructs a diagonal array.

**Parameters:**

v : array_like

k : [int, optional, 0 by default] Index of the diagonal required; k > 0 selects a diagonal above the main diagonal, and k < 0 one below it.

**Returns:**

ndarray

```
# Python program illustrating
# the numpy.diag method

import numpy as geek

# create a matrix from a nested list
a = geek.matrix([[1, 21, 30],
                 [63, 434, 3],
                 [54, 54, 56]])

print("Main diagonal elements:", geek.diag(a))
print("Diagonal above main diagonal:", geek.diag(a, 1))
print("Diagonal below main diagonal:", geek.diag(a, -1))
```

**Output:**

Main diagonal elements: [1 434 56]

Diagonal above main diagonal: [21 3]

Diagonal below main diagonal: [63 54]
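The example above only extracts diagonals. As a minimal sketch of the other direction, passing a 1-D sequence to `numpy.diag` builds a 2-D array with those values on the k-th diagonal. The variable names here are illustrative and not part of the original program:

```python
import numpy as np

values = [1, 2, 3]

# 1-D input: build a square array with `values` on the main diagonal
main = np.diag(values)

# k=1 places the values one step above the main diagonal,
# so the result grows to 4x4 to make room for them
upper = np.diag(values, k=1)

# Extracting that same diagonal again returns the original values
round_trip = np.diag(upper, k=1)

print(main)
print(upper.shape)
print(round_trip)
```

Which of the two behaviours you get depends only on whether the input is 1-D or 2-D.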

**Links:**

https://docs.scipy.org/doc/numpy/reference/generated/numpy.diagflat.html#numpy.diagflat

**Notes:**

These NumPy programs will not run in the online IDE, so run them on your own system to explore them.

This article is provided by **Mohit Gupta_OMG**

A colleague is looking to generate UML class diagrams from heaps of Python source code. He's primarily interested in the inheritance relationships, and mildly interested in compositional relationships, and doesn't care much about class attributes that are just Python primitives.

The source code is pretty straightforward and not tremendously evil; it doesn't do any fancy metaclass magic, for example. (It's mostly from the days of Python 1.5.2, with some sprinklings of "modern" 2.3ish stuff.)

What's the best existing solution to recommend?

I noticed that every now and then I need to Google fopen all over again, just to build a mental image of what the primary differences between the modes are. So, I thought a diagram will be faster to read next time. Maybe someone else will find that helpful too.

- Don't decode/encode willy-nilly
- Don't assume your strings are UTF-8 encoded
- Try to convert strings to Unicode strings as soon as possible in your code
- Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
- Don't be tempted to use quick `reload` hacks

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

`UnicodeDecodeError: 'ascii' codec can't decode byte` generally happens when you try to convert a Python 2.x `str` that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode **code points** and can therefore hold any code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5, etc. **Strings are decoded to Unicode** and **Unicodes are encoded to strings**. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use `unicode()` (where the exception is thrown) as a quality gate to the rest of the code: it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

Unicode strings can be declared in your code using the `u` prefix to strings. E.g.

```
>>> my_u = u"my ünicôdé strįng"
>>> type(my_u)
<type 'unicode'>
```

Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

Conversion from `str` to Unicode can happen even when you don't explicitly call `unicode()`.

The following scenarios cause `UnicodeDecodeError` exceptions:

```
# Explicit conversion without encoding
unicode("€")

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("€")

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "€"

# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "€"
```

In the following diagram, you can see how the word `café` has been encoded in either "UTF-8" or "Cp1252" encoding, depending on the terminal type. In both examples, `caf` is just regular ASCII. In UTF-8, `é` is encoded using two bytes. In "Cp1252", `é` is 0xE9 (which also happens to be its Unicode code point value; that is no coincidence). The correct `decode()` is invoked and conversion to a Python Unicode string is successful:

In this diagram, `decode()` is called with `ascii` (which is the same as calling `unicode()` without an encoding given). As ASCII can't contain bytes greater than `0x7F`, this will throw a `UnicodeDecodeError` exception:
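The same three decodes can also be tried directly in code. This is only a sketch, written with Python 3 `bytes` literals rather than Python 2.x `str`; the byte values are the standard UTF-8 and Cp1252 encodings of `café`, and the variable names are illustrative:

```python
# "café" as raw bytes in two different encodings
utf8_bytes = b"caf\xc3\xa9"    # é takes two bytes in UTF-8
cp1252_bytes = b"caf\xe9"      # é is the single byte 0xE9 in Cp1252

# Decoding with the matching codec succeeds in both cases
from_utf8 = utf8_bytes.decode("utf-8")
from_cp1252 = cp1252_bytes.decode("cp1252")
print(from_utf8 == from_cp1252)

# Decoding with "ascii" fails on the first byte above 0x7F
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print(exc)
```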

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to `str`s on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
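A minimal sketch of the sandwich, written for Python 3 where the encoded type is `bytes` and the Unicode type is `str`. The helper name `shout` and the choice of UTF-8 are assumptions made for illustration:

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: decode on the way in, encode on the way out."""
    text = raw.decode(encoding)      # bottom slice: bytes -> Unicode
    text = text.upper()              # filling: work only with Unicode text
    return text.encode(encoding)     # top slice: Unicode -> bytes


result = shout("zürich".encode("utf-8"))
print(result.decode("utf-8"))
```

Only the first and last lines of the helper need to know the encoding; everything between them works on decoded text.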

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a `u`. E.g.

```
u"Zürich"
```

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

```
# encoding: utf-8
```

This is only necessary when you have non-ASCII in your **source code**.

Usually non-ASCII data is received from a file. The `io` module provides a `TextIOWrapper` that decodes your file on the fly, using a given `encoding`. You must use the correct encoding for the file; it can't be easily guessed. For example, for a UTF-8 file:

```
import io

with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()
```

`my_unicode_string` would then be suitable for passing to Markdown. If you get a `UnicodeDecodeError` from the `read()` line, then you've probably used the wrong encoding value.

The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

```
from backports import csv
import io

# Wrapped in a function so that `yield` is valid
def read_rows(path="my_utf8_file.txt"):
    with io.open(path, "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row
```

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL
In the connection string add:

```
charset="utf8",
use_unicode=True
```

E.g.

```
>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
```

PostgreSQL
Add:

```
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
```

Web pages can be encoded in just about any encoding. The `Content-type` header should contain a `charset` field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in `response.text`.

If you must decode strings manually, you can simply do `my_string.decode(encoding)`, where `encoding` is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get `UnicodeDecodeError` then you've probably got the wrong encoding.

Work with Unicodes as you would normal strs.

`print` writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's `locale` is `en_GB.UTF-8`, the output will be encoded to `UTF-8`. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The `PYTHONIOENCODING` environment variable can force the encoding for stdout.

Just like input, `io.open` can be used to transparently convert Unicodes to encoded byte strings.

The same configuration for reading will allow Unicodes to be written directly.
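As a minimal sketch of the output side, the snippet below writes a Unicode string through `io.open` and then reads the file back in binary mode to show the bytes that landed on disk. It is written for Python 3; the temporary path and the sample text are assumptions made so the example is self-contained:

```python
import io
import os
import tempfile

text = "Zürich costs 5 €"

# A temporary path so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), "out_utf8.txt")

# Text mode with an explicit encoding: Unicode in, UTF-8 bytes on disk
with io.open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# Reading back in binary mode shows the encoded bytes that were written
with io.open(path, "rb") as raw_file:
    raw = raw_file.read()

print(raw)
```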

Python 3 is no more Unicode capable than Python 2.x is; however, it is slightly less confused on the topic. E.g. the regular `str` is now a Unicode string and the old `str` is now `bytes`.

The default encoding is UTF-8, so if you `.decode()` a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, `open()` operates in text mode by default, so returns decoded `str` (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

`sys.setdefaultencoding("utf8")` is a nasty hack (there's a reason you have to use `reload`) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen.
See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details.

(Note: this answer is based on a short blog post about `einsum` I wrote a while ago.)

**What does `einsum` do?**

Imagine that we have two multi-dimensional arrays, `A` and `B`. Now let's suppose we want to...

- *multiply* `A` with `B` in a particular way to create a new array of products; and then maybe
- *sum* this new array along particular axes; and then maybe
- *transpose* the axes of the new array in a particular order.

There's a good chance that `einsum` will help us do this faster and more memory-efficiently than combinations of the NumPy functions like `multiply`, `sum` and `transpose` will allow.

**How does `einsum` work?**

Here's a simple (but not completely trivial) example. Take the following two arrays:

```
A = np.array([0, 1, 2])

B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
```

We will multiply `A` and `B` element-wise and then sum along the rows of the new array. In "normal" NumPy we'd write:

```
>>> (A[:, np.newaxis] * B).sum(axis=1)
array([ 0, 22, 76])
```

So here, the indexing operation on `A` lines up the first axes of the two arrays so that the multiplication can be broadcast. The rows of the array of products are then summed to return the answer.

Now if we wanted to use `einsum` instead, we could write:

```
>>> np.einsum("i,ij->i", A, B)
array([ 0, 22, 76])
```

The *signature* string `"i,ij->i"` is the key here and needs a little bit of explaining. You can think of it in two halves. On the left-hand side (left of the `->`) we've labelled the two input arrays. To the right of `->`, we've labelled the array we want to end up with.

Here is what happens next:

- `A` has one axis; we've labelled it `i`. And `B` has two axes; we've labelled axis 0 as `i` and axis 1 as `j`.
- By **repeating** the label `i` in both input arrays, we are telling `einsum` that these two axes should be **multiplied** together. In other words, we're multiplying array `A` with each column of array `B`, just like `A[:, np.newaxis] * B` does.
- Notice that `j` does not appear as a label in our desired output; we've just used `i` (we want to end up with a 1D array). By **omitting** the label, we're telling `einsum` to **sum** along this axis. In other words, we're summing the rows of the products, just like `.sum(axis=1)` does.

That's basically all you need to know to use `einsum`. It helps to play about a little; if we leave both labels in the output, `"i,ij->ij"`, we get back a 2D array of products (same as `A[:, np.newaxis] * B`). If we give no output labels, `"i,ij->"`, we get back a single number (same as doing `(A[:, np.newaxis] * B).sum()`).

The great thing about `einsum`, however, is that it does not build a temporary array of products first; it just sums the products as it goes. This can lead to big savings in memory use.
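The three output choices can be checked side by side. This sketch reuses the `A` and `B` defined earlier (with `B` rebuilt via `np.arange`, which gives the same values); the result names are illustrative:

```python
import numpy as np

A = np.array([0, 1, 2])
B = np.arange(12).reshape(3, 4)

products = A[:, np.newaxis] * B          # explicit temporary array

keep_both = np.einsum("i,ij->ij", A, B)  # no axis summed: 2D array of products
sum_rows = np.einsum("i,ij->i", A, B)    # j omitted: summed over axis 1
sum_all = np.einsum("i,ij->", A, B)      # no output labels: summed over everything

print(keep_both)
print(sum_rows)
print(sum_all)
```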

To explain the dot product, here are two new arrays:

```
A = np.array([[1, 1, 1],
              [2, 2, 2],
              [5, 5, 5]])

B = np.array([[0, 1, 0],
              [1, 1, 0],
              [1, 1, 1]])
```

We will compute the dot product using `np.einsum("ij,jk->ik", A, B)`. Here's a picture showing the labelling of `A` and `B` and the output array that we get from the function:

You can see that label `j` is repeated - this means we're multiplying the rows of `A` with the columns of `B`. Furthermore, the label `j` is not included in the output - we're summing these products. Labels `i` and `k` are kept for the output, so we get back a 2D array.

It might be even clearer to compare this result with the array where the label `j` is *not* summed. Below, on the left you can see the 3D array that results from writing `np.einsum("ij,jk->ijk", A, B)` (i.e. we've kept label `j`):

Summing axis `j` gives the expected dot product, shown on the right.
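The relationship between the unsummed 3D array and the dot product can be sketched numerically with the same `A` and `B`; the names `unsummed` and `dot` are illustrative:

```python
import numpy as np

A = np.array([[1, 1, 1],
              [2, 2, 2],
              [5, 5, 5]])
B = np.array([[0, 1, 0],
              [1, 1, 0],
              [1, 1, 1]])

# Keep label j: a 3D array holding every product A[i, j] * B[j, k]
unsummed = np.einsum("ij,jk->ijk", A, B)

# Drop label j: the products are summed over j, giving the dot product
dot = np.einsum("ij,jk->ik", A, B)

print(unsummed.shape)
print(dot)
```

Summing `unsummed` over its middle axis (`axis=1`, the `j` axis) gives the same array as `dot`.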

To get more of a feel for `einsum`, it can be useful to implement familiar NumPy array operations using the subscript notation. Anything that involves combinations of multiplying and summing axes can be written using `einsum`.

Let `A` and `B` be two 1D arrays with the same length. For example, `A = np.arange(10)` and `B = np.arange(5, 15)`.

- The sum of `A` can be written: `np.einsum("i->", A)`
- Element-wise multiplication, `A * B`, can be written: `np.einsum("i,i->i", A, B)`
- The inner product or dot product, `np.inner(A, B)` or `np.dot(A, B)`, can be written: `np.einsum("i,i->", A, B)  # or just use "i,i"`
- The outer product, `np.outer(A, B)`, can be written: `np.einsum("i,j->ij", A, B)`
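A quick way to convince yourself of the 1D equivalences listed above is to compare each `einsum` call with its conventional counterpart. The `checks` dictionary is just an illustrative way of collecting the comparisons:

```python
import numpy as np

A = np.arange(10)
B = np.arange(5, 15)

checks = {
    "sum": np.einsum("i->", A) == A.sum(),
    "elementwise": np.array_equal(np.einsum("i,i->i", A, B), A * B),
    "inner": np.einsum("i,i->", A, B) == np.inner(A, B),
    "inner_implicit": np.einsum("i,i", A, B) == np.dot(A, B),
    "outer": np.array_equal(np.einsum("i,j->ij", A, B), np.outer(A, B)),
}
print(checks)
```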

For 2D arrays, `C` and `D`, provided that the axes are compatible lengths (both the same length or one of them has length 1), here are a few examples:

- The trace of `C` (sum of main diagonal), `np.trace(C)`, can be written: `np.einsum("ii", C)`
- Element-wise multiplication of `C` and the transpose of `D`, `C * D.T`, can be written: `np.einsum("ij,ji->ij", C, D)`
- Multiplying each element of `C` by the array `D` (to make a 4D array), `C[:, :, None, None] * D`, can be written: `np.einsum("ij,kl->ijkl", C, D)`
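The 2D cases can be checked the same way. The arrays `C` and `D` below are arbitrary 3x3 examples chosen for illustration:

```python
import numpy as np

C = np.arange(9).reshape(3, 3)
D = np.arange(9, 18).reshape(3, 3)

# Repeated label on one operand walks the diagonal and sums it
trace_ok = np.einsum("ii", C) == np.trace(C)

# Swapping the labels on D reads it as its transpose
transpose_ok = np.array_equal(np.einsum("ij,ji->ij", C, D), C * D.T)

# Four distinct labels, all kept: every element of C times the whole of D
four_d = np.einsum("ij,kl->ijkl", C, D)
four_d_ok = np.array_equal(four_d, C[:, :, None, None] * D)

print(trace_ok, transpose_ok, four_d.shape, four_d_ok)
```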

Before answering this question we need to understand a few base terms; skip any of these that you already know.

Generators are objects that allow us to suspend the execution of a Python function. User-defined generators are implemented using the keyword **yield**. By creating a normal function containing the `yield` keyword, we turn that function into a generator:

```
>>> def test():
...     yield 1
...     yield 2
...
>>> gen = test()
>>> next(gen)
1
>>> next(gen)
2
>>> next(gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
```

As you can see, calling `next()` on the generator causes the interpreter to load `test`'s frame and return the `yield`ed value. Calling `next()` again causes the frame to load again into the interpreter stack and continue on to `yield` another value.

By the third time `next()` is called, our generator has finished, and `StopIteration` is thrown.

A less-known feature of generators is the fact that you can communicate with them using two methods: `send()` and `throw()`.

```
>>> def test():
...     val = yield 1
...     print(val)
...     yield 2
...     yield 3
...
>>> gen = test()
>>> next(gen)
1
>>> gen.send("abc")
abc
2
>>> gen.throw(Exception())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in test
Exception
```

Upon calling `gen.send()`, the value is passed as the return value of the `yield` expression.

`gen.throw()`, on the other hand, allows throwing exceptions inside generators, with the exception raised at the same spot where `yield` was called.

Returning a value from a generator results in the value being put inside the `StopIteration` exception. We can later recover the value from the exception and use it as needed.

```
>>> def test():
...     yield 1
...     return "abc"
...
>>> gen = test()
>>> next(gen)
1
>>> try:
...     next(gen)
... except StopIteration as exc:
...     print(exc.value)
...
abc
```

**`yield from`**

Python 3.3 came with the addition of a new keyword: `yield from`. What that keyword allows us to do is pass on any `next()`, `send()` and `throw()` into an inner-most nested generator. If the inner generator returns a value, it is also the return value of `yield from`:

```
>>> def inner():
...     inner_result = yield 2
...     print("inner", inner_result)
...     return 3
...
>>> def outer():
...     yield 1
...     val = yield from inner()
...     print("outer", val)
...     yield 4
...
>>> gen = outer()
>>> next(gen)
1
>>> next(gen)  # Goes inside inner() automatically
2
>>> gen.send("abc")
inner abc
outer 3
4
```

I've written an article to further elaborate on this topic.

Upon introducing the new keyword `yield from` in Python 3.3, we were now able to create generators inside generators that, just like a tunnel, pass the data back and forth from the inner-most to the outer-most generators. This has spawned a new meaning for generators - *coroutines*.

**Coroutines** are functions that can be stopped and resumed while being run. In Python, they are defined using the **async def** keyword. Much like generators, they too use their own form of `yield from`, which is `await`. Before `async` and `await` were introduced in Python 3.5, we created coroutines in the exact same way generators were created (with `yield from` instead of `await`).

```
async def inner():
    return 1

async def outer():
    await inner()
```

Like every iterator or generator that implements the `__iter__()` method, coroutines implement `__await__()`, which allows them to continue on every time `await coro` is called.
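To see the kinship with generators, a coroutine object can be driven by hand with `send()`, without any event loop. This is only a sketch for illustration (real code would use `asyncio.run`); `outer` here is a variant of the function above that returns a value so there is something to observe:

```python
async def inner():
    return 1


async def outer():
    value = await inner()
    return value + 1


coro = outer()          # calling an async function only creates the coroutine
try:
    coro.send(None)     # start it, exactly like priming a generator
except StopIteration as exc:
    result = exc.value  # the return value travels inside StopIteration
print(result)
```

Because nothing in this chain ever suspends, the first `send(None)` runs both coroutines to completion.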

There's a nice sequence diagram inside the Python docs that you should check out.

In asyncio, apart from coroutine functions, we have 2 important objects: **tasks** and **futures**.

Futures are objects that have the `__await__()` method implemented, and their job is to hold a certain state and result. The state can be one of the following:

- PENDING - future does not have any result or exception set.
- CANCELLED - future was cancelled using `fut.cancel()`
- FINISHED - future was finished, either by a result set using `fut.set_result()` or by an exception set using `fut.set_exception()`

The result, just like you have guessed, can either be a Python object that will be returned, or an exception which may be raised.

Another **important** feature of `future` objects is that they contain a method called **add_done_callback()**. This method allows functions to be called as soon as the task is done - whether it raised an exception or finished.

Task objects are special futures, which wrap around coroutines and communicate with the inner-most and outer-most coroutines. Every time a coroutine `await`s a future, the future is passed all the way back to the task (just like in `yield from`), and the task receives it.

Next, the task binds itself to the future. It does so by calling `add_done_callback()` on the future. From now on, if the future is ever done, by either being cancelled, passed an exception or passed a Python object as a result, the task's callback will be called, and it will rise back up to existence.
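The interplay between a future, its done-callbacks and a task can be observed with the public asyncio API. This is a minimal sketch, assuming Python 3.7 or later; the `events` list and the function names are made up for illustration:

```python
import asyncio

events = []


async def waiter(fut):
    events.append("task: awaiting future")
    value = await fut                  # the task suspends here
    events.append(f"task: resumed with {value}")
    return value


async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()         # starts in the PENDING state
    fut.add_done_callback(lambda f: events.append("callback: future done"))

    task = asyncio.create_task(waiter(fut))
    await asyncio.sleep(0)             # let the task run until it awaits fut
    fut.set_result("abc")              # future finishes; callbacks get scheduled
    return await task


result = asyncio.run(main())
print(result)
print(events)
```

The task's own wake-up is registered on the future in the same way as the explicit callback here, which is why setting the result is enough to resume `waiter`.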

The final burning question we must answer is - how is the IO implemented?

Deep inside asyncio, we have an event loop. An event loop of tasks. The event loop's job is to call tasks every time they are ready and coordinate all that effort into one single working machine.

The IO part of the event loop is built upon a single crucial function called **select**. Select is a blocking function, implemented by the operating system underneath, that allows waiting on sockets for incoming or outgoing data. Upon receiving data it wakes up, and returns the sockets which received data, or the sockets which are ready for writing.

When you try to receive or send data over a socket through asyncio, what actually happens below is that the socket is first checked for any data that can be immediately read or sent. If its `.send()` buffer is full, or the `.recv()` buffer is empty, the socket is registered to the `select` function (by simply adding it to one of the lists, `rlist` for `recv` and `wlist` for `send`) and the appropriate function `await`s a newly created `future` object, tied to that socket.

When all available tasks are waiting for futures, the event loop calls `select` and waits. When one of the sockets has incoming data, or its `send` buffer has drained, asyncio checks for the future object tied to that socket, and sets it to done.

Now all the magic happens. The future is set to done, the task that added itself before with `add_done_callback()` rises back to life, and calls `.send()` on the coroutine, which resumes the inner-most coroutine (because of the `await` chain), and you read the newly received data from a nearby buffer it was spilled into.

**Method chain again, in case of recv():**

- `select.select` waits.
- A ready socket, with data, is returned.
- Data from the socket is moved into a buffer.
- `future.set_result()` is called.
- Task that added itself with `add_done_callback()` is now woken up.
- Task calls `.send()` on the coroutine, which goes all the way into the inner-most coroutine and wakes it up.
- Data is being read from the buffer and returned to our humble user.
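The first two steps of that chain can be tried in isolation with the standard library. This sketch uses `socket.socketpair()` as a stand-in for a real network connection and calls `select.select` directly; it shows only the readiness check, not asyncio's own internals:

```python
import select
import socket

# A connected pair of sockets stands in for a real network connection
reader, writer = socket.socketpair()

# Nothing has been sent yet, so with a zero timeout no socket is readable
rlist, _, _ = select.select([reader], [], [], 0)
ready_before = list(rlist)

writer.send(b"hello")

# Now select reports the reader as ready, so recv() will not block
rlist, _, _ = select.select([reader], [], [], 1.0)
ready_after = list(rlist)
data = reader.recv(1024)

reader.close()
writer.close()
print(ready_before, data)
```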

In summary, asyncio uses generator capabilities that allow pausing and resuming functions. It uses `yield from` capabilities that allow passing data back and forth from the inner-most generator to the outer-most. It uses all of those in order to halt function execution while it's waiting for IO to complete (by using the OS `select` function).

And the best of all? While one function is paused, another may run and interleave with the delicate fabric, which is asyncio.

If your main goal is to visualize the correlation matrix, rather than creating a plot per se, the convenient `pandas` styling options are a viable built-in solution:

```
import pandas as pd
import numpy as np
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
corr = df.corr()
corr.style.background_gradient(cmap="coolwarm")
# "RdBu_r", "BrBG_r", & PuOr_r are other good diverging colormaps
```

Note that this needs to be in a backend that supports rendering HTML, such as the JupyterLab Notebook.

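You can easily limit the digit precision (note that `set_precision` was deprecated in pandas 1.3 and removed in 2.0, where `.format(precision=2)` is the replacement):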

```
corr.style.background_gradient(cmap="coolwarm").set_precision(2)
```

Or get rid of the digits altogether if you prefer the matrix without annotations:

```
corr.style.background_gradient(cmap="coolwarm").set_properties(**{"font-size": "0pt"})
```

The styling documentation also includes instructions for more advanced styles, such as how to change the display of the cell the mouse pointer is hovering over.

In my testing, `style.background_gradient()` was 4x faster than `plt.matshow()` and 120x faster than `sns.heatmap()` with a 10x10 matrix. Unfortunately it doesn't scale as well as `plt.matshow()`: the two take about the same time for a 100x100 matrix, and `plt.matshow()` is 10x faster for a 1000x1000 matrix.

There are a few possible ways to save the stylized dataframe:

- Return the HTML by appending the `render()` method (`to_html()` on recent pandas versions, where `render()` has been removed) and then write the output to a file.
- Save as an `.xlsx` file with conditional formatting by appending the `to_excel()` method.
- Combine with imgkit to save a bitmap.
- Take a screenshot (like I have done here).

By setting `axis=None`, it is now possible to compute the colors based on the entire matrix rather than per column or per row:

```
corr.style.background_gradient(cmap="coolwarm", axis=None)
```

Since many people are reading this answer I thought I would add a tip for how to only show one corner of the correlation matrix. I find this easier to read myself, since it removes the redundant information.

```
# Fill diagonal and upper half with NaNs
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan
(corr
.style
.background_gradient(cmap="coolwarm", axis=None, vmin=-1, vmax=1)
.highlight_null(null_color="#f1f1f1") # Color NaNs grey
.set_precision(2))
```

Without actual data it is hard to answer the question but I guess you are looking for something like this:

```
Top15["Citable docs per Capita"].corr(Top15["Energy Supply per Capita"])
```

That calculates the correlation between your two columns `"Citable docs per Capita"` and `"Energy Supply per Capita"`.

To give an example:

```
import pandas as pd
df = pd.DataFrame({"A": range(4), "B": [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6
```

Then

```
df["A"].corr(df["B"])
```

gives `1`

as expected.

Now, if you change a value, e.g.

```
df.loc[2, "B"] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0
```

the command

```
df["A"].corr(df["B"])
```

returns

```
0.99586
```

which is still close to 1, as expected.
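For reference, the same number can be reproduced by hand from the definition of the Pearson correlation coefficient. This sketch uses plain NumPy arrays holding the two columns above; the names `a`, `b`, `da` and `db` are illustrative:

```python
import numpy as np

a = np.array([0, 1, 2, 3], dtype=float)
b = np.array([0, 2, 4.5, 6], dtype=float)

# Pearson r: sum of products of the centred values,
# divided by the product of their Euclidean norms
da, db = a - a.mean(), b - b.mean()
r = (da * db).sum() / np.sqrt((da**2).sum() * (db**2).sum())

print(round(r, 5))
```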

If you apply `.corr` directly to your dataframe, it will return all pairwise correlations between your columns; that's why you then observe `1s` on the diagonal of your matrix (each column is perfectly correlated with itself).

```
df.corr()
```

will therefore return

```
          A         B
A  1.000000  0.995862
B  0.995862  1.000000
```

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get `NaN`s in your solution - check this post for an example.

If you want to filter entries above/below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer and if you then run into the issue with overlapping axis-labels check the following post.

When objects are instantiated, the object itself is passed into the self parameter.

Because of this, the object's data is bound to the object. Below is an example of how you might like to visualize what each object's data might look like. Notice how 'self' is replaced with the object's name. I'm not saying this example diagram below is wholly accurate, but it will hopefully serve a purpose in visualizing the use of self.

The object is passed into the self parameter so that the object can keep hold of its own data.

Although this may not be wholly accurate, think of the process of instantiating an object like this: when an object is made, it uses the class as a template for its own data and methods. Without passing its own name into the self parameter, the attributes and methods in the class would remain as a general template and would not be referenced to (belong to) the object. So by passing the object's name into the self parameter, it means that if 100 objects are instantiated from the one class, they can all keep track of their own data and methods.
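A small sketch of the same idea in code. The `Counter` class is a made-up example; it shows that each instance keeps its own data, and that `obj.method()` is the same call as `Class.method(obj)` with the object passed in as `self`:

```python
class Counter:
    def __init__(self, name):
        # `self` is the new object, so these attributes belong to it alone
        self.name = name
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count


a = Counter("a")
b = Counter("b")

a.increment()
a.increment()
Counter.increment(b)   # the same call as b.increment(), with self passed explicitly

print(a.count, b.count)
```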

See the illustration below:

This is kind of overkill but let's give it a go. First let's use statsmodels to find out what the p-values should be:

```
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
```

and we get

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
```

Ok, let's reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using matrix algebra. But what the heck.

```
lm = LinearRegression()
lm.fit(X, y)
params = np.append(lm.intercept_, lm.coef_)
predictions = lm.predict(X)

# Design matrix with a leading column of ones for the intercept
newX = pd.DataFrame({"Constant": np.ones(len(X))}).join(pd.DataFrame(X))
# Residual degrees of freedom: observations minus estimated parameters
dof = len(newX) - len(newX.columns)
MSE = sum((y - predictions)**2) / dof

# Note: if you don't want to use a DataFrame, replace the newX and dof lines with
# newX = np.append(np.ones((len(X), 1)), X, axis=1)
# dof = len(newX) - len(newX[0])

var_b = MSE * (np.linalg.inv(np.dot(newX.T, newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params / sd_b

# Two-sided p-values from the t distribution
# (newX[0] on a DataFrame would select a column, not a row, so use dof here)
p_values = [2 * (1 - stats.t.cdf(np.abs(i), dof)) for i in ts_b]

sd_b = np.round(sd_b, 3)
ts_b = np.round(ts_b, 3)
p_values = np.round(p_values, 3)
params = np.round(params, 4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"], myDF3["Standard Errors"], myDF3["t values"], myDF3["Probabilities"] = [params, sd_b, ts_b, p_values]
print(myDF3)
```

And this gives us.

```
    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061          0.000
1       -10.0122           59.749    -0.168          0.867
2      -239.8191           61.222    -3.917          0.000
3       519.8398           66.534     7.813          0.000
4       324.3904           65.422     4.958          0.000
5      -792.1842          416.684    -1.901          0.058
6       476.7458          339.035     1.406          0.160
7       101.0446          212.533     0.475          0.635
8       177.0642          161.476     1.097          0.273
9       751.2793          171.902     4.370          0.000
10       67.6254           65.984     1.025          0.306
```

So we can reproduce the values from statsmodels.

You said you couldn’t get the golden spiral method to work and that’s a shame because it’s really, really good. I would like to give you a complete understanding of it so that maybe you can understand how to keep this away from being “bunched up.”

So here’s a fast, non-random way to create a lattice that is approximately correct; as discussed above, no lattice will be perfect, but this may be good enough. It is compared to other methods e.g. at BendWavy.org but it just has a nice and pretty look as well as a guarantee about even spacing in the limit.

To understand this algorithm, I first invite you to look at the 2D sunflower spiral algorithm. This is based on the fact that the most irrational number is the golden ratio `(1 + sqrt(5))/2` and if one emits points by the approach “stand at the center, turn a golden ratio of whole turns, then emit another point in that direction,” one naturally constructs a spiral which, as you get to higher and higher numbers of points, nevertheless refuses to have well-defined ‘bars’ that the points line up on. (Note 1.)

The algorithm for even spacing on a disk is,

```
from numpy import pi, cos, sin, sqrt, arange
import matplotlib.pyplot as pp
num_pts = 100
indices = arange(0, num_pts, dtype=float) + 0.5
r = sqrt(indices/num_pts)
theta = pi * (1 + 5**0.5) * indices
pp.scatter(r*cos(theta), r*sin(theta))
pp.show()
```

and it produces results that look like (n=100 and n=1000):

The key strange thing is the formula `r = sqrt(indices / num_pts)`; how did I come to that one? (Note 2.)

Well, I am using the square root here because I want these to have even-area spacing around the disk. That is the same as saying that in the limit of large *N* I want a little region *R* ∈ (*r*, *r* + d*r*), *Θ* ∈ (*θ*, *θ* + d*θ*) to contain a number of points proportional to its area, which is *r* d*r* d*θ*. Now if we pretend that we are talking about a random variable here, this has a straightforward interpretation as saying that the joint probability density for (*R*, *Θ*) is just *c r* for some constant *c*. Normalization on the unit disk would then force *c* = 1/π.

Now let me introduce a trick. It comes from probability theory where it’s known as sampling the inverse CDF: suppose you wanted to *generate* a random variable with a probability density *f*(*z*) and you have a random variable *U* ~ Uniform(0, 1), just like comes out of `random()` in most programming languages. How do you do this?

- First, turn your density into a cumulative distribution function or CDF, which we will call *F*(*z*). A CDF, remember, increases monotonically from 0 to 1 with derivative *f*(*z*).
- Then calculate the CDF’s inverse function *F*^{-1}(*z*).
- You will find that *Z* = *F*^{-1}(*U*) is distributed according to the target density.^{(Note 3.)}

Now the golden-ratio spiral trick spaces the points out in a nicely even pattern for *θ* so let’s integrate that out; for the unit disk we are left with *F*(*r*) = *r*^{2}. So the inverse function is *F*^{-1}(*u*) = *u*^{1/2}, and therefore we would generate random points on the disk in polar coordinates with `r = sqrt(random()); theta = 2 * pi * random()`.
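As a rough check of the inverse-CDF recipe for the disk, here is a minimal sketch using only the standard library. The helper name `random_point_on_disk`, the seed, and the sample size are illustrative choices, not part of the original answer. If the points are uniform by area, about a quarter of them should land inside radius 0.5, because that inner disk has a quarter of the area.

```python
import math
import random


def random_point_on_disk(rng):
    """Draw one point uniformly (by area) from the unit disk."""
    r = math.sqrt(rng.random())          # inverse CDF: F^{-1}(u) = sqrt(u)
    theta = 2 * math.pi * rng.random()   # the angle is uniform on [0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)


rng = random.Random(0)  # fixed seed so the run is repeatable
points = [random_point_on_disk(rng) for _ in range(100_000)]

# Count the points that fall inside the inner disk of radius 0.5.
inner = sum(1 for x, y in points if math.hypot(x, y) < 0.5)
fraction_inner = inner / len(points)
print(fraction_inner)  # expected to be close to 0.25
```

Dropping the `sqrt` would give a visibly different fraction (close to 0.5), which is the "bunched up at the center" effect the square root is there to remove.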

Now instead of *randomly* sampling this inverse function we’re *uniformly* sampling it, and the nice thing about uniform sampling is that our results about how points are spread out in the limit of large *N* will behave as if we had randomly sampled it. This combination is the trick. Instead of `random()` we use `(arange(0, num_pts, dtype=float) + 0.5)/num_pts`, so that, say, if we want to sample 10 points they are `r = 0.05, 0.15, 0.25, ... 0.95`. We uniformly sample *r* to get equal-area spacing, and we use the sunflower increment to avoid awful “bars” of points in the output.
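To make the "uniform instead of random" substitution concrete, here is a small sketch in plain Python that mirrors the `arange` expression (the variable names are illustrative). The evenly spaced values are the inputs that replace `random()`; the radii are their square roots. For 100 points, exactly 25 land inside radius 0.5, matching the quarter of the area that inner disk covers.

```python
import math

num_pts = 100

# Midpoints of num_pts equal sub-intervals of (0, 1): 0.005, 0.015, ..., 0.995.
u = [(k + 0.5) / num_pts for k in range(num_pts)]

# Push the evenly spaced values through the inverse CDF to get the radii.
r = [math.sqrt(value) for value in u]

inner_count = sum(1 for radius in r if radius < 0.5)
print(inner_count)
```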

The changes that we need to make to dot the sphere with points merely involve switching out the polar coordinates for spherical coordinates. The radial coordinate of course doesn't enter into this because we're on a unit sphere. To keep things a little more consistent here, even though I was trained as a physicist I'll use mathematicians' coordinates where 0 ≤ *φ* ≤ π is latitude coming down from the pole and 0 ≤ *θ* ≤ 2π is longitude. So the difference from above is that we are basically replacing the variable *r* with *φ*.

Our area element, which was *r* d*r* d*θ*, now becomes the not-much-more-complicated sin(*φ*) d*φ* d*θ*. So our joint density for uniform spacing is sin(*φ*)/4π. Integrating out *θ*, we find *f*(*φ*) = sin(*φ*)/2, thus *F*(*φ*) = (1 − cos(*φ*))/2. Inverting this we can see that a uniform random variable would look like acos(1 − 2 *u*), but we sample uniformly instead of randomly, so we instead use *φ*_{k} = acos(1 − 2 (*k* + 0.5)/*N*). And the rest of the algorithm is just projecting this onto the x, y, and z coordinates:

```
from numpy import pi, cos, sin, arccos, arange
import mpl_toolkits.mplot3d
import matplotlib.pyplot as pp
num_pts = 1000
indices = arange(0, num_pts, dtype=float) + 0.5
phi = arccos(1 - 2*indices/num_pts)
theta = pi * (1 + 5**0.5) * indices
x, y, z = cos(theta) * sin(phi), sin(theta) * sin(phi), cos(phi);
pp.figure().add_subplot(111, projection="3d").scatter(x, y, z);
pp.show()
```
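For readers without matplotlib at hand, the same formulas can be checked numerically. This is a sketch with a hypothetical helper name, `golden_spiral_sphere`, using only the `math` module. It confirms that every point lies on the unit sphere and that the cap with z > 0.5, which covers a quarter of the sphere's area, receives a quarter of the points.

```python
import math


def golden_spiral_sphere(num_pts):
    """Return num_pts (x, y, z) tuples built from the formulas above."""
    points = []
    for k in range(num_pts):
        index = k + 0.5
        phi = math.acos(1 - 2 * index / num_pts)   # angle down from the pole
        theta = math.pi * (1 + 5 ** 0.5) * index   # golden-ratio turn per point
        points.append((math.cos(theta) * math.sin(phi),
                       math.sin(theta) * math.sin(phi),
                       math.cos(phi)))
    return points


points = golden_spiral_sphere(1000)

# Largest deviation of any point from unit length.
max_norm_error = max(abs(math.sqrt(x * x + y * y + z * z) - 1)
                     for x, y, z in points)

# Number of points in the cap z > 0.5 (a quarter of the sphere's area).
cap_count = sum(1 for _, _, z in points if z > 0.5)
print(max_norm_error, cap_count)
```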

Again for n=100 and n=1000 the results look like:

I wanted to give a shout out to Martin Roberts’s blog. Note that above I created an offset of my indices by adding 0.5 to each index. This was just visually appealing to me, but it turns out that the choice of offset matters a lot and is not constant over the interval and can mean getting as much as 8% better accuracy in packing if chosen correctly. There should also be a way to get his R_{2} sequence to cover a sphere and it would be interesting to see if this also produced a nice even covering, perhaps as-is but perhaps needing to be, say, taken from only a half of the unit square cut diagonally or so and stretched around to get a circle.

Note 1. Those “bars” are formed by rational approximations to a number, and the best rational approximations to a number come from its continued fraction expression, `z + 1/(n_1 + 1/(n_2 + 1/(n_3 + ...)))` where `z` is an integer and `n_1, n_2, n_3, ...` is either a finite or infinite sequence of positive integers: `def continued_fraction(r): while r != 0: n = floor(r) yield n r = 1/(r - n)`

Since the fraction part `1/(...)` is always between zero and one, a large integer in the continued fraction allows for a particularly good rational approximation: “one divided by something between 100 and 101” is better than “one divided by something between 1 and 2.” The most irrational number is therefore the one which is `1 + 1/(1 + 1/(1 + ...))` and has no particularly good rational approximations; one can solve *φ* = 1 + 1/*φ* by multiplying through by *φ* to get the formula for the golden ratio.

Note 2. For folks who are not so familiar with NumPy -- all of the functions are “vectorized,” so that `sqrt(array)` is the same as what other languages might write `map(sqrt, array)`. So this is a component-by-component `sqrt` application. The same also holds for division by a scalar or addition with scalars -- those apply to all components in parallel.

Note 3. The proof is simple once you know that this is the result. If you ask what's the probability that *z* < *Z* < *z* + d*z*, this is the same as asking what's the probability that *z* < *F*^{-1}(*U*) < *z* + d*z*, apply *F* to all three expressions noting that it is a monotonically increasing function, hence *F*(*z*) < *U* < *F*(*z* + d*z*), expand the right hand side out to find *F*(*z*) + *f*(*z*) d*z*, and since *U* is uniform this probability is just *f*(*z*) d*z* as promised.
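The continued-fraction generator from Note 1 can be written out as runnable Python. This is a sketch rather than the author's exact code: it adds one guard that is my own assumption, stopping when the remainder is exactly zero, since `1/(r - n)` would otherwise divide by zero for an integer or exact rational input. With floats, only the first several terms are trustworthy, so the example takes a short prefix.

```python
from fractions import Fraction
from itertools import islice
from math import floor, pi, sqrt


def continued_fraction(r):
    """Yield the continued-fraction terms of r, stopping on an exact remainder."""
    while True:
        n = floor(r)
        yield n
        remainder = r - n
        if remainder == 0:   # guard: avoids dividing by zero for rationals
            return
        r = 1 / remainder


golden = (1 + sqrt(5)) / 2
golden_terms = list(islice(continued_fraction(golden), 10))
pi_terms = list(islice(continued_fraction(pi), 5))
rational_terms = list(continued_fraction(Fraction(415, 93)))

print(golden_terms)    # all ones: no term is large, so no good approximation
print(pi_terms)        # the 292 is why 355/113 approximates pi so well
print(rational_terms)  # a rational number has a finite expansion
```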

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
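The set-based version itself is not reproduced here, so the following is only a guess at its general shape: split the text into words and keep those that are not in a `set` of banned words. The function name `remove_banned`, the word list, and the `\w+` tokenization are all assumptions for illustration. Note that this simple version discards punctuation.

```python
import re

# Illustrative banned words (not a real dataset).
banned_words = {"foobar", "foobah", "fooxar", "foozap", "fooza"}


def remove_banned(sentence, banned):
    """Drop every word whose lowercase form is in the banned set."""
    words = re.findall(r"\w+", sentence)  # split into alphanumeric runs
    return " ".join(word for word in words if word.lower() not in banned)


cleaned = remove_banned("A Foobar walked past the fooza and the foo", banned_words)
print(cleaned)
```

Each membership test against a `set` is a hash lookup, so the cost per word does not grow with the number of banned words.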

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.

```
["foobar", "foobah", "fooxar", "foozap", "fooza"]
```

The list is converted to a trie:

```
{
    "f": {
        "o": {
            "o": {
                "x": {
                    "a": {
                        "r": {
                            "": 1
                        }
                    }
                },
                "b": {
                    "a": {
                        "r": {
                            "": 1
                        },
                        "h": {
                            "": 1
                        }
                    }
                },
                "z": {
                    "a": {
                        "": 1,
                        "p": {
                            "": 1
                        }
                    }
                }
            }
        }
    }
}
```

And then to this regex pattern:

```
r"foo(?:ba[hr]|xar|zap?)"
```

The huge advantage is that to test if `zoo` matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's a preprocess overkill for 5 words, but it shows promising results for many thousand words.
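A quick way to convince yourself that the compact pattern is equivalent to the word list is to test it with `re.fullmatch`. The extra non-matching strings below are illustrative additions.

```python
import re

words = ["foobar", "foobah", "fooxar", "foozap", "fooza"]
pattern = re.compile(r"foo(?:ba[hr]|xar|zap?)")

# Every word from the list should match in full.
matched = [word for word in words if pattern.fullmatch(word)]

# Near misses and unrelated strings should not.
rejected = [word for word in ["zoo", "foo", "foobaz", "fooz"]
            if pattern.fullmatch(word)]

print(matched)
print(rejected)
```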

Note that `(?:)` non-capturing groups are used because:

- `foobar|baz` would match `foobar` or `baz`, but not `foobaz`
- `foo(bar|baz)` would save unneeded information to a capturing group.
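The difference between the three spellings can be seen directly with the `re` module. This is a small illustration; the variable names are arbitrary.

```python
import re

# Without a group, the alternation splits the whole pattern in two.
flat = re.compile(r"foobar|baz")
# A capturing group fixes the scope but also records the matched suffix.
capturing = re.compile(r"foo(bar|baz)")
# A non-capturing group fixes the scope and records nothing.
non_capturing = re.compile(r"foo(?:bar|baz)")

flat_result = flat.fullmatch("foobaz")                        # no match at all
captured = capturing.fullmatch("foobaz").groups()             # the stored suffix
not_captured = non_capturing.fullmatch("foobaz").groups()     # nothing stored

print(flat_result, captured, not_captured)
print(capturing.groups, non_capturing.groups)  # number of capturing groups
```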

Here's a slightly modified gist, which we can use as a `trie.py` library:

```
import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[""] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append("[" + "".join(cc) + "]")

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
```

Here's a small test (the same as this one):

```
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open("/usr/share/dict/american-english") as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    # Word boundaries on both sides so only whole words match.
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
```

It outputs:

```
TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms
TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms
TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms
TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms
TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
```

For info, the regex begins like this:

(?:a(?:(?:'s|a(?:'s|chen|liyah(?:'s)?|r(?:dvark(?:(?:'s|s))?|on))|b(?:'s|a(?:c(?:us(?:(?:'s|es))?|[ik])|ft|lone(?:(?:'s|s))?|ndon(?:(?:ed|ing|ment(?:'s)?|s))?|s(?:e(?:(?:ment(?:'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:'s)?|[ds]))?|ing|toir(?:(?:'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:'s|es))?|y(?:(?:'s|s))?)|ot(?:(?:'s|t(?:'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:'s|s))?))|y(?:'s)?|é(?:(?:'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:'s|s))?))|om(?:en(?:(?:'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:'s|s))?)|or(?:(?:'s|s))?|s))?|l(?:'s)?))|e(?:(?:'s|am|l(?:(?:'s|ard|son(?:'s)?))?|r(?:deen(?:'s)?|nathy(?:'s)?|ra(?:nt|tion(?:(?:'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:'s|s))?|d)|ing|or(?:(?:'s|s))?)|s))?|yance(?:'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:'s)?)|gail|l(?:ene|it(?:ies|y(?:'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:'s|s))?|y)|m's|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:'s)?))|r(?:'s)?)|ormal(?:(?:it(?:ies|y(?:'s)?)|ly))?)|o(?:ard|de(?:(?:'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:'s|ist(?:(?:'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:'s|s))?)))|r(?:igin(?:al(?:(?:'s|s))?|e(?:(?:'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:'s|ist(?:(?:'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:'s|board))?)|r(?:a(?:cadabra(?:'s)?|d(?:e[ds]?|ing)|ham(?:'s)?|m(?:(?:'s|s))?|si(?:on(?:(?:'s|s))?|ve(?:(?:'s|ly|ness(?:'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:'s|s))?|[ds]))?|ing|ment(?:(?:'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:'s)?))?)|s(?:alom|c(?:ess(?:(?:'s|e[ds]|ing))?|issa(?:(?:'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:'s|s))?|t(?:(?:e(?:e(?:(?:'s|ism(?:'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:'s|e(?:'s)?))?|o(?:l(?:ut(?:e(?:(?:'s|ly|st?))?|i(?:on(?:'s)?|sm(?:'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:'s)?|t(?:(?:'s|s))?)|d)|ing|s))?|pti...

It's really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz `twopi`:
