Using the sympy.Matrix().diagonalize() method, we can diagonalize a matrix. diagonalize() returns a tuple (P, D), where D is diagonal and M = P*D*P**-1.
Syntax: Matrix().diagonalize()
Returns: a tuple of matrices (P, D), where the second element, D, is the diagonal matrix.
Example # 1:

Output:
Matrix: Matrix([[3, -2, 4, -2], [5, 3, -3, -2], [5, -2, 2, -2], [5, -2, -3, 3]])
Diagonal of a matrix: Matrix([[-2, 0, 0, 0], [0, 3, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]])
Example # 2:

Output:
Matrix: Matrix([[1, -3, 3], [3, -5, 3], [6, -6, 4]])
Diagonal of a matrix: Matrix([[-2, 0, 0], [0, -2, 0], [0, 0, 4]])
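The example code itself did not survive in this copy, so the following is only a minimal sketch of how the second output could be produced, assuming the input matrix is the one printed above. It also checks the defining property M = P*D*P**-1.

```python
from sympy import Matrix

# Input matrix taken from the printed output of Example # 2
M = Matrix([[1, -3, 3], [3, -5, 3], [6, -6, 4]])

# diagonalize() returns (P, D): P holds eigenvectors as columns, D is diagonal
P, D = M.diagonalize()

print("Matrix:", M)
print("Diagonal of a matrix:", D)

# The factorisation should reproduce the original matrix exactly
print(P * D * P.inv() == M)
```

The order of the eigenvalues along the diagonal of D is chosen by SymPy, so only the set of diagonal entries should be relied on.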
A colleague is looking to generate UML class diagrams from heaps of Python source code. He's primarily interested in the inheritance relationships, and mildly interested in compositional relationships, and doesn't care much about class attributes that are just Python primitives.
The source code is pretty straightforward and not tremendously evil: it doesn't do any fancy metaclass magic, for example. (It's mostly from the days of Python 1.5.2, with some sprinklings of "modern" 2.3ish stuff.)
What's the best existing solution to recommend?
I noticed that every now and then I need to Google fopen all over again, just to build a mental image of what the primary differences between the modes are. So, I thought a diagram will be faster to read next time. Maybe someone else will find that helpful too.
Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.
UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.
In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any Unicode code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.
The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code: it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.
Unicode strings can be declared in your code using the u prefix to strings. E.g.
>>> my_u = u"my ünicôdé strįng"
>>> type(my_u)
<type 'unicode'>
Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.
Conversion from str to Unicode can happen even when you don't explicitly call unicode().
The following scenarios cause UnicodeDecodeError exceptions:
# Explicit conversion without encoding
unicode("€")
# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("€")
# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "€"
# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "€"
In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value; it's no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:
In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
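Since the diagrams are not reproduced here, the same two situations can be sketched in code. This is an illustration written for Python 3 (where the byte type is bytes rather than the Python 2 str discussed above), and the variable names are only for this example.

```python
word = "caf\u00e9"  # the word "café"; é is U+00E9

# The same text produces different bytes depending on the encoding
utf8_bytes = word.encode("utf-8")      # é becomes two bytes: 0xC3 0xA9
cp1252_bytes = word.encode("cp1252")   # é becomes one byte: 0xE9

# Decoding with the matching encoding round-trips cleanly
decoded_utf8 = utf8_bytes.decode("utf-8")
decoded_cp1252 = cp1252_bytes.decode("cp1252")

# Decoding as ASCII fails because of the bytes above 0x7F
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print(exc)
```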
It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
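A minimal sketch of the sandwich, written for Python 3 where the encoded type is bytes. The function name shout and the choice of cp1252 are assumptions made purely for illustration.

```python
def shout(raw: bytes, encoding: str) -> bytes:
    text = raw.decode(encoding)       # decode as soon as the data arrives
    text = text.upper() + "!"         # work only with Unicode text in the middle
    return text.encode(encoding)      # encode again only on the way out

incoming = "z\u00fcrich".encode("cp1252")   # pretend this came from a file or socket
result = shout(incoming, "cp1252")
print(result.decode("cp1252"))
```

The middle of the function never needs to know which encoding was used at the edges, which is the point of the pattern.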
If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.
u"Zürich"
To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:
# encoding: utf-8
This is only necessary when you have non-ASCII in your source code.
Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file; it can't be easily guessed. For example, for a UTF-8 file:
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()
my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.
The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.
Use it like above but pass the opened file to it:
from backports import csv
import io

def read_rows(path="my_utf8_file.txt"):
    # Wrapped in a generator function so that the yield is valid
    with io.open(path, "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row
Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.
MySQL
In the connection string add:
charset="utf8",
use_unicode=True
E.g.
>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
PostgreSQL
Add:
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.
Work with Unicodes as you would normal strs.
print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.
An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.
Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.
The same configuration for reading will allow Unicodes to be written directly.
Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g. the regular str is now a Unicode string and the old str is now bytes.
The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.
Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
Don't use sys.setdefaultencoding("utf8")
It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen.
See Why should we NOT use sys.setdefaultencoding("utf8") in a py script? for further details.
(Note: this answer is based on a short blog post about einsum I wrote a while ago.)
What does einsum do?
Imagine that we have two multi-dimensional arrays, A and B. Now let's suppose we want to multiply A with B in a particular way to create a new array of products; and then maybe sum this new array along particular axes; and then maybe transpose the axes of the new array in a particular order.
There's a good chance that einsum will help us do this faster and more memory-efficiently than combinations of the NumPy functions like multiply, sum and transpose will allow.
How does einsum work?
Here's a simple (but not completely trivial) example. Take the following two arrays:
A = np.array([0, 1, 2])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
We will multiply A and B element-wise and then sum along the rows of the new array. In "normal" NumPy we'd write:
>>> (A[:, np.newaxis] * B).sum(axis=1)
array([ 0, 22, 76])
So here, the indexing operation on A lines up the first axes of the two arrays so that the multiplication can be broadcast. The rows of the array of products are then summed to return the answer.
Now if we wanted to use einsum instead, we could write:
>>> np.einsum("i,ij->i", A, B)
array([ 0, 22, 76])
The signature string "i,ij->i" is the key here and needs a little bit of explaining. You can think of it in two halves. On the left-hand side (left of the ->) we've labelled the two input arrays. To the right of ->, we've labelled the array we want to end up with.
Here is what happens next:
A has one axis; we've labelled it i. And B has two axes; we've labelled axis 0 as i and axis 1 as j.
By repeating the label i in both input arrays, we are telling einsum that these two axes should be multiplied together. In other words, we're multiplying array A with each column of array B, just like A[:, np.newaxis] * B does.
Notice that j does not appear as a label in our desired output; we've just used i (we want to end up with a 1D array). By omitting the label, we're telling einsum to sum along this axis. In other words, we're summing the rows of the products, just like .sum(axis=1) does.
That's basically all you need to know to use einsum. It helps to play about a little; if we leave both labels in the output, "i,ij->ij", we get back a 2D array of products (same as A[:, np.newaxis] * B). If we say no output labels, "i,ij->", we get back a single number (same as doing (A[:, np.newaxis] * B).sum()).
The great thing about einsum, however, is that it does not build a temporary array of products first; it just sums the products as it goes. This can lead to big savings in memory use.
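The three signatures discussed above can be compared side by side with the broadcasting version. This is just a small check using the same A and B as in the example; the variable names are chosen for illustration.

```python
import numpy as np

A = np.array([0, 1, 2])
B = np.arange(12).reshape(3, 4)

# Explicit temporary array of products, shape (3, 4)
products = A[:, np.newaxis] * B

row_sums = np.einsum("i,ij->i", A, B)   # j is omitted from the output, so it is summed
kept = np.einsum("i,ij->ij", A, B)      # both labels kept, nothing is summed
total = np.einsum("i,ij->", A, B)       # no output labels, everything is summed

print(row_sums)
print(np.array_equal(kept, products), total == products.sum())
```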
To explain the dot product, here are two new arrays:
A = np.array([[1, 1, 1],
              [2, 2, 2],
              [5, 5, 5]])
B = np.array([[0, 1, 0],
              [1, 1, 0],
              [1, 1, 1]])
We will compute the dot product using np.einsum("ij,jk->ik", A, B). Here's a picture showing the labelling of A and B and the output array that we get from the function:
You can see that label j is repeated; this means we're multiplying the rows of A with the columns of B. Furthermore, the label j is not included in the output; we're summing these products. Labels i and k are kept for the output, so we get back a 2D array.
It might be even clearer to compare this result with the array where the label j is not summed. Below, on the left you can see the 3D array that results from writing np.einsum("ij,jk->ijk", A, B) (i.e. we've kept label j):
Summing axis j gives the expected dot product, shown on the right.
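As the pictures are not available here, the relationship between the two signatures can be checked numerically instead. A short sketch with the same A and B:

```python
import numpy as np

A = np.array([[1, 1, 1], [2, 2, 2], [5, 5, 5]])
B = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1]])

dot = np.einsum("ij,jk->ik", A, B)     # j is summed: the matrix product
cube = np.einsum("ij,jk->ijk", A, B)   # j is kept: a 3D array of products

print(dot)
print(cube.shape)
# Summing the kept j axis (axis 1) recovers the dot product
print(np.array_equal(cube.sum(axis=1), dot))
```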
To get more of a feel for einsum, it can be useful to implement familiar NumPy array operations using the subscript notation. Anything that involves combinations of multiplying and summing axes can be written using einsum.
Let A and B be two 1D arrays with the same length. For example, A = np.arange(10) and B = np.arange(5, 15).
The sum of A can be written:
np.einsum("i->", A)
Element-wise multiplication, A * B, can be written:
np.einsum("i,i->i", A, B)
The inner product or dot product, np.inner(A, B) or np.dot(A, B), can be written:
np.einsum("i,i->", A, B) # or just use "i,i"
The outer product, np.outer(A, B), can be written:
np.einsum("i,j->ij", A, B)
For 2D arrays, C and D, provided that the axes are compatible lengths (both the same length or one of them has length 1), here are a few examples:
The trace of C (sum of main diagonal), np.trace(C), can be written:
np.einsum("ii", C)
Element-wise multiplication of C and the transpose of D, C * D.T, can be written:
np.einsum("ij,ji->ij", C, D)
Multiplying each element of C by the array D (to make a 4D array), C[:, :, None, None] * D, can be written:
np.einsum("ij,kl->ijkl", C, D)
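Each of these equivalences can be verified directly. The sketch below assumes square 3x3 arrays for C and D (a choice made here so that the trace and the transpose examples both apply) and collects the comparisons in a dictionary.

```python
import numpy as np

A = np.arange(10)
B = np.arange(5, 15)
C = np.arange(9).reshape(3, 3)        # assumed shapes, for illustration only
D = np.arange(9, 18).reshape(3, 3)

checks = {
    "sum": np.einsum("i->", A) == A.sum(),
    "elementwise": np.array_equal(np.einsum("i,i->i", A, B), A * B),
    "inner": np.einsum("i,i->", A, B) == np.inner(A, B),
    "outer": np.array_equal(np.einsum("i,j->ij", A, B), np.outer(A, B)),
    "trace": np.einsum("ii", C) == np.trace(C),
    "times_transpose": np.array_equal(np.einsum("ij,ji->ij", C, D), C * D.T),
    "4d_product": np.array_equal(np.einsum("ij,kl->ijkl", C, D),
                                 C[:, :, None, None] * D),
}
print(checks)
```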
Before answering this question we need to understand a few base terms; skip these if you already know any of them.
Generators are objects that allow us to suspend the execution of a Python function. User-created generators are implemented using the keyword yield. By creating a normal function containing the yield keyword, we turn that function into a generator:
>>> def test():
...     yield 1
...     yield 2
...
>>> gen = test()
>>> next(gen)
1
>>> next(gen)
2
>>> next(gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
As you can see, calling next() on the generator causes the interpreter to load test's frame, and return the yielded value. Calling next() again causes the frame to load again into the interpreter stack, and continue on yielding another value.
By the third time next() is called, our generator was finished, and StopIteration was thrown.
A lesser-known feature of generators is the fact that you can communicate with them using two methods: send() and throw().
>>> def test():
...     val = yield 1
...     print(val)
...     yield 2
...     yield 3
...
>>> gen = test()
>>> next(gen)
1
>>> gen.send("abc")
abc
2
>>> gen.throw(Exception())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in test
Exception
Upon calling gen.send(), the value is passed as a return value from the yield keyword.
gen.throw(), on the other hand, allows throwing exceptions inside generators, with the exception raised at the same spot yield was called.
Returning a value from a generator results in the value being put inside the StopIteration exception. We can later on recover the value from the exception and use it to our need.
>>> def test():
...     yield 1
...     return "abc"
...
>>> gen = test()
>>> next(gen)
1
>>> try:
...     next(gen)
... except StopIteration as exc:
...     print(exc.value)
...
abc
yield from
Python 3.3 came with the addition of a new keyword: yield from. What that keyword allows us to do is pass on any next(), send() and throw() into an innermost nested generator. If the inner generator returns a value, it is also the return value of yield from:
>>> def inner():
...     inner_result = yield 2
...     print("inner", inner_result)
...     return 3
...
>>> def outer():
...     yield 1
...     val = yield from inner()
...     print("outer", val)
...     yield 4
...
>>> gen = outer()
>>> next(gen)
1
>>> next(gen) # Goes inside inner() automatically
2
>>> gen.send("abc")
inner abc
outer 3
4
I've written an article to further elaborate on this topic.
Upon introducing the new keyword yield from in Python 3.3, we were now able to create generators inside generators that, just like a tunnel, pass the data back and forth from the innermost to the outermost generators. This has spawned a new meaning for generators: coroutines.
Coroutines are functions that can be stopped and resumed while being run. In Python, they are defined using the async def keyword. Much like generators, they too use their own form of yield from, which is await. Before async and await were introduced in Python 3.5, we created coroutines in the exact same way generators were created (with yield from instead of await).
async def inner():
    return 1

async def outer():
    await inner()
Like every iterator or generator that implements the __iter__() method, coroutines implement __await__(), which allows them to continue on every time await coro is called.
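To make the link between await and the generator protocol concrete, here is a small sketch that drives a coroutine chain by hand with .send(), without any event loop. The Suspend class is a hypothetical awaitable invented for this example; it is not part of asyncio.

```python
class Suspend:
    """Hypothetical awaitable: pauses once and hands control to whoever drives us."""
    def __await__(self):
        received = yield "paused"   # travels up through every await in the chain
        return received             # becomes the value of `await Suspend()`

async def inner():
    value = await Suspend()
    return value * 2

async def outer():
    result = await inner()
    return result + 1

coro = outer()
first = coro.send(None)   # runs until the innermost yield
try:
    coro.send(20)         # resumes the innermost coroutine with 20
except StopIteration as exc:
    final = exc.value     # the return value of outer()

print(first, final)
```

The value sent from the outside surfaces inside inner(), and the return value of outer() arrives in StopIteration, just as in the generator examples earlier.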
There"s a nice sequence diagram inside the Python docs that you should check out.
In asyncio, apart from coroutine functions, we have 2 important objects: tasks and futures.
Futures are objects that have the __await__() method implemented, and their job is to hold a certain state and result. The state can be one of the following:
PENDING: the future does not have any result or exception set.
CANCELLED: the future was cancelled using fut.cancel().
FINISHED: the future was finished, either by a result set using fut.set_result() or by an exception set using fut.set_exception().
The result, just like you have guessed, can either be a Python object, that will be returned, or an exception which may be raised.
Another important feature of future objects is that they contain a method called add_done_callback(). This method allows functions to be called as soon as the task is done, whether it raised an exception or finished.
Task objects are special futures, which wrap around coroutines, and communicate with the innermost and outermost coroutines. Every time a coroutine awaits a future, the future is passed all the way back to the task (just like in yield from), and the task receives it.
Next, the task binds itself to the future. It does so by calling add_done_callback() on the future. From now on, if the future will ever be done, by either being cancelled, passed an exception or passed a Python object as a result, the task's callback will be called, and it will rise back up to existence.
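The interplay between a future, a done callback and the awaiting task can be observed with the public asyncio API. This is only a sketch: the events list and the 0.01 second delay are arbitrary choices for illustration, and it assumes Python 3.7 or later for asyncio.run().

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()      # starts out pending
    events = []

    # Our own callback, registered the same way the task registers its wake-up
    fut.add_done_callback(lambda f: events.append(("callback", f.result())))

    # Pretend some IO completes a little later and sets the result
    loop.call_later(0.01, fut.set_result, "data")

    value = await fut               # the task running main() is suspended here
    events.append(("resumed", value))
    return events

events = asyncio.run(main())
print(events)
```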
The final burning question we must answer is  how is the IO implemented?
Deep inside asyncio, we have an event loop. An event loop of tasks. The event loop"s job is to call tasks every time they are ready and coordinate all that effort into one single working machine.
The IO part of the event loop is built upon a single crucial function called select. Select is a blocking function, implemented by the operating system underneath, that allows waiting on sockets for incoming or outgoing data. Upon receiving data it wakes up, and returns the sockets which received data, or the sockets which are ready for writing.
When you try to receive or send data over a socket through asyncio, what actually happens below is that the socket is first checked if it has any data that can be immediately read or sent. If its .send() buffer is full, or the .recv() buffer is empty, the socket is registered to the select function (by simply adding it to one of the lists, rlist for recv and wlist for send) and the appropriate function awaits a newly created future object, tied to that socket.
When all available tasks are waiting for futures, the event loop calls select and waits. When one of the sockets has incoming data, or its send buffer has drained, asyncio checks for the future object tied to that socket, and sets it to done.
Now all the magic happens. The future is set to done, the task that added itself before with add_done_callback() rises back up to life, and calls .send() on the coroutine, which resumes the innermost coroutine (because of the await chain), and you read the newly received data from a nearby buffer it was spilled onto.
Method chain again, in case of recv():
select.select waits.
future.set_result() is called.
The task that added itself with add_done_callback() is now woken up.
The task calls .send() on the coroutine, which goes all the way into the innermost coroutine and wakes it up.
In summary, asyncio uses generator capabilities that allow pausing and resuming functions. It uses yield from capabilities that allow passing data back and forth from the innermost generator to the outermost. It uses all of those in order to halt function execution while it's waiting for IO to complete (by using the OS select function).
And the best of all? While one function is paused, another may run and interleave with the delicate fabric, which is asyncio.
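The select step on its own can be tried with a pair of connected sockets. This sketch uses only the standard library and does not show asyncio's internals; it only illustrates how select reports a socket as readable once data has arrived.

```python
import select
import socket

reader, writer = socket.socketpair()   # two connected sockets in one process

# Nothing has been sent yet: with a zero timeout, select reports no readable sockets
ready, _, _ = select.select([reader], [], [], 0)
ready_before = list(ready)

writer.sendall(b"ping")

# Now select returns the socket whose receive buffer has data
ready, _, _ = select.select([reader], [], [], 1.0)
data = ready[0].recv(4) if ready else b""

reader.close()
writer.close()
print(ready_before, data)
```

An event loop does the same thing with many sockets at once, and uses the returned list to decide which futures to mark as done.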
If your main goal is to visualize the correlation matrix, rather than creating a plot per se, the convenient pandas styling options are a viable built-in solution:
import pandas as pd
import numpy as np
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
corr = df.corr()
corr.style.background_gradient(cmap="coolwarm")
# "RdBu_r", "BrBG_r", & PuOr_r are other good diverging colormaps
Note that this needs to be in a backend that supports rendering HTML, such as the JupyterLab Notebook.
You can easily limit the digit precision:
# Note: recent pandas versions replace set_precision(2) with format(precision=2)
corr.style.background_gradient(cmap="coolwarm").set_precision(2)
Or get rid of the digits altogether if you prefer the matrix without annotations:
corr.style.background_gradient(cmap="coolwarm").set_properties(**{"font-size": "0pt"})
The styling documentation also includes instructions of more advanced styles, such as how to change the display of the cell the mouse pointer is hovering over.
In my testing, style.background_gradient() was 4x faster than plt.matshow() and 120x faster than sns.heatmap() with a 10x10 matrix. Unfortunately it doesn't scale as well as plt.matshow(): the two take about the same time for a 100x100 matrix, and plt.matshow() is 10x faster for a 1000x1000 matrix.
There are a few possible ways to save the stylized dataframe:
Return the HTML by appending the render() method and then write the output to a file.
Save as an .xlsx file with conditional formatting by appending the to_excel() method.
By setting axis=None, it is now possible to compute the colors based on the entire matrix rather than per column or per row:
corr.style.background_gradient(cmap="coolwarm", axis=None)
Since many people are reading this answer I thought I would add a tip for how to only show one corner of the correlation matrix. I find this easier to read myself, since it removes the redundant information.
# Fill diagonal and upper half with NaNs
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan
(corr
 .style
 .background_gradient(cmap="coolwarm", axis=None, vmin=-1, vmax=1)
 .highlight_null(null_color="#f1f1f1")  # Color NaNs grey
 .set_precision(2))
Without actual data it is hard to answer the question but I guess you are looking for something like this:
Top15["Citable docs per Capita"].corr(Top15["Energy Supply per Capita"])
That calculates the correlation between your two columns "Citable docs per Capita" and "Energy Supply per Capita".
To give an example:
import pandas as pd
df = pd.DataFrame({"A": range(4), "B": [2*i for i in range(4)]})
   A  B
0  0  0
1  1  2
2  2  4
3  3  6
Then
df["A"].corr(df["B"])
gives 1 as expected.
Now, if you change a value, e.g.
df.loc[2, "B"] = 4.5
   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0
the command
df["A"].corr(df["B"])
returns
0.99586
which is still close to 1, as expected.
If you apply .corr directly to your dataframe, it will return all pairwise correlations between your columns; that's why you then observe 1s at the diagonal of your matrix (each column is perfectly correlated with itself).
df.corr()
will therefore return
          A         B
A  1.000000  0.995862
B  0.995862  1.000000
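For readers who want to see where these numbers come from, the Pearson coefficient can be computed by hand. The helper below is a plain-Python sketch written for this example (the name pearson is not a pandas function); it reproduces the two values above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_exact = pearson([0, 1, 2, 3], [0, 2, 4, 6])        # B is exactly 2*A
r_perturbed = pearson([0, 1, 2, 3], [0, 2, 4.5, 6])  # one value changed to 4.5
print(r_exact, round(r_perturbed, 6))
```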
In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).
There can be cases where you get NaNs in your solution; check this post for an example.
If you want to filter entries above/below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer and if you then run into the issue with overlapping axis-labels check the following post.
When objects are instantiated, the object itself is passed into the self parameter.
Because of this, the object's data is bound to the object. Below is an example of how you might like to visualize what each object's data might look like. Notice how 'self' is replaced with the object's name. I'm not saying this example diagram below is wholly accurate but it will hopefully serve a purpose in visualizing the use of self.
The object is passed into the self parameter so that the object can keep hold of its own data.
Although this may not be wholly accurate, think of the process of instantiating an object like this: when an object is made it uses the class as a template for its own data and methods. Without passing its own name into the self parameter, the attributes and methods in the class would remain as a general template and would not be referenced to (belong to) the object. So by passing the object's name into the self parameter it means that if 100 objects are instantiated from the one class, they can all keep track of their own data and methods.
See the illustration below:
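In code, the same idea can be sketched with a small class. The Counter class and its attribute names are made up for this illustration.

```python
class Counter:
    def __init__(self, name):
        # self is the object being created, so these attributes belong to it alone
        self.name = name
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

a = Counter("a")
b = Counter("b")

a.increment()
a.increment()

# Calling through the class makes the hidden argument visible:
# Counter.increment(b) is what b.increment() does behind the scenes
Counter.increment(b)

print(a.name, a.count)
print(b.name, b.count)
```

Both objects share one definition of increment, yet each keeps its own count because the method always works on whichever object arrives as self.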
This is kind of overkill but let's give it a go. First let's use statsmodels to find out what the p-values should be
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
and we get
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
Ok, let's reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using matrix algebra. But what the heck.
lm = LinearRegression()
lm.fit(X, y)
params = np.append(lm.intercept_, lm.coef_)
predictions = lm.predict(X)
newX = pd.DataFrame({"Constant": np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y - predictions)**2)) / (len(newX) - len(newX.columns))
# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))
var_b = MSE * (np.linalg.inv(np.dot(newX.T, newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params / sd_b
# Degrees of freedom: number of observations minus number of estimated parameters
p_values = [2*(1 - stats.t.cdf(np.abs(i), (len(newX) - len(newX.columns)))) for i in ts_b]
sd_b = np.round(sd_b, 3)
ts_b = np.round(ts_b, 3)
p_values = np.round(p_values, 3)
params = np.round(params, 4)
myDF3 = pd.DataFrame()
myDF3["Coefficients"], myDF3["Standard Errors"], myDF3["t values"], myDF3["Probabilities"] = [params, sd_b, ts_b, p_values]
print(myDF3)
And this gives us:
    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061          0.000
1       -10.0122           59.749    -0.168          0.867
2      -239.8191           61.222    -3.917          0.000
3       519.8398           66.534     7.813          0.000
4       324.3904           65.422     4.958          0.000
5      -792.1842          416.684    -1.901          0.058
6       476.7458          339.035     1.406          0.160
7       101.0446          212.533     0.475          0.635
8       177.0642          161.476     1.097          0.273
9       751.2793          171.902     4.370          0.000
10       67.6254           65.984     1.025          0.306
So we can reproduce the values from statsmodels.
You said you couldn’t get the golden spiral method to work and that’s a shame because it’s really, really good. I would like to give you a complete understanding of it so that maybe you can understand how to keep this away from being “bunched up.”
So here’s a fast, non-random way to create a lattice that is approximately correct; as discussed above, no lattice will be perfect, but this may be good enough. It is compared to other methods e.g. at BendWavy.org but it just has a nice and pretty look as well as a guarantee about even spacing in the limit.
To understand this algorithm, I first invite you to look at the 2D sunflower spiral algorithm. This is based on the fact that the most irrational number is the golden ratio (1 + sqrt(5))/2 and if one emits points by the approach “stand at the center, turn a golden ratio of whole turns, then emit another point in that direction,” one naturally constructs a spiral which, as you get to higher and higher numbers of points, nevertheless refuses to have well-defined ‘bars’ that the points line up on.^{(Note 1.)}
The algorithm for even spacing on a disk is,
from numpy import pi, cos, sin, sqrt, arange
import matplotlib.pyplot as pp
num_pts = 100
indices = arange(0, num_pts, dtype=float) + 0.5
r = sqrt(indices/num_pts)
theta = pi * (1 + 5**0.5) * indices
pp.scatter(r*cos(theta), r*sin(theta))
pp.show()
and it produces results that look like (n=100 and n=1000):
The key strange thing is the formula r = sqrt(indices / num_pts); how did I come to that one? ^{(Note 2.)}
Well, I am using the square root here because I want these to have even-area spacing around the disk. That is the same as saying that in the limit of large N I want a little region R ∈ (r, r + dr), Θ ∈ (θ, θ + dθ) to contain a number of points proportional to its area, which is r dr dθ. Now if we pretend that we are talking about a random variable here, this has a straightforward interpretation as saying that the joint probability density for (R, Θ) is just c r for some constant c. Normalization on the unit disk would then force c = 1/π.
Now let me introduce a trick. It comes from probability theory where it’s known as sampling the inverse CDF: suppose you wanted to generate a random variable with a probability density f(z) and you have a random variable U ~ Uniform(0, 1), just like comes out of random() in most programming languages. How do you do this? Integrate the density to get the cumulative distribution function F(z), which rises monotonically from 0 to 1, invert it, and set Z = F^{-1}(U).
Now the golden-ratio spiral trick spaces the points out in a nicely even pattern for θ so let’s integrate that out; for the unit disk we are left with F(r) = r^{2}. So the inverse function is F^{-1}(u) = u^{1/2}, and therefore we would generate random points on the disk in polar coordinates with r = sqrt(random()); theta = 2 * pi * random().
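The random version of this recipe is easy to test. The sketch below uses only the standard library; the function name sample_disk, the seed and the sample size are arbitrary choices for illustration. If the sampling is uniform by area, about a quarter of the points should fall inside radius 0.5, since that inner disk has a quarter of the total area.

```python
import math
import random

def sample_disk(rng):
    """Inverse-CDF sampling on the unit disk: F(r) = r**2, so r = sqrt(u)."""
    r = math.sqrt(rng.random())
    theta = 2 * math.pi * rng.random()
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)   # fixed seed so the run is repeatable
points = [sample_disk(rng) for _ in range(20000)]

# Fraction of points inside the disk of radius 0.5
inner_fraction = sum(1 for x, y in points if math.hypot(x, y) < 0.5) / len(points)
print(inner_fraction)
```

Dropping the square root (r = random()) would put half of the points inside radius 0.5, which is the "bunched up at the center" look.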
Now instead of randomly sampling this inverse function we’re uniformly sampling it, and the nice thing about uniform sampling is that our results about how points are spread out in the limit of large N will behave as if we had randomly sampled it. This combination is the trick. Instead of random() we use (arange(0, num_pts, dtype=float) + 0.5)/num_pts, so that, say, if we want to sample 10 points the uniform samples are u = 0.05, 0.15, 0.25, ... 0.95, and r is the square root of each. We uniformly sample the inverse CDF to get equal-area spacing, and we use the sunflower increment to avoid awful “bars” of points in the output.
The changes that we need to make to dot the sphere with points merely involve switching out the polar coordinates for spherical coordinates. The radial coordinate of course doesn't enter into this because we're on a unit sphere. To keep things a little more consistent here, even though I was trained as a physicist I'll use mathematicians' coordinates where 0 ≤ φ ≤ π is latitude coming down from the pole and 0 ≤ θ ≤ 2π is longitude. So the difference from above is that we are basically replacing the variable r with φ.
Our area element, which was r dr dθ, now becomes the not-much-more-complicated sin(φ) dφ dθ. So our joint density for uniform spacing is sin(φ)/4π. Integrating out θ, we find f(φ) = sin(φ)/2, thus F(φ) = (1 − cos(φ))/2. Inverting this we can see that a uniform random variable would look like acos(1 − 2 u), but we sample uniformly instead of randomly, so we instead use φ_{k} = acos(1 − 2 (k + 0.5)/N). And the rest of the algorithm is just projecting this onto the x, y, and z coordinates:
from numpy import pi, cos, sin, arccos, arange
import mpl_toolkits.mplot3d
import matplotlib.pyplot as pp
num_pts = 1000
indices = arange(0, num_pts, dtype=float) + 0.5
phi = arccos(1 - 2*indices/num_pts)
theta = pi * (1 + 5**0.5) * indices
x, y, z = cos(theta) * sin(phi), sin(theta) * sin(phi), cos(phi);
pp.figure().add_subplot(111, projection="3d").scatter(x, y, z);
pp.show()
Again for n=100 and n=1000 the results look like:
I wanted to give a shout out to Martin Roberts’s blog. Note that above I created an offset of my indices by adding 0.5 to each index. This was just visually appealing to me, but it turns out that the choice of offset matters a lot and is not constant over the interval and can mean getting as much as 8% better accuracy in packing if chosen correctly. There should also be a way to get his R_{2} sequence to cover a sphere and it would be interesting to see if this also produced a nice even covering, perhaps as-is but perhaps needing to be, say, taken from only a half of the unit square cut diagonally or so and stretched around to get a circle.
Those “bars” are formed by rational approximations to a number, and the best rational approximations to a number come from its continued fraction expression, z + 1/(n_1 + 1/(n_2 + 1/(n_3 + ...))) where z is an integer and n_1, n_2, n_3, ... is either a finite or infinite sequence of positive integers:
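from math import floor
def continued_fraction(r):
    while True:
        n = floor(r)
        yield n
        if r == n:
            # nothing left after the integer part: the expansion is finite
            return
        r = 1/(r - n)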
Since the fraction part 1/(...) is always between zero and one, a large integer in the continued fraction allows for a particularly good rational approximation: “one divided by something between 100 and 101” is better than “one divided by something between 1 and 2.” The most irrational number is therefore the one which is 1 + 1/(1 + 1/(1 + ...)) and has no particularly good rational approximations; one can solve φ = 1 + 1/φ by multiplying through by φ to get the formula for the golden ratio.
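To make the "good rational approximation" point concrete, here is a small sketch that turns a list of continued-fraction terms into its successive rational approximations (convergents) using the standard recurrence. The helper name `convergents` is mine, and the terms 3, 7, 15, 1 are the well-known start of the expansion of π:

```python
from fractions import Fraction

def convergents(terms):
    """Yield the successive rational approximations for continued-fraction terms."""
    # Standard recurrence: h_k = n_k * h_(k-1) + h_(k-2), and likewise for the denominators
    h_prev, h = 1, terms[0]
    k_prev, k = 0, 1
    yield Fraction(h, k)
    for n in terms[1:]:
        h_prev, h = h, n * h + h_prev
        k_prev, k = k, n * k + k_prev
        yield Fraction(h, k)

# All-ones terms (the golden ratio) give ratios of consecutive Fibonacci numbers
golden = (1 + 5**0.5) / 2
for approx in convergents([1] * 8):
    print(approx, abs(float(approx) - golden))

# A large term such as 15 makes the convergent just before it (22/7) unusually accurate
print(list(convergents([3, 7, 15, 1])))
```

With all-ones terms the error shrinks only slowly from one convergent to the next, which is the sense in which the golden ratio has no particularly good rational approximations.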
For folks who are not so familiar with NumPy: all of the functions are “vectorized,” so that sqrt(array) is the same as what other languages might write map(sqrt, array). So this is a component-by-component sqrt application. The same also holds for division by a scalar or addition with scalars: those apply to all components in parallel.
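A tiny illustration of that equivalence, with made-up sample values:

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# Vectorized call: sqrt is applied to every component at once
vectorized = np.sqrt(values)

# The explicit map() spelling that other languages might use
mapped = np.array(list(map(np.sqrt, values)))

# Scalars are applied to all components too: adds 0.5 to each, then divides each by 3
scaled = (values + 0.5) / 3
```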
The proof is simple once you know that this is the result. If you ask what's the probability that z < Z < z + dz, this is the same as asking what's the probability that z < F^{-1}(U) < z + dz. Apply F to all three expressions, noting that it is a monotonically increasing function, hence F(z) < U < F(z + dz). Expand the right-hand side out to find F(z) + f(z) dz, and since U is uniform this probability is just f(z) dz as promised.
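For anyone who prefers to see this numerically, here is a quick sanity check rather than a proof. It assumes the unit-disk case from above, where F(r) = r^2, and compares the empirical CDF of sqrt(U) against r^2; the seed and sample count are arbitrary choices of mine:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
num_samples = 100_000

# Inverse CDF for the disk radius: F(r) = r^2, so F^{-1}(u) = sqrt(u)
samples = [random.random() ** 0.5 for _ in range(num_samples)]

def empirical_cdf(z):
    """Fraction of samples that fall below z."""
    return sum(s < z for s in samples) / num_samples

for z in (0.25, 0.5, 0.75):
    # The two printed numbers should be close to each other
    print(z, empirical_cdf(z), z**2)
```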
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.
If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
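The set-based version is linked rather than shown here, so the following is only a sketch of the general idea under my own assumptions: split the sentence into word-like runs, and drop the ones found in a set. The function name `remove_banned` and the sample data are hypothetical.

```python
import re

# Hypothetical sample data; a real list would hold many thousands of words
banned_words = {"foobar", "foobah", "fooxar", "foozap", "fooza"}

WORD_PATTERN = re.compile(r"\w+")

def remove_banned(sentence, banned):
    """Delete every word-like run of sentence whose lowercase form is in banned."""
    def replace(match):
        word = match.group(0)
        # Set membership is a hash lookup, so its cost does not grow with the list size
        return "" if word.lower() in banned else word
    return WORD_PATTERN.sub(replace, sentence)

print(remove_banned("keep FooBar and foozap out", banned_words))
```

The only regex left is the trivial `\w+` tokenizer; the banned words themselves never enter a pattern.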
A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.
It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.
["foobar", "foobah", "fooxar", "foozap", "fooza"]
The list is converted to a trie:
{
    "f": {
        "o": {
            "o": {
                "x": {
                    "a": {
                        "r": {
                            "": 1
                        }
                    }
                },
                "b": {
                    "a": {
                        "r": {
                            "": 1
                        },
                        "h": {
                            "": 1
                        }
                    }
                },
                "z": {
                    "a": {
                        "": 1,
                        "p": {
                            "": 1
                        }
                    }
                }
            }
        }
    }
}
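The conversion from list to nested dictionary can be sketched in a few lines. The helper name `build_trie` is mine; the empty-string key marks the end of a word, as in the structure above:

```python
def build_trie(words):
    """Nest one dict level per character; the "" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            # Reuse the existing branch for this character, or start a new one
            node = node.setdefault(char, {})
        node[""] = 1
    return root

trie = build_trie(["foobar", "foobah", "fooxar", "foozap", "fooza"])
```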
And then to this regex pattern:
r"foo(?:ba[hr]|xar|zap?)"
The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's a preprocess overkill for 5 words, but it shows promising results for many thousand words.
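A quick check that the compact pattern accepts exactly the five words from the list. The extra inputs `zoo` and `foobaz` are my own negative examples:

```python
import re

pattern = re.compile(r"foo(?:ba[hr]|xar|zap?)")
words = ["foobar", "foobah", "fooxar", "foozap", "fooza"]

# fullmatch requires the whole string to match, not just a prefix
matches = {word: bool(pattern.fullmatch(word)) for word in words + ["zoo", "foobaz"]}
print(matches)
```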
Note that (?:) non-capturing groups are used because:
foobar|baz would match foobar or baz, but not foobaz
foo(bar|baz) would save unneeded information to a capturing group.
Here's a slightly modified gist, which we can use as a trie.py library:
import re
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}
    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[""] = 1  # the empty key marks the end of a word
    def dump(self):
        return self.data
    def quote(self, char):
        return re.escape(char)
    def _pattern(self, pData):
        data = pData
        # A node holding only the end marker is a leaf: nothing more to match
        if "" in data and len(data.keys()) == 1:
            return None
        alt = []  # alternatives that continue with a sub-pattern
        cc = []   # single characters that end a word (character class)
        q = 0     # set when a word may also end at this node
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # recurse is None for a leaf, so the concatenation fails
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0
        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append("[" + "".join(cc) + "]")
        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"
        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result
    def pattern(self):
        return self._pattern(self.dump())
Here's a small test (the same as this one):
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie
with open("/usr/share/dict/american-english") as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)
test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]
def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
def find(word):
    def fun():
        return union.match(word)
    return fun
for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
It outputs:
TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms
TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms
TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms
TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms
TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
For info, the regex begins like this:
(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...
It's really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!
Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi: