I'm looking for a way to test whether or not a given string repeats itself for the entire string.
Examples:
[
"0045662100456621004566210045662100456621", # "00456621"
"0072992700729927007299270072992700729927", # "00729927"
"001443001443001443001443001443001443001443", # "001443"
"037037037037037037037037037037037037037037037", # "037"
"047619047619047619047619047619047619047619", # "047619"
"002457002457002457002457002457002457002457", # "002457"
"001221001221001221001221001221001221001221", # "001221"
"001230012300123001230012300123001230012300123", # "00123"
"0013947001394700139470013947001394700139470013947", # "0013947"
"001001001001001001001001001001001001001001001001001", # "001"
"001406469760900140646976090014064697609", # "0014064697609"
]
are strings which repeat themselves, and
[
"004608294930875576036866359447",
"00469483568075117370892018779342723",
"004739336492890995260663507109",
"001508295625942684766214177978883861236802413273",
"007518796992481203",
"0071942446043165467625899280575539568345323741",
"0434782608695652173913",
"0344827586206896551724137931",
"002481389578163771712158808933",
"002932551319648093841642228739",
"0035587188612099644128113879",
"003484320557491289198606271777",
"00115074798619102416570771",
]
are examples of ones that do not.
The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern and then checking the pattern against the rest of the string seems awfully slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.
I've looked into regexes a bit and they seem good for when you know what you're looking for, or at least the length of the pattern you're looking for. Unfortunately, I know neither.
How can I tell if a string is repeating itself and, if it is, what the shortest repeating subsequence is?
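A minimal sketch of one commonly cited approach (it is not part of the question above): a string is built from repeats of a shorter block exactly when it occurs inside its own doubling at an offset other than 0 or len(s). The function name principal_period is chosen here for illustration only.

```python
def principal_period(s):
    """Return the shortest block whose repetition gives s, or None."""
    # Search for s inside s + s, skipping the trivial matches at
    # offset 0 and offset len(s).
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]


print(principal_period("037037037037"))        # 037
print(principal_period("007518796992481203"))  # None
```

Because the search is delegated to str.find, no Python-level loop over candidate pattern lengths is needed.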
How can I tell if a string repeats itself in Python? ones: Questions
Is there a list of Pytz Timezones?
3 answers
I would like to know all the possible values for the timezone argument in the Python library pytz. How can I list them?
Answer #1
You can list all the available timezones with pytz.all_timezones:
In [40]: import pytz
In [41]: pytz.all_timezones
Out[41]:
["Africa/Abidjan",
"Africa/Accra",
"Africa/Addis_Ababa",
...]
There is also pytz.common_timezones:
In [45]: len(pytz.common_timezones)
Out[45]: 403
In [46]: len(pytz.all_timezones)
Out[46]: 563
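For comparison, and as an assumption beyond the answer above: on Python 3.9 or later the standard library's zoneinfo module exposes a similar listing through zoneinfo.available_timezones(). The result depends on the time zone data installed on the system (or on the tzdata package), so it may differ from the pytz lists and can even be empty.

```python
import zoneinfo

# A set of IANA key names; its contents depend on the installed tz database.
names = zoneinfo.available_timezones()

# Sort for a stable display order and show the first few entries, if any.
for name in sorted(names)[:3]:
    print(name)
```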
Python strptime() and timezones?
3 answers
I have a CSV dumpfile from a Blackberry IPD backup, created using IPDDump.
The date/time strings in here look something like this (where EST is an Australian time-zone):
Tue Jun 22 07:46:22 EST 2010
I need to be able to parse this date in Python. At first, I tried to use the strptime() function from datetime.
>>> datetime.datetime.strptime("Tue Jun 22 12:10:20 2010 EST", "%a %b %d %H:%M:%S %Y %Z")
However, for some reason, the datetime object that comes back doesn't seem to have any tzinfo associated with it.
I did read on this page that apparently datetime.strptime silently discards tzinfo; however, I checked the documentation, and I can't find anything to that effect documented here.
I have been able to get the date parsed using a third-party Python library, dateutil; however, I'm still curious as to how I was using the built-in strptime() incorrectly. Is there any way to get strptime() to play nicely with timezones?
Answer #1
I recommend using python-dateutil. Its parser has been able to parse every date format I've thrown at it so far.
>>> from dateutil import parser
>>> parser.parse("Tue Jun 22 07:46:22 EST 2010")
datetime.datetime(2010, 6, 22, 7, 46, 22, tzinfo=tzlocal())
>>> parser.parse("Fri, 11 Nov 2011 03:18:09 -0400")
datetime.datetime(2011, 11, 11, 3, 18, 9, tzinfo=tzoffset(None, -14400))
>>> parser.parse("Sun")
datetime.datetime(2011, 12, 18, 0, 0)
>>> parser.parse("10-11-08")
datetime.datetime(2008, 10, 11, 0, 0)
and so on. No dealing with strptime() format nonsense... just throw a date at it and it Does The Right Thing.
Update: Oops. I missed in your original question that you mentioned that you used dateutil, sorry about that. But I hope this answer is still useful to other people who stumble across this question when they have date parsing questions and see the utility of that module.
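On the original strptime() question: in Python 3 the %z directive does produce an aware datetime when the string carries a numeric UTC offset, whereas abbreviations such as EST are ambiguous. A minimal sketch using the offset-bearing string from the answer above (the format string is an assumption about how to match it, not something taken from the answer):

```python
from datetime import datetime, timedelta

# %z parses a numeric offset such as -0400 and attaches it as tzinfo.
dt = datetime.strptime("Fri, 11 Nov 2011 03:18:09 -0400",
                       "%a, %d %b %Y %H:%M:%S %z")

print(dt.tzinfo is not None)                  # True
print(dt.utcoffset() == timedelta(hours=-4))  # True
```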
Fitting empirical distribution to theoretical ones with Scipy (Python)?
3 answers
INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn't matter for this problem.
PROBLEM: Based on my distribution I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, as you can see, the p-value for 0 would be approaching 1 and the p-value for higher numbers would be tending to 0.
I don't know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.
Is there a way to implement such an analysis in Python (Scipy or Numpy)?
Could you present any examples?
Thank you!
Answer #1
Distribution Fitting with Sum of Square Error (SSE)
This is an update and modification to Saullo's answer that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.
Example Fitting
Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.
[Figure: All Distributions]
[Figure: Best Fit Distribution]
Example Code
%matplotlib inline
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)
matplotlib.style.use("ggplot")
# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data; x becomes the bin centers
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0
    # Collected (distribution, params, sse) tuples
    best_distributions = []
    # Estimate distribution parameters from data
    for ii, distribution in enumerate([d for d in _distn_names if d not in ["levy_stable", "studentized_range"]]):
        print("{:>3} / {:<3}: {}".format(ii + 1, len(_distn_names), distribution))
        distribution = getattr(st, distribution)
        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore")
                # fit dist to data
                params = distribution.fit(data)
                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, *arg, loc=loc, scale=scale)
                sse = np.sum(np.power(y - pdf, 2.0))
                # if an axis was passed in, add the fitted PDF to the plot
                try:
                    if ax is not None:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass
                # record this distribution; results are sorted by SSE below
                best_distributions.append((distribution, params, sse))
        except Exception:
            pass
    return sorted(best_distributions, key=lambda x: x[2])
def make_pdf(dist, params, size=10000):
    """Generate the distribution's probability density function"""
    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, *arg, loc=loc, scale=scale)
    pdf = pd.Series(y, x)
    return pdf
# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index("YEAR").values.ravel())
# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams["axes.prop_cycle"])[1]["color"])
# Save plot limits
dataYLim = ax.get_ylim()
# Find best fit distribution
best_distributions = best_fit_distribution(data, 200, ax)
best_dist = best_distributions[0]
# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u"El Niño sea temp.\nAll Fitted Distributions")
ax.set_xlabel(u"Temp (°C)")
ax.set_ylabel("Frequency")
# Make PDF with best params
pdf = make_pdf(best_dist[0], best_dist[1])
# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label="PDF", legend=True)
data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)
param_names = (best_dist[0].shapes + ", loc, scale").split(", ") if best_dist[0].shapes else ["loc", "scale"]
param_str = ", ".join(["{}={:0.2f}".format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = "{}({})".format(best_dist[0].name, param_str)
ax.set_title(u"El Niño sea temp. with best fit distribution\n" + dist_str)
ax.set_xlabel(u"Temp. (°C)")
ax.set_ylabel("Frequency")
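To connect the fit back to the original question, a fitted distribution's survival function sf(x) gives the probability of seeing a value greater than x. The sketch below is an illustration only: it fits a single gamma distribution to synthetic data (both choices are assumptions, not part of the answer above) instead of running the full best-fit search.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for the real sample (assumption for illustration).
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=5.0, size=5000)

# Fit one candidate distribution and freeze it with the fitted parameters.
params = st.gamma.fit(data)
fitted = st.gamma(*params)


def p_value(x):
    """Probability of observing a value greater than x under the fit."""
    return float(fitted.sf(x))


print(p_value(5.0) > p_value(40.0))  # True
```

With the real data, the tuple returned first by best_fit_distribution could be frozen the same way and queried through sf.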
Create list of single item repeated N times
5 answers
I want to create a series of lists, all of varying lengths. Each list will contain the same element e, repeated n times (where n = length of the list).
How do I create the lists, without using a list comprehension [e for number in xrange(n)] for each list?
Answer #1
You can also write:
[e] * n
You should note that if e is for example an empty list you get a list with n references to the same list, not n independent empty lists.
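A short demonstration of that caveat, with a list comprehension shown as one way to get independent lists when e is mutable (the variable names are illustrative):

```python
n = 3

# [[]] * n copies the reference, so every slot points at one list.
shared = [[]] * n
shared[0].append("x")
print(shared)        # [['x'], ['x'], ['x']]

# A comprehension evaluates [] once per slot, giving separate lists.
independent = [[] for _ in range(n)]
independent[0].append("x")
print(independent)   # [['x'], [], []]
```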
Performance testing
At first glance it seems that repeat is the fastest way to create a list with n identical elements:
>>> timeit.timeit("itertools.repeat(0, 10)", "import itertools", number = 1000000)
0.37095273281943264
>>> timeit.timeit("[0] * 10", "import itertools", number = 1000000)
0.5577236771712819
But wait - it's not a fair test...
>>> itertools.repeat(0, 10)
repeat(0, 10) # Not a list!!!
The function itertools.repeat doesn't actually create the list, it just creates an object that can be used to create a list if you wish! Let's try that again, but converting to a list:
>>> timeit.timeit("list(itertools.repeat(0, 10))", "import itertools", number = 1000000)
1.7508119747063233
So if you want a list, use [e] * n. If you want to generate the elements lazily, use repeat.
What is the best way to repeatedly execute a function every x seconds?
5 answers
I want to repeatedly execute a function in Python every 60 seconds forever (just like an NSTimer in Objective C). This code will run as a daemon and is effectively like calling the python script every minute using a cron, but without requiring that to be set up by the user.
In this question about a cron implemented in Python, the solution appears to effectively just sleep() for x seconds. I don't need such advanced functionality, so perhaps something like this would work:
while True:
    # Code executed here
    time.sleep(60)
Are there any foreseeable problems with this code?
Answer #1
If your program doesn't have an event loop already, use the sched module, which implements a general purpose event scheduler.
import sched, time
s = sched.scheduler(time.time, time.sleep)
def do_something(sc):
    print("Doing stuff...")
    # do your stuff
    # re-schedule this function to run again in 60 seconds
    sc.enter(60, 1, do_something, (sc,))
s.enter(60, 1, do_something, (s,))
s.run()
If you're already using an event loop library like asyncio, trio, tkinter, PyQt5, gobject, kivy, or many others, just schedule the task using your existing event loop library's methods instead.
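With asyncio, for instance, the same idea might look like the sketch below. run_every and its repeats parameter are hypothetical names introduced here; repeats exists only so that the demonstration terminates, and passing None would loop forever.

```python
import asyncio


async def run_every(interval, func, repeats=None):
    """Call func, then sleep interval seconds, repeats times (None = forever)."""
    count = 0
    while repeats is None or count < repeats:
        func()
        count += 1
        # Yield to the event loop so other tasks can run while waiting.
        await asyncio.sleep(interval)


calls = []
# A short interval keeps the demonstration quick; use 60 for one minute.
asyncio.run(run_every(0.01, lambda: calls.append("tick"), repeats=3))
print(calls)  # ['tick', 'tick', 'tick']
```

Like the plain sleep loop, this waits a fixed interval after each call, so the time spent inside func adds to the period.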
Answer #2
Lock your time loop to the system clock like this:
import time
starttime = time.time()
while True:
    print("tick")
    time.sleep(60.0 - ((time.time() - starttime) % 60.0))
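The sleep expression above can be pulled into a small helper to see what it computes: the time remaining until the next multiple of the interval, measured from the start time. The name seconds_until_next_tick is hypothetical.

```python
def seconds_until_next_tick(start, now, interval=60.0):
    """Time left until the next multiple of interval after start."""
    elapsed = now - start
    return interval - (elapsed % interval)


# One second past a tick leaves 59 seconds to wait.
print(seconds_until_next_tick(start=0.0, now=61.0))   # 59.0
# Exactly on a tick waits a full interval.
print(seconds_until_next_tick(start=0.0, now=120.0))  # 60.0
```

Because the delay is recomputed from the clock each time, the time taken by the loop body does not accumulate as drift.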