Python | pandas.date_range() method


pandas.date_range() is one of the core functions in Pandas; it returns a fixed-frequency DatetimeIndex.

Syntax: pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)

Parameters:
start: Left bound for generating dates.
end: Right bound for generating dates.
periods: Number of periods to generate.
freq: Frequency strings can have multiples, e.g. `5H`. See the pandas documentation for a list of frequency aliases.
tz: Time zone name for returning a localized DatetimeIndex. By default, the resulting DatetimeIndex is timezone-naive.
normalize: Normalize start/end dates to midnight before generating the date range.
name: Name of the resulting DatetimeIndex.
closed: Make the interval closed with respect to the given frequency on the `left`, `right`, or both sides (None, the default).

Returns: DatetimeIndex
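The last two parameters are easy to overlook, so here is a minimal sketch of normalize and closed (note: recent pandas versions deprecate closed in favour of an inclusive parameter):

import pandas as pd

# normalize=True snaps the start timestamp to midnight before generating
pd.date_range(start='2018-01-01 13:45', periods=3, normalize=True)
# DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D')

# closed='left' keeps the left endpoint and drops the right one
pd.date_range(start='2018-01-01', end='2018-01-04', closed='left')
# DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D')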

Code #1:

import pandas as pd

per1 = pd.date_range(start='1-1-2018', end='1-05-2018', freq='5H')

for val in per1:
    print(val)

Output:
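Expected output (abridged; 20 timestamps in 5-hour steps):

2018-01-01 00:00:00
2018-01-01 05:00:00
2018-01-01 10:00:00
...
2018-01-04 18:00:00
2018-01-04 23:00:00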

Code #2:

import pandas as pd

dRan1 = pd.date_range(start='1-1-2018', end='8-01-2018', freq='M')

dRan2 = pd.date_range(start='1-1-2018', end='11-01-2018', freq='3M')

print(dRan1, "", dRan2)

Output:
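Expected output (exact line wrapping of the repr may vary): dRan1 holds the seven month-end dates from January to July 2018, and dRan2 every third month end:

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31'],
              dtype='datetime64[ns]', freq='M')
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31'], dtype='datetime64[ns]', freq='3M')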

Code #3:

import pandas as pd

# Specify the start and periods, the number of periods (days).
dRan1 = pd.date_range(start='1-1-2018', periods=13)

# Specify the end and periods, the number of periods (days).
dRan2 = pd.date_range(end='1-1-2018', periods=13)

# Specify start, end and periods; the frequency
# is generated automatically (with linear spacing).
dRan3 = pd.date_range(start='01-03-2017', end='1-1-2018', periods=13)

print(dRan1, "", dRan2, "", dRan3)

Output:
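Expected output (abridged): dRan1 runs daily from 2018-01-01 to 2018-01-13, dRan2 daily from 2017-12-20 to 2018-01-01; dRan3 is linearly spaced, so its 13 timestamps are 30 days 6 hours apart and freq is None:

DatetimeIndex(['2018-01-01', '2018-01-02', ..., '2018-01-13'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2017-12-20', '2017-12-21', ..., '2018-01-01'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2017-01-03 00:00:00', '2017-02-02 06:00:00', ..., '2018-01-01 00:00:00'], dtype='datetime64[ns]', freq=None)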

Code #4:

import pandas as pd

# Specify the start and periods, the number of periods (days),
# and localize to a time zone.
dRan1 = pd.date_range(start='1-1-2018', periods=13, tz='Asia/Tokyo')

dRan1

Output:
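Expected output (abridged; note the +09:00 offset and the timezone-aware dtype):

DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
               ...
               '2018-01-13 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Tokyo]', freq='D')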





Python | pandas.date_range() method: StackOverflow Questions

How do I filter query objects by date range in Django?

I"ve got a field in one model like:

class Sample(models.Model):
    date = models.DateField(auto_now=False)

Now, I need to filter the objects by a date range.

How do I filter all the objects that have a date between 1-Jan-2011 and 31-Jan-2011?

Answer #1

Label vs. Location

The main distinction between the two methods is:

  • loc gets rows (and/or columns) with particular labels.

  • iloc gets rows (and/or columns) at integer locations.

To demonstrate, consider a series s of characters with a non-monotonic integer index:

>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2]) 
49    a
48    b
47    c
0     d
1     e
2     f

>>> s.loc[0]    # value at index label 0
"d"

>>> s.iloc[0]   # value at index location 0
"a"

>>> s.loc[0:1]  # rows at index labels between 0 and 1 (inclusive)
0    d
1    e

>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49    a

Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:

<object> description s.loc[<object>] s.iloc[<object>]
0 single item Value at index label 0 (the string "d") Value at index location 0 (the string "a")
0:1 slice Two rows (labels 0 and 1) One row (first row at location 0)
1:47 slice with out-of-bounds end Zero rows (empty Series) Five rows (location 1 onwards)
1:47:-1 slice with negative step three rows (labels 1 back to 47) Zero rows (empty Series)
[2, 0] integer list Two rows with given labels Two rows with given locations
s > "e" Bool series (indicating which values have the property) One row (containing "f") NotImplementedError
(s>"e").values Bool array One row (containing "f") Same as loc
999 int object not in index KeyError IndexError (out of bounds)
-1 int object not in index KeyError Returns last value in s
lambda x: x.index[3] callable applied to series (here returning 3rd item in index) s.loc[s.index[3]] s.iloc[s.index[3]]
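A few of the trickier rows from this list, sketched against the series s defined above:

>>> s.loc[1:47:-1]            # slice with negative step: labels 1 back to 47
1     e
0     d
47    c

>>> s.iloc[-1]                # -1 is a valid location for iloc: the last value
"f"

>>> s.loc[lambda x: x.index[3]]   # callable evaluates to s.loc[0]
"d"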

loc"s label-querying capabilities extend well-beyond integer indexes and it"s worth highlighting a couple of additional examples.

Here"s a Series where the index contains string objects:

>>> s2 = pd.Series(s.index, index=s.values)
>>> s2
a    49
b    48
c    47
d     0
e     1
f     2

Since loc is label-based, it can fetch the first value in the Series using s2.loc["a"]. It can also slice with non-integer objects:

>>> s2.loc["c":"e"]  # all rows lying between "c" and "e" (inclusive)
c    47
d     0
e     1

For DateTime indexes, we don't need to pass the exact date/time to fetch by label. For example:

>>> s3 = pd.Series(list("abcde"), pd.date_range("now", periods=5, freq="M")) 
>>> s3
2021-01-31 16:41:31.879768    a
2021-02-28 16:41:31.879768    b
2021-03-31 16:41:31.879768    c
2021-04-30 16:41:31.879768    d
2021-05-31 16:41:31.879768    e

Then to fetch the row(s) for March/April 2021 we only need:

>>> s3.loc["2021-03":"2021-04"]
2021-03-31 17:04:30.742316    c
2021-04-30 17:04:30.742316    d

Rows and Columns

loc and iloc work the same way with DataFrames as they do with Series. It's useful to note that both methods can address columns and rows together.

When given a tuple, the first element is used to index the rows and, if it exists, the second element is used to index the columns.

Consider the DataFrame defined below:

>>> import numpy as np 
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),  
                      index=list("abcde"), 
                      columns=["x","y","z", 8, 9])
>>> df
    x   y   z   8   9
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

Then for example:

>>> df.loc["c": , :"z"]  # rows "c" and onwards AND columns up to "z"
    x   y   z
c  10  11  12
d  15  16  17
e  20  21  22

>>> df.iloc[:, 3]        # all rows, but only the column at index location 3
a     3
b     8
c    13
d    18
e    23

Sometimes we want to mix label and positional indexing methods for the rows and columns, somehow combining the capabilities of loc and iloc.

For example, consider the following DataFrame. How best to slice the rows up to and including "c" and take the first four columns?

>>> import numpy as np 
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),  
                      index=list("abcde"), 
                      columns=["x","y","z", 8, 9])
>>> df
    x   y   z   8   9
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

We can achieve this result using iloc and the help of another method:

>>> df.iloc[:df.index.get_loc("c") + 1, :4]
    x   y   z   8
a   0   1   2   3
b   5   6   7   8
c  10  11  12  13

get_loc() is an index method meaning "get the position of the label in this index". Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row "c" as well.
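Under the same setup, a chained alternative gives the same frame (a judgment call: chaining creates an intermediate object, but it reads more directly):

>>> df.loc[:"c"].iloc[:, :4]  # rows up to and including "c", then first four columns
    x   y   z   8
a   0   1   2   3
b   5   6   7   8
c  10  11  12  13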

Answer #2

There are two possible solutions:

  • Use a boolean mask, then use df.loc[mask]
  • Set the date column as a DatetimeIndex, then use df[start_date : end_date]

Using a boolean mask:

Ensure df["date"] is a Series with dtype datetime64[ns]:

df["date"] = pd.to_datetime(df["date"])  

Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:

# greater than the start date and smaller than or equal to the end date
mask = (df["date"] > start_date) & (df["date"] <= end_date)

Select the sub-DataFrame:

df.loc[mask]

or re-assign to df

df = df.loc[mask]

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df["date"] = pd.date_range("2000-1-1", periods=200, freq="D")
mask = (df["date"] > "2000-6-1") & (df["date"] <= "2000-6-10")
print(df.loc[mask])

yields

            0         1         2       date
153  0.208875  0.727656  0.037787 2000-06-02
154  0.750800  0.776498  0.237716 2000-06-03
155  0.812008  0.127338  0.397240 2000-06-04
156  0.639937  0.207359  0.533527 2000-06-05
157  0.416998  0.845658  0.872826 2000-06-06
158  0.440069  0.338690  0.847545 2000-06-07
159  0.202354  0.624833  0.740254 2000-06-08
160  0.465746  0.080888  0.155452 2000-06-09
161  0.858232  0.190321  0.432574 2000-06-10

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df["date"] = pd.date_range("2000-1-1", periods=200, freq="D")
df = df.set_index(["date"])
print(df.loc["2000-6-1":"2000-6-10"])

yields

                   0         1         2
date                                    
2000-06-01  0.040457  0.326594  0.492136    # <- includes start_date
2000-06-02  0.279323  0.877446  0.464523
2000-06-03  0.328068  0.837669  0.608559
2000-06-04  0.107959  0.678297  0.517435
2000-06-05  0.131555  0.418380  0.025725
2000-06-06  0.999961  0.619517  0.206108
2000-06-07  0.129270  0.024533  0.154769
2000-06-08  0.441010  0.741781  0.470402
2000-06-09  0.682101  0.375660  0.009916
2000-06-10  0.754488  0.352293  0.339337

While Python list indexing, e.g. seq[start:end], includes start but not end, Pandas df.loc[start_date : end_date] includes both endpoints in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.


Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df["date"] = pd.to_datetime(df["date"]).
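For example, a minimal sketch (the filename data.csv and its date column are hypothetical):

# parse the "date" column as datetime64[ns] while reading the file
df = pd.read_csv("data.csv", parse_dates=["date"])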

Answer #3

apply, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

"If apply is so bad, then why is it in the API?"

DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.

Some of the things apply can do:

  • Run any user-defined function on a DataFrame or Series
  • Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
  • Perform index alignment while applying the function
  • Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
  • Perform element-wise transformations
  • Broadcast aggregated results to original rows (see the result_type argument).
  • Accept positional/keyword arguments to pass to the user-defined functions.

...Among others. For more information, see Row or Column-wise Function Application in the documentation.

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.

There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.



Let"s address the next question.

"How and when should I make my code apply-free?"

To rephrase, here are some common situations where you will want to get rid of any calls to apply.

Numeric Data

If you"re working with numeric data, there is likely already a vectorized cython function that does exactly what you"re trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).

Contrast the performance of apply for a simple addition operation.

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

<!- ->

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

Performance-wise, there's no comparison; the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even if you enable passing raw arrays with the raw argument, it's still twice as slow.

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another example:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, seek out vectorized alternatives if possible.


String/Regex

Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.

A common problem is to check whether a value in a column is present in another column of the same row.

df = pd.DataFrame({
    "Name": ["mickey", "donald", "minnie"],
    "Title": ["wonderland", "welcome to donald's castle", "Minnie mouse clubhouse"],
    "Value": [20, 10, 86]})
df

     Name  Value                       Title
0  mickey     20                  wonderland
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" columns.

Using apply, this would be done using

df.apply(lambda x: x["Name"].lower() in x["Title"].lower(), axis=1)

0    False
1     True
2     True
dtype: bool
 
df[df.apply(lambda x: x["Name"].lower() in x["Title"].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

However, a better solution exists using list comprehensions.

df[[y.lower() in x.lower() for x, y in zip(df["Title"], df["Name"])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

<!- ->

%timeit df[df.apply(lambda x: x["Name"].lower() in x["Title"].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df["Title"], df["Name"])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.

For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.

Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df["date"]) over, say, df["date"].apply(pd.to_datetime).

Read more at the docs.


A Common Pitfall: Exploding Columns of Lists

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

People are tempted to use apply(pd.Series). This is horrible in terms of performance.

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

A better option is to listify the column and pass it to pd.DataFrame.

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

<!- ->

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Lastly,

"Are there any situations where apply is good?"

Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.

Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.

df = pd.DataFrame(
         pd.date_range("2018-12-31","2019-01-31", freq="2D").date.astype(str).reshape(-1, 2), 
         columns=["date1", "date2"])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object
    

This is an admissible case for apply:

df.apply(pd.to_datetime, errors="coerce").dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.

%timeit df.apply(pd.to_datetime, errors="coerce")
%timeit pd.to_datetime(df.stack(), errors="coerce").unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors="coerce") for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors="coerce")

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can make a similar case for other operations such as string operations, or conversion to category.

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype("category"))

v/s

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype("category")

And so on...


Converting Series to str: astype versus apply

This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.

(Performance graph omitted.) The graph was plotted using the perfplot library:

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=["astype", "apply"],
    n_range=[2**k for k in range(1, 20)],
    xlabel="N",
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.


GroupBy operations with chained transformations

GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.

One common requirement is to perform a GroupBy and then two prime operations such as a "lagged cumsum":

df = pd.DataFrame({"A": list("aabcccddee"), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

<!- ->

You"d need two successive groupby calls here:

df.groupby("A").B.cumsum().groupby(df.A).shift()
 
0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

Using apply, you can shorten this to a single call.

df.groupby("A").B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).



Other Caveats

Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.

df = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"]
})

def func(x):
    print(x["A"])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed in 0.25; see here for more information).

Answer #4

To answer my own question, this functionality has been added to pandas in the meantime. Starting from pandas 0.15.0, you can use tz_localize(None) to remove the timezone resulting in local time.
See the whatsnew entry: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#timezone-handling-improvements

So with my example from above:

In [4]: t = pd.date_range(start="2013-05-18 12:00:00", periods=2, freq="H",
                          tz= "Europe/Brussels")

In [5]: t
Out[5]: DatetimeIndex(["2013-05-18 12:00:00+02:00", "2013-05-18 13:00:00+02:00"],
                       dtype="datetime64[ns, Europe/Brussels]", freq="H")

using tz_localize(None) removes the timezone information resulting in naive local time:

In [6]: t.tz_localize(None)
Out[6]: DatetimeIndex(["2013-05-18 12:00:00", "2013-05-18 13:00:00"], 
                      dtype="datetime64[ns]", freq="H")

Further, you can also use tz_convert(None) to remove the timezone information, but this converts to UTC first, yielding naive UTC time:

In [7]: t.tz_convert(None)
Out[7]: DatetimeIndex(["2013-05-18 10:00:00", "2013-05-18 11:00:00"], 
                      dtype="datetime64[ns]", freq="H")

This is much more performant than the datetime.replace solution:

In [31]: t = pd.date_range(start="2013-05-18 12:00:00", periods=10000, freq="H",
                           tz="Europe/Brussels")

In [32]: %timeit t.tz_localize(None)
1000 loops, best of 3: 233 µs per loop

In [33]: %timeit pd.DatetimeIndex([i.replace(tzinfo=None) for i in t])
10 loops, best of 3: 99.7 ms per loop

Answer #5

If you want to replace an empty string and records with only spaces, the correct answer is:

df = df.replace(r"^\s*$", np.nan, regex=True)

The accepted answer

df.replace(r"s+", np.nan, regex=True)

does not replace an empty string! You can try it yourself with the given example, slightly updated:

df = pd.DataFrame([
    [-0.532681, "foo", 0],
    [1.490752, "bar", 1],
    [-1.387326, "fo o", 2],
    [0.814772, "baz", " "],     
    [-0.222552, "   ", 4],
    [-1.176781,  "qux", ""],         
], columns="A B C".split(), index=pd.date_range("2000-01-01","2000-01-06"))

Note also that "fo o" is not replaced with NaN, though it contains a space. Further note that a simple:

df.replace(r"", np.NaN)

Does not work either - try it out.
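Applying the first pattern above to the example DataFrame is a one-liner (a sketch; this assumes numpy is imported as np, as in the example):

df = df.replace(r"^\s*$", np.nan, regex=True)
# the "   " in column B and the " " and "" in column C become NaN;
# "fo o" is left alone because it is not made up only of whitespace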

Answer #6

How to create sample datasets

This is mainly to expand on AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) NumPy give you a variety of tools for this such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing NumPy and Pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.

import numpy as np
import pandas as pd

np.random.seed(123)

A kitchen sink example

Here"s an example showing a variety of things you can do. All kinds of useful sample dataframes could be created from a subset of this:

df = pd.DataFrame({

    # some ways to create random data
    "a":np.random.randn(6),
    "b":np.random.choice( [5,7,np.nan], 6),
    "c":np.random.choice( ["panda","python","shark"], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to R's expand.grid(), see note 2 below
    "d":np.repeat( range(3), 2 ),
    "e":np.tile(   range(2), 3 ),

    # a date range and set of random dates
    "f":pd.date_range("1/1/2011", periods=6, freq="D"),
    "g":np.random.choice( pd.date_range("1/1/2011", periods=365,
                          freq="D"), 6, replace=False)
    })

This produces:

          a   b       c  d  e          f          g
0 -1.085631 NaN   panda  0  0 2011-01-01 2011-08-12
1  0.997345   7   shark  0  1 2011-01-02 2011-11-10
2  0.282978   5   panda  1  0 2011-01-03 2011-10-30
3 -1.506295   7  python  1  1 2011-01-04 2011-09-07
4 -0.578600 NaN   shark  2  0 2011-01-05 2011-02-27
5  1.651437   7  python  2  1 2011-01-06 2011-02-03

Some notes:

  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate R's expand.grid(), but is also more flexible in its ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for R's expand.grid(), see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here; a minimal version is sketched after this list. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of six dates from 2011. Additionally, by setting replace=False we can ensure these dates are unique -- very handy if we want to use this as an index with unique values.
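Here is a minimal sketch of such an expand.grid() replacement using itertools (the helper name expand_grid is just a convention, modeled on the cookbook recipe):

import itertools
import pandas as pd

def expand_grid(data_dict):
    # cartesian product of the value lists, one column per key
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=list(data_dict.keys()))

expand_grid({"d": range(3), "e": range(2)})   # 6 rows: every (d, e) pair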

Fake stock market data

In addition to taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here's a short example that combines np.tile and date_range to create sample ticker data for 4 stocks covering the same dates:

stocks = pd.DataFrame({
    "ticker":np.repeat( ["aapl","goog","yhoo","msft"], 25 ),
    "date":np.tile( pd.date_range("1/1/2011", periods=25, freq="D"), 4 ),
    "price":(np.random.randn(100).cumsum() + 10) })

Now we have a sample dataset with 100 rows (25 dates per ticker), but we have only used 4 lines of code to create it, making it easy for everyone else to reproduce without copying and pasting 100 rows. You can then display subsets of the data to help explain your question:

>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby("ticker").head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft

Answer #7

Use

Sample.objects.filter(date__range=["2011-01-01", "2011-01-31"])

Or if you are just trying to filter month wise:

Sample.objects.filter(date__year="2011", 
                      date__month="01")

Edit

As Bernhard Vallant said, if you want a queryset which excludes the specified range ends you should consider his solution, which utilizes gt/lt (greater-than/less-than).
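A sketch of that exclusive variant with gt/lt lookups (using the Sample model from the question):

import datetime

# strictly between the two dates; both endpoints excluded
Sample.objects.filter(date__gt=datetime.date(2011, 1, 1),
                      date__lt=datetime.date(2011, 1, 31))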

Answer #8

Here"s a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq="D")

columns = ["A","B", "C"]

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

To do this type of calculation for the data, use a numpy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9

Answer #9

Pandas is great for time series in general, and has direct support for date ranges.

For example pd.date_range():

import pandas as pd
from datetime import datetime

datelist = pd.date_range(datetime.today(), periods=100).tolist()

It also has lots of options to make life easier. For example if you only wanted weekdays, you would just swap in bdate_range.
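For instance, a minimal sketch of the weekday-only variant:

# business days (Mon-Fri) only, ending today
weekdays = pd.bdate_range(end=pd.Timestamp.today(), periods=100).tolist()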

See date range documentation

In addition it fully supports pytz timezones and can smoothly span spring/autumn DST shifts.

EDIT by OP:

If you need actual python datetimes, as opposed to Pandas timestamps:

import pandas as pd
from datetime import datetime

pd.date_range(end=datetime.today(), periods=100).to_pydatetime().tolist()

#OR

pd.date_range(start="2018-09-09",end="2020-02-02")

This uses the "end" parameter to match the original question, but if you want descending dates:

pd.date_range(datetime.today(), periods=100).to_pydatetime().tolist()

Answer #10

You could use Series.reindex:

import pandas as pd

idx = pd.date_range("09-01-2013", "09-30-2013")

s = pd.Series({"09-02-2013": 2,
               "09-03-2013": 10,
               "09-06-2013": 5,
               "09-07-2013": 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...
