# Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:

```python
# Note: in scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection.
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
```

However, I'd like to stratify my training dataset. How do I do that? I've been looking into the `StratifiedKFold` method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.
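For reference, recent scikit-learn versions let `train_test_split` take a `stratify` argument, so a call along the lines of `sklearn.model_selection.train_test_split(X, y, test_size=0.25, stratify=y)` should give a 75%/25% split that preserves the class proportions. The sketch below only illustrates what stratification does, using the standard library; `stratified_split` and the toy data are hypothetical names made up for this example, and this is not how scikit-learn implements it.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_size=0.25, seed=0):
    """Split samples so each label keeps roughly the same share in both parts.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Send a test_size share of every class to the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    X_train, y_train = pick(train_idx)
    X_test, y_test = pick(test_idx)
    return X_train, X_test, y_train, y_test


# Toy data (assumed): 80 samples of class "a" and 20 of class "b".
X = list(range(100))
y = ["a"] * 80 + ["b"] * 20
X_train, X_test, y_train, y_test = stratified_split(X, y)
print(len(X_train), len(X_test))              # 75 25
print(y_train.count("b"), y_test.count("b"))  # 15 5
```

Both parts keep the original 80/20 class ratio, which is the property a plain random split does not guarantee.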

## Stratified Train/Test-split in scikit-learn: Related Questions

How do you split a list into evenly sized chunks?

By jespern

I have a list of arbitrary length, and I need to split it up into equal size chunks and operate on it. There are some obvious ways to do this, like keeping a counter and two lists, and when the second list fills up, add it to the first list and empty the second list for the next round of data, but this is potentially extremely expensive.

I was wondering if anyone had a good solution to this for lists of any length, e.g. using generators.

I was looking for something useful in `itertools` but I couldn't find anything obviously useful. Might've missed it, though.

Here's a generator that yields the chunks you want:

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
```

```python
import pprint
pprint.pprint(list(chunks(range(10, 75), 10)))
[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]
```

If you're using Python 2, you should use `xrange()` instead of `range()`:

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in xrange(0, len(lst), n):
        yield lst[i:i + n]
```

You can also use a list comprehension instead of writing a function, though it's a good idea to encapsulate operations like this in named functions so that your code is easier to understand. Python 3:

```python
[lst[i:i + n] for i in range(0, len(lst), n)]
```

Python 2 version:

```python
[lst[i:i + n] for i in xrange(0, len(lst), n)]
```

If you want something super simple:

```python
def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))
```

Use `xrange()` instead of `range()` in the case of Python 2.x.
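The slicing-based versions above need a sequence that supports `len()` and slicing. For an arbitrary iterable such as a generator, one possible approach is to pull items with `itertools.islice`. This is a minimal sketch, not part of the original answers; `iter_chunks` is a hypothetical name chosen for illustration.

```python
from itertools import islice


def iter_chunks(iterable, n):
    """Yield lists of up to n items from any iterable (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice consumes at most n items from the shared iterator.
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


# Works on a generator, which has no len() and cannot be sliced.
squares = (i * i for i in range(7))
result = list(iter_chunks(squares, 3))
print(result)  # [[0, 1, 4], [9, 16, 25], [36]]
```

Like the generator answers, this yields a shorter final chunk rather than padding it.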

Directly from the (old) Python documentation (recipes for itertools):

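```python
# Python 2 only: izip does not exist in Python 3.
from itertools import izip, chain, repeat

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)
```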

The current version, as suggested by J.F.Sebastian:

```python
#from itertools import izip_longest as zip_longest # for Python 2.x
from itertools import zip_longest # for Python 3.x
#from six.moves import zip_longest # for both (uses the six compat library)

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)
```

I guess Guido's time machine works, worked, will work, will have worked, was working again.

These solutions work because `[iter(iterable)]*n` (or the equivalent in the earlier version) creates one iterator, repeated `n` times in the list. `izip_longest` then effectively performs a round-robin of "each" iterator; because this is the same iterator, it is advanced by each such call, resulting in each such zip-roundrobin generating one tuple of `n` items.
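The shared-iterator behaviour described above can be checked directly. The snippet below is a small sketch for Python 3 (so it uses `zip_longest`); the variable names are illustrative only.

```python
from itertools import zip_longest

letters = iter("abcdefg")
# Three references to the SAME iterator object, not three copies.
same_three = [letters] * 3
assert same_three[0] is same_three[1] is same_three[2]

# Each output tuple pulls three consecutive items from that one iterator.
grouped = list(zip_longest(*same_three, fillvalue="x"))
print(grouped)  # [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'x', 'x')]

# With three independent iterators, zip_longest pairs items up instead.
independent = [iter("abcdefg") for _ in range(3)]
first = next(zip_longest(*independent))
print(first)  # ('a', 'a', 'a')
```

The contrast between the two results shows that the grouping comes from reusing one iterator, not from `zip_longest` itself.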
