[MUSIC PLAYING] FRANCOIS CHOLLET: Hi, I'm Francois.
I'm a software engineer on the Keras team.
In this session, we'll talk about data preprocessing for machine learning-- what it is, what the challenges are, and how the Keras API can make it easier for you.
There are two big use cases for data preprocessing.
The first one is what's called data vectorization.
If your input data contains text or categorical values, you cannot feed it directly into a neural network.
Neural networks can only process numerical values.
And further, it's usually a good idea to normalize or rescale numerical inputs to restrict them to small values, typically in the zero-to-one range.
The second use case is data augmentation.
Data augmentation is mainly used when processing images.
The idea is to generate random variations of an incoming image on the fly so as to expose your model to a greater diversity of inputs during training.
It's a great way to make the most out of small datasets and train models that generalize better to new images.
A key challenge with data preprocessing is what's known as training-serving skew.
Machine-learning models can only make sense of inputs that stay very close to what they've seen before.
For this reason, it's very important that the preprocessing setup that you're using when you deploy your model in production stays very close to the preprocessing setup you used when you originally trained your model.
For instance, if you deploy a text classification model in a mobile app, you're going to need to make sure to recreate in your mobile app the same text-encoding system used at training time.
This can potentially get very challenging.
Any small discrepancy can have a large impact on real-world performance.
And this is something that we've repeatedly observed with production systems at Google.
This is a very difficult and very important challenge.
We designed the Keras preprocessing layers API specifically to address this challenge.
Keras preprocessing layers are modular building blocks that encapsulate common preprocessing steps such as vectorizing text, rescaling image values, hashing category features, and so on.
The key feature of this API is that it enables you to place preprocessing computation either in your data pipeline or directly into your model.
This means that you can create end-to-end models that are capable of processing raw inputs such as text or structured data dictionaries.
And these models can be deployed as-is.
That way, you don't have to reimplement your preprocessing logic when you deploy in a new environment.
Because the preprocessing logic is part of the model itself, you're guaranteed that you're using the same logic in production as what you used during training.
We'll dive deeper into this feature in a couple of slides.
To make things concrete, let's look at a practical example.
We're going to do text preprocessing for a sentiment classification model.
In this example, we look at text files on disk.
Each text file contains a movie review from the Internet Movie Database.
We have two folders.
One folder with reviews that had a high-star rating associated with them, positive reviews, and one folder with reviews that had a low-star rating, negative reviews.
Here are the steps of our workflow.
We start with text files.
We set up a tf.data pipeline to read the files on disk and turn them into a dataset object that outputs string tensors. Then we use Keras preprocessing layers to turn the string tensors into numerical values.
Finally, we're able to feed these numerical values into a Keras model for classification.
The first step is super easy.
We can do it in one line with the utility text_dataset_from_directory.
It will look at the subfolders of our reviews directory, it will list the text files in these folders, and it will interpret each subfolder as containing examples for one category in a classification problem.
It will then create a tf.data Dataset object that outputs text strings and their corresponding labels.
Here we have two subfolders, positive and negative.
So we'll get a binary classification dataset.
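This first step can be sketched as follows. The snippet builds a tiny fake reviews directory so it is self-contained; in practice you would point the utility at your real data on disk. It assumes TF 2.6+, where the utility lives at tf.keras.utils.text_dataset_from_directory (older releases had it under tf.keras.preprocessing).

```python
# Sketch of step 1: folders of text files -> a labeled tf.data dataset.
import os
import tempfile
import tensorflow as tf

# Hypothetical miniature "reviews" directory: one subfolder per class.
root = os.path.join(tempfile.mkdtemp(), "reviews")
for label, text in [("positive", "a wonderful film"),
                    ("negative", "a terrible film")]:
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "0.txt"), "w") as f:
        f.write(text)

# Two subfolders -> a binary classification dataset.
train_ds = tf.keras.utils.text_dataset_from_directory(root, batch_size=2)

texts, labels = next(iter(train_ds))
print(texts)   # a batch of string tensors
print(labels)  # integer labels; classes are sorted alphabetically,
               # so 0 = negative, 1 = positive
```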
Next, we need to encode the text strings into something that can be processed by a neural network.
This involves several steps.
First, we need to standardize each string by converting text to lowercase and removing the punctuation.
Next, we need to split the string into individual words, which is called tokenization.
Then we need to build a vocabulary that maps unique words to indices in a dictionary.
Once we have that vocabulary, we can look up the words that compose each string and convert the string into a sequence of integer indices.
Finally, we'll turn our indices into numerical vectors.
There are multiple ways to do this.
But one of the simplest ways is to apply multi-hot encoding to the indices to turn each string into a single binary vector.
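To make these four steps concrete, here is a plain-Python sketch of the same transformation (standardize, tokenize, build a vocabulary, multi-hot encode). The helper names are illustrative; in practice the TextVectorization layer does all of this for you inside TensorFlow.

```python
import re

def standardize(s):
    # Lowercase and strip punctuation.
    return re.sub(r"[^\w\s]", "", s.lower())

def tokenize(s):
    # Split on whitespace into individual words.
    return s.split()

corpus = ["A great, great movie!", "A terrible movie."]
tokens = [tokenize(standardize(s)) for s in corpus]

# Vocabulary: unique words -> integer indices.
vocab = {w: i for i, w in enumerate(sorted({w for t in tokens for w in t}))}

def multi_hot(words):
    # A binary vector: 1 if the word is present, regardless of count.
    vec = [0] * len(vocab)
    for w in words:
        vec[vocab[w]] = 1
    return vec

print(vocab)                  # {'a': 0, 'great': 1, 'movie': 2, 'terrible': 3}
print(multi_hot(tokens[0]))   # [1, 1, 1, 0]
```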
Keras makes this entire process easy via the TextVectorization layer.
It offers a range of common options for text encoding, including different ways to customize the standardization step and the tokenization step, and different ways to turn indices into vectors.
In this case, well consider the top 10,000 most common words in the dataset.
We'll standardize by converting the text to lowercase and stripping the punctuation.
And we'll split the text on whitespace.
Finally, we'll encode the indices via multi-hot encoding.
Before we can start using the layer, we have to learn the vocabulary.
And we do so by using the adapt method.
Here, we call adapt with a dataset that outputs the text samples of the training data and discards the labels.
As you can see, once the layer has learned the vocabulary, it's capable of turning a string into a binary vector that encodes which words were present in the string.
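Here is a sketch of the layer configured as in the talk, assuming TF 2.6+ where the layer lives at tf.keras.layers.TextVectorization and the multi-hot output mode is spelled "multi_hot" (older releases called it "binary"). The two-sentence training corpus is a stand-in for the real review dataset.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Top-10,000-word vocabulary; default lowercase + punctuation-stripping
# standardization; whitespace tokenization; multi-hot output vectors.
vectorize_layer = TextVectorization(
    max_tokens=10000,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode="multi_hot",
)

# adapt() learns the vocabulary from raw strings (a dataset, array, or list).
train_texts = tf.constant(["A great, great movie!", "A terrible movie."])
vectorize_layer.adapt(train_texts)

# Once adapted, the layer maps a string to a binary vector encoding
# which vocabulary words were present.
out = vectorize_layer(tf.constant(["a great movie"]))
print(out)
```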
Here's an important issue when preprocessing data, in particular text data.
You could do it sequentially, by preprocessing one batch of data on the CPU and then feeding it to your model running on the GPU.
But then your GPU will end up being idle a lot of the time.
It will be waiting for the preprocessing stage to be done with the data batch before it can start looking at it.
The solution is to do preprocessing and training in parallel, asynchronously.
TensorFlow has a great API to do this: tf.data.
You can map preprocessing computation into your tf.data datasets to be handled asynchronously on the CPU while your GPU is processing the previous batch.
Lets take a look.
You can simply take your labeled dataset and map your vectorization layer into it, while specifying how many threads of parallel computation you want to use for the CPU preprocessing.
If you were to do it in pure Python without TensorFlow, it would be quite challenging.
But TensorFlow solves the problem in a single line of code.
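A minimal sketch of that one line, with a small in-memory dataset standing in for the real labeled review dataset, and a freshly adapted TextVectorization layer standing in for the one built earlier:

```python
import tensorflow as tf

# Stand-in for an already-adapted TextVectorization layer.
vectorize_layer = tf.keras.layers.TextVectorization(output_mode="multi_hot")
vectorize_layer.adapt(["a great movie", "a terrible movie"])

# Stand-in for the labeled (text, label) dataset from earlier.
labeled_ds = tf.data.Dataset.from_tensor_slices(
    (["a great movie", "a terrible movie"], [1, 0])
).batch(2)

# Map the vectorization into the pipeline. num_parallel_calls spreads
# the CPU preprocessing over threads; prefetch overlaps it with the
# accelerator's work on the previous batch.
train_ds = labeled_ds.map(
    lambda text, label: (vectorize_layer(text), label),
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)

for x, y in train_ds.take(1):
    print(x.shape, y)  # preprocessed multi-hot vectors and their labels
```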
At this point, we have a dataset that outputs preprocessed text samples and their labels.
I said earlier that preprocessing layers enable you to place your preprocessing either in the data pipeline, which is great during training, or in the model itself, which is great for inference since it bypasses the training-serving skew problem.
And here's how it works.
For training, we construct a model that expects preprocessed inputs.
And for inference, we construct a model that expects raw string inputs.
This model includes the text vectorization layer as the first layer in the model.
Finally, we can train the model.
And once it's trained, we can use our end-to-end model for inference on raw strings.
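The two-model pattern just described can be sketched as follows. The layer sizes and the tiny adapt corpus are illustrative stand-ins, not from the talk; the point is that the same adapted TextVectorization layer is reused as the first layer of the inference model.

```python
import tensorflow as tf

# Adapted vectorization layer (stand-in for the one built earlier).
vectorize_layer = tf.keras.layers.TextVectorization(output_mode="multi_hot")
vectorize_layer.adapt(["a great movie", "a terrible movie"])
vocab_size = len(vectorize_layer.get_vocabulary())

# Training model: expects already-vectorized (multi-hot) inputs, so it
# can be fed by the asynchronous tf.data pipeline during training.
inputs = tf.keras.Input(shape=(vocab_size,))
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(inputs)
training_model = tf.keras.Model(inputs, outputs)
training_model.compile(optimizer="adam", loss="binary_crossentropy")

# End-to-end inference model: raw strings in, predictions out.
# The vectorization layer is packaged inside the model itself.
raw_inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
vectorized = vectorize_layer(raw_inputs)
end_to_end_model = tf.keras.Model(raw_inputs, training_model(vectorized))

preds = end_to_end_model(tf.constant([["a great movie"]]))
print(preds.shape)  # one probability per input string
```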
And that's the end of our talk.
And remember, the key feature of Keras preprocessing layers is their versatility.
You can use them in a data pipeline to do asynchronous parallel preprocessing during training.
And you can also use them directly as part of the model to create end-to-end models that package their own preprocessing.
And that's great for inference, since it enables you to bypass the training-serving skew problem.
To learn more about what you can do with Keras preprocessing layers, check out our guides on keras.io.
The link is on this slide.
Thanks for watching.