Preface to the Second Edition
I am exceptionally proud of the first edition of Data Science from Scratch. It turned out very much the book I wanted it to be. But several years of developments in data science, of progress in the Python ecosystem, and of personal growth as a developer and educator have changed what I think a first book in data science should look like.
In life, there are no do-overs. In writing, however, there are second editions. Accordingly, I’ve rewritten all the code and examples using Python 3.6 (and many of its newly introduced features, like type annotations). I’ve woven into the book an emphasis on writing clean code. I’ve replaced some of the first edition’s toy examples with more realistic ones using “real” datasets. I’ve added new material on topics such as deep learning, statistics, and natural language processing, corresponding to things that today’s data scientists are likely to be working with. (I’ve also removed some material that seems less relevant.) And I’ve gone over the book with a fine-toothed comb, fixing bugs, rewriting explanations that are less clear than they could be, and freshening up some of the jokes.
The first edition was a great book, and this edition is even better. Enjoy!
Joel Grus
Joel Grus is a software engineer at Google. Prior to that, he did analytical work at several startups. He actively participates in informal events for data scientists and is always available on Twitter at @joelgrus.
Technical leader, software engineer, data scientist, best-selling author, skilled communicator, and strong generalist. Enthusiastic about well-designed software, clean code, teaching and mentoring, extracting value from data, AI / ML / NLP / whatever, and functional programming.
Citations by Joel Grus
Created and led a team focused on research, design and implementation of machine learning, research and data products for the investment group. I spend a lot of time thinking about the relevance of research, a lot of time thinking about applied NLP, a lot of time thinking about software design, a lot of time thinking about how to use ML to solve business problems, and a lot of time thinking about how to grow and nurture talent.
Over the years, I've invented and/or collected various smart and/or stupid ways to solve Fizz Buzz. In this book, there are ten solutions that I found particularly interesting, each inspiring a discussion of different aspects of coding, Python, testing, Fizz Buzz, math, software design, technical interviews, and related topics.
I make a small number of angel investments. Almost without exception, these are companies where (1) I know the founders, (2) I am passionate about the product and the problem, and (3) I have unique experience to contribute. If your startup is none of these, it's extremely unlikely that I'll invest. But I wish you good luck!
Data Science from Scratch
Data scientist has been called “the sexiest job of the 21st century,” presumably by someone who has never visited a fire station. Nonetheless, data science is a hot and growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly prognosticating that over the next 10 years, we’ll need billions and billions more data scientists than we currently have.
But what is data science? After all, we can’t produce data scientists if we don’t know what data science is. According to a Venn diagram that is somewhat famous in the industry, data science lies at the intersection of hacking skills, math and statistics knowledge, and substantive expertise.
Although I originally intended to write a book covering all three, I quickly realized that a thorough treatment of “substantive expertise” would require tens of thousands of pages. At that point, I decided to focus on the first two. My goal is to help you develop the hacking skills that you’ll need to get started doing data science. And my goal is to help you get comfortable with the mathematics and statistics that are at the core of data science.
This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is by hacking on things. By reading this book, you will get a good understanding of the way I hack on things, which may not necessarily be the best way for you to hack on things. You will get a good understanding of some of the tools I use, which will not necessarily be the best tools for you to use. You will get a good understanding of the way I approach data problems, which may not necessarily be the best way for you to approach data problems. The intent (and the hope) is that my examples will inspire you to try things your own way. All the code and data from the book is available on GitHub to get you started.
Similarly, the best way to learn mathematics is by doing mathematics. This is emphatically not a math book, and for the most part, we won’t be “doing mathematics.” However, you can’t really do data science without some understanding of probability and statistics and linear algebra. This means that, where appropriate, we will dive into mathematical equations, mathematical intuition, mathematical axioms, and cartoon versions of big mathematical ideas. I hope that you won’t be afraid to dive in with me.
Throughout it all, I also hope to give you a sense that playing with data is fun, because, well, playing with data is fun! (Especially compared to some of the alternatives, like tax preparation or coal mining.)
Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they're also a good way to dive into the discipline without actually understanding data science. In this book, you'll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch. If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Today's messy glut of data holds answers to questions no one's ever thought to ask. This book provides you with the know-how to dig those answers out.
Data scientists are those who know more about statistics than a computer scientist and more about computer science than a statistician. They are also those who can make sense of messy data. Data scientists are not necessarily statisticians, nor do they have to be PhDs. However, they should be able to understand which statistical methods are used and why they were chosen.
For example, the dating site OkCupid asks its members to answer hundreds of questions in order to match them with potential partners. However, it also analyzes these answers to find innocuous-sounding questions that indicate how likely someone is to have sex on a first date. Similarly, Facebook asks users to provide information about themselves, such as their hometowns and current locations, ostensibly so that people can easily find and connect with them. However, it also analyzes this information to detect global migration patterns and where fans of different football teams live. As a large retailer, Target tracks your purchases and activities, both online and in store, and uses this data to build predictive models of what you might buy based on your past purchases.
Data Science From Scratch - First Principles
There are lots and lots of data science libraries, frameworks, modules, and toolkits that efficiently implement the most common (as well as the least common) data science algorithms and techniques. If you become a data scientist, you will become intimately familiar with NumPy, with scikit-learn, with pandas, and with a panoply of other libraries. They are great for doing data science. But they are also a good way to start doing data science without actually understanding data science.
In this book, we will be approaching data science from scratch. That means we’ll be building tools and implementing algorithms by hand in order to better understand them. I put a lot of thought into creating implementations and examples that are clear, well commented, and readable. In most cases, the tools we build will be illuminating but impractical. They will work well on small toy datasets but fall over on “web-scale” ones.
Throughout the book, I will point you to libraries you might use to apply these techniques to larger datasets. But we won’t be using them here. There is a healthy debate raging over the best language for learning data science. Many people believe it’s the statistical programming language R. (We call those people wrong.) A few people suggest Java or Scala. However, in my opinion, Python is the obvious choice.
I am hesitant to call Python my favorite programming language. There are other languages I find more pleasant, better designed, or just more fun to code in. And yet pretty much every time I start a new data science project, I end up using Python. Every time I need to quickly prototype something that just works, I end up using Python. And every time I want to demonstrate data science concepts in a clear, easy-to-understand way, I end up using Python. Accordingly, this book uses Python.
The goal of this book is not to teach you Python. (Although it is nearly certain that by reading this book you will learn some Python.) I’ll take you through a chapter-long crash course that highlights the features that are most important for our purposes, but if you know nothing about programming in Python (or about programming at all), then you might want to supplement this book with some sort of “Python for Beginners” tutorial.
Data Science from Scratch: First Principles with Python
The remainder of our introduction to data science will take this same approach— going into detail where going into detail seems crucial or illuminating, at other times leaving details for you to figure out yourself (or look up on Wikipedia). Over the years, I’ve trained a number of data scientists. While not all of them have gone on to become world-changing data ninja rockstars, I’ve left them all better data scientists than I found them. And I’ve grown to believe that anyone who has some amount of mathematical aptitude and some amount of programming skill has the necessary raw materials to do data science. All she needs is an inquisitive mind, a willingness to work hard, and this book. Hence this book.
Data Science from Scratch: Book Reviews
Rob R, Colorado
This book is intended for those who know practically nothing at all about data analysis and at the same time have at least minimal programming experience. The latter is important - although there is a short introduction to Python at the beginning of the book, it is too short, and if you are a complete beginner, it is unlikely to save you.
A good introduction to the basics of data science: brief, clear, and to the point. Please do not take this book as a comprehensive guide. Excellent translation, with valuable comments from the translator. The quality of the printing and illustrations is quite acceptable; nothing better is needed. You always have to pay more for the best, which is not the case here.
It will be interesting as an introduction to Python, a kind of kickstart by example for advanced, experienced readers. The author shows the best aspects of the language, which make it possible to skip the trifles and get straight to the matter. The author seems to say all the time: "Look, what a difficult thing," - and immediately gives an illustrative example of an implementation and sums up: "Cool, isn't it?".
Good afternoon! For me, this book turned out to be strange: what I already knew is covered superficially, there is nothing new, and what I did not understand remained a mystery.
James2001, University of Ohio
At the beginning of the book, a basic course on the Python language is given, nothing superfluous, all the important features of the language. There are chapters on algebra, statistics, probability theory. It contains the information you need to understand.
Frequently used and very useful methods of collecting, processing, and combining data are presented. Machine learning, which is very popular now, is covered. The book provides some examples of retrieving and analyzing data from social networks.
Among all the books on data science, this is a great foundational one. It examines the material extensively and in detail. It really helps non-Python programmers.
Overall, this is a good book. It covers a wide range of topics and is so detailed that the reader can easily implement the basic concepts without going into so much detail that they get stuck in little things.
However, it assumes that the reader is ready to tackle the subject. The reader is assumed to have a general working knowledge of Python, statistics, and data concepts. I don't necessarily agree with the other reviewer that this is PhD-level material, but the reader should probably have a Bachelor of Science or CSci degree to fully appreciate this book.
R. Ray S.
A really good book. However, I wouldn't necessarily say from scratch. You will need a grounding in statistics, advanced algebra, Python, and perhaps a computer science degree while you are a programming intern.
Robert P. Sedor
If you are not yet familiar with the principles of data science, this is the book to read before picking up one of the tomes on scikit-learn or TensorFlow. There is some math, but if you have been through engineering school it won't be too difficult. It is really good on concepts.
I teach to write
I was an academic data scientist decades before the term was coined, and I love this book. I started with SAS and S (the mother of R) over 25 years ago and brought some Python with me (as a general programming language), but never used it for data science before this book. As a statistician and a data management and visualization expert, I am impressed by both the breadth and depth of the coverage, especially since the book is under 400 pages. The text is clear from start to finish, but where it really shines is in its links to other resources. There are places where I would have liked to see more references to important statistics books, but what is suggested is good. If you want to get into data science and want to use Python instead of R, this is a great place to start.
At the beginning of this review, let me say that I respect the author and the attempt to tackle a PhD-level topic in one book. I've been a SW engineer for 15 years, with an engineering degree, and many years ago I did some research on GA and image segmentation ... and found this book challenging.
When I was in college, the Comp Sci joke was that it was really just "applied math". It wasn't very funny, but it's damn true, and doubly true when it comes to data science, AI and ML, which are all fancy words for "applied statistics."
So the book offers a background on various topics such as statistics, probability theory, linear algebra, programming and touches on some topics of calculus. If you are not familiar with these things, you cannot learn them from one book or all at once. It's just too much.
The author leaves a lot of breadcrumbs to follow. If you're not particularly good at any of the topics I've mentioned, then you should read some of the books he recommends (most are online).
After that, the book is a solid tour of data science, and surprisingly, there is quite a lot of material on AI / ML. Everything is well done. The only complaint I have is that I would have liked to have more equations besides code. I know equations scare people, but this is data science.
TLDR? This is a serious book for serious people. There are no easy answers, but it is a great education.
The book offers a good approach to covering the basic statistics found in an introductory data science course and includes Python exercises (and you should really know a little bit of Python to get started) that have you write functions to apply these methods. By dealing with statistics and providing coding exercises, it makes it easier for a reader to move on to more advanced statistics and programming.
This is a good book. It has been updated for Python 3 so you don't miss out on any of the new features.
It is definitely not for absolute beginners. If you have a basic understanding of Python, you will understand some of the concepts outlined.
Dr. Howard B. Bandy
The book begins with an overview of Python programming, followed by a crash course in Python. Then comes a recommendation of the Anaconda distribution, which is free and includes Python 2.7, NumPy, SciPy, Matplotlib, and IPython. These are used throughout the book, and they are also included in the package. pandas is also included, and we will use this to work with financial data. This is a good place to start if you are new to Python, because it gives you a solid foundation in the language.
Chapters 4, 5, and 6 are quick reviews of linear algebra and the Python data structures.
Chapter 7 discusses hypothesis testing and inference, and has nice discussions of the Beta Distribution and how it can be used to describe prior distributions.
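As a rough illustration of how a Beta distribution can describe a prior, here is a minimal Python sketch. The function names (`beta_normalizer`, `beta_pdf`, `update_prior`, `beta_mean`) and the coin-flip counts are assumptions made for this example, not necessarily the book's own code. The update rule is the standard one for coin-flip data: add observed heads to alpha and observed tails to beta.

```python
import math
from typing import Tuple


def beta_normalizer(alpha: float, beta: float) -> float:
    """Normalizing constant that makes the Beta density integrate to 1."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)


def beta_pdf(x: float, alpha: float, beta: float) -> float:
    """Density of the Beta(alpha, beta) distribution at x."""
    if x <= 0 or x >= 1:  # no weight outside the open interval (0, 1)
        return 0.0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_normalizer(alpha, beta)


def update_prior(alpha: float, beta: float, heads: int, tails: int) -> Tuple[float, float]:
    """Posterior parameters after observing coin flips (hypothetical helper)."""
    return alpha + heads, beta + tails


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of the Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Start from a flat prior, Beta(1, 1), then observe 7 heads and 3 tails.
prior = (1.0, 1.0)
posterior = update_prior(*prior, heads=7, tails=3)
```

Here `posterior` holds the parameters of the updated distribution, and `beta_mean(*posterior)` gives its center. A larger alpha and beta in the prior would express a stronger initial belief and make the data move the estimate less.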
Chapter 8 gets into data science proper with a description of gradient descent and how it can be applied to find the set of parameters that minimizes (or maximizes) a function.
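To make the idea of gradient descent concrete, here is a minimal from-scratch sketch in Python. The names (`estimate_gradient`, `gradient_descent`, `sum_of_squares`), the learning rate, and the step count are assumptions chosen for illustration and are not taken from the book. The gradient is approximated numerically with central differences, which is simple but slow.

```python
from typing import Callable, List

Vector = List[float]


def estimate_gradient(f: Callable[[Vector], float], v: Vector, h: float = 1e-6) -> Vector:
    """Approximate the gradient of f at v using central differences."""
    gradient = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + h] + v[i + 1:]
        down = v[:i] + [v[i] - h] + v[i + 1:]
        gradient.append((f(up) - f(down)) / (2 * h))
    return gradient


def gradient_descent(f: Callable[[Vector], float],
                     start: Vector,
                     learning_rate: float = 0.1,
                     steps: int = 1000) -> Vector:
    """Repeatedly take a small step against the gradient to minimize f."""
    v = list(start)
    for _ in range(steps):
        gradient = estimate_gradient(f, v)
        v = [v_i - learning_rate * g_i for v_i, g_i in zip(v, gradient)]
    return v


def sum_of_squares(v: Vector) -> float:
    """Toy objective whose minimum is at the zero vector."""
    return sum(x * x for x in v)


# Starting point is arbitrary; the result should end up very close to [0, 0, 0].
minimum = gradient_descent(sum_of_squares, [3.0, -4.0, 5.0])
```

To maximize a function instead, the same loop can be run on its negation. A fixed learning rate works for this toy objective, but real problems usually need it tuned.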
I thoroughly enjoyed this book, one of my favorite books ever on programming. It does three things superbly: it covers the basics of the low-level tools of a computer scientist (the "from scratch" part), gives a great overview of useful Python programming examples for those new to the language, and gives an amazingly concise yet high-level overview of the math and statistics required for data scientists. At first I thought this book would be too much for me because of the jokes throughout the text, which I thought might keep up for the rest of it. But that didn't happen, and it turned out to be a very reasonable way to get used to this complex topic.
The data in this chapter is like all the data everybody uses in their examples: totally useless. Randomly generated digits, endless use of the coin toss example, typical artificial data that no one analyzes on a daily basis. The book starts off by introducing you to a fictitious social networking platform for data science professionals, which was very promising, and I was excited to see how this "personality" would handle the data problems they'd face... Spoiler: it is barely mentioned. Most examples are rife with typical statistics 101 and randomly generated data, which was yet another disappointment in the end.
Although this book is designed to be an introduction, I actually purchased it as a second resource. I took online courses in data science and machine learning and did well. The problem was that the assignments were often very large and gave you only little windows to insert code into. Their code did the heavy lifting. They would pat you on the back for completing the assignment, but left you feeling unsatisfied because you couldn't do it yourself. That is why I picked up the book. I wanted a book that could help me get started writing code to solve simple problems on my own without having to rely on someone else's code.