Preface to the Second Edition
I am exceptionally proud of the first edition of Data Science from Scratch. It turned out very much the book I wanted it to be. But several years of developments in data science, of progress in the Python ecosystem, and of personal growth as a developer and educator have changed what I think a first book in data science should look like.
In life, there are no do-overs. In writing, however, there are second editions. Accordingly, I’ve rewritten all the code and examples using Python 3.6 (and many of its newly introduced features, like type annotations). I’ve woven into the book an emphasis on writing clean code. I’ve replaced some of the first edition’s toy examples with more realistic ones using “real” datasets. I’ve added new material on topics such as deep learning, statistics, and natural language processing, corresponding to things that today’s data scientists are likely to be working with. (I’ve also removed some material that seems less relevant.) And I’ve gone over the book with a fine-toothed comb, fixing bugs, rewriting explanations that are less clear than they could be, and freshening up some of the jokes.
The first edition was a great book, and this edition is even better. Enjoy!
Joel Grus
Data Science from Scratch
Data scientist has been called “the sexiest job of the 21st century,” presumably by someone who has never visited a fire station. Nonetheless, data science is a hot and growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly prognosticating that over the next 10 years, we’ll need billions and billions more data scientists than we currently have.
But what is data science? After all, we can’t produce data scientists if we don’t know what data science is. According to a Venn diagram that is somewhat famous in the industry, data science lies at the intersection of hacking skills, math and statistics knowledge, and substantive expertise.
Although I originally intended to write a book covering all three, I quickly realized that a thorough treatment of “substantive expertise” would require tens of thousands of pages. At that point, I decided to focus on the first two. My goal is to help you develop the hacking skills that you’ll need to get started doing data science. And my goal is to help you get comfortable with the mathematics and statistics that are at the core of data science.
This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is by hacking on things. By reading this book, you will get a good understanding of the way I hack on things, which may not necessarily be the best way for you to hack on things. You will get a good understanding of some of the tools I use, which will not necessarily be the best tools for you to use. You will get a good understanding of the way I approach data problems, which may not necessarily be the best way for you to approach data problems. The intent (and the hope) is that my examples will inspire you to try things your own way. All the code and data from the book is available on GitHub to get you started.
Similarly, the best way to learn mathematics is by doing mathematics. This is emphatically not a math book, and for the most part, we won’t be “doing mathematics.” However, you can’t really do data science without some understanding of probability and statistics and linear algebra. This means that, where appropriate, we will dive into mathematical equations, mathematical intuition, mathematical axioms, and cartoon versions of big mathematical ideas. I hope that you won’t be afraid to dive in with me.
Throughout it all, I also hope to give you a sense that playing with data is fun, because, well, playing with data is fun! (Especially compared to some of the alternatives, like tax preparation or coal mining.)
There are lots and lots of data science libraries, frameworks, modules, and toolkits that efficiently implement the most common (as well as the least common) data science algorithms and techniques. If you become a data scientist, you will become intimately familiar with NumPy, with scikit-learn, with pandas, and with a panoply of other libraries. They are great for doing data science. But they are also a good way to start doing data science without actually understanding data science.
In this book, we will be approaching data science from scratch. That means we’ll be building tools and implementing algorithms by hand in order to better understand them. I put a lot of thought into creating implementations and examples that are clear, well commented, and readable. In most cases, the tools we build will be illuminating but impractical. They will work well on small toy datasets but fall over on “web-scale” ones.
Throughout the book, I will point you to libraries you might use to apply these techniques to larger datasets. But we won’t be using them here.
There is a healthy debate raging over the best language for learning data science. Many people believe it’s the statistical programming language R. (We call those people wrong.) A few people suggest Java or Scala. However, in my opinion, Python is the obvious choice.
I am hesitant to call Python my favorite programming language. There are other languages I find more pleasant, better designed, or just more fun to code in. And yet pretty much every time I start a new data science project, I end up using Python. Every time I need to quickly prototype something that just works, I end up using Python. And every time I want to demonstrate data science concepts in a clear, easy-to-understand way, I end up using Python. Accordingly, this book uses Python.
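As a rough illustration of what building a tool "from scratch" in Python 3.6 with type annotations can look like, here is a minimal sketch. It is not code taken from the book; the names Vector, mean, and dot are assumptions chosen for this example, and a library such as NumPy would do the same work far more efficiently on large datasets.

```python
from typing import List

# A vector is represented as a plain list of floats (an assumption for this sketch).
Vector = List[float]


def mean(xs: Vector) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not xs:
        raise ValueError("mean requires at least one value")
    return sum(xs) / len(xs)


def dot(v: Vector, w: Vector) -> float:
    """Return the dot product: the sum of the componentwise products."""
    if len(v) != len(w):
        raise ValueError("vectors must be the same length")
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


if __name__ == "__main__":
    print(mean([1.0, 2.0, 3.0, 6.0]))             # 3.0
    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Hand-written functions like these are easy to read and reason about, which is the point of the exercise, but they are the kind of tool that works on toy data and falls over at scale.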
The goal of this book is not to teach you Python. (Although it is nearly certain that by reading this book you will learn some Python.) I’ll take you through a chapter-long crash course that highlights the features that are most important for our purposes, but if you know nothing about programming in Python (or about programming at all), then you might want to supplement this book with some sort of “Python for Beginners” tutorial.
Data Science from Scratch: First Principles with Python PDF
The remainder of our introduction to data science will take this same approach, going into detail where going into detail seems crucial or illuminating, and at other times leaving details for you to figure out yourself (or look up on Wikipedia).
Over the years, I’ve trained a number of data scientists. While not all of them have gone on to become world-changing data ninja rockstars, I’ve left them all better data scientists than I found them. And I’ve grown to believe that anyone who has some amount of mathematical aptitude and some amount of programming skill has the necessary raw materials to do data science. All she needs is an inquisitive mind, a willingness to work hard, and this book. Hence this book.
Data Science from Scratch: Book Reviews
Rob R, Colorado
This book is intended for those who know practically nothing at all about data analysis and at the same time have at least minimal programming experience. The latter is important - although there is a short introduction to Python at the beginning of the book, it is too short, and if you are a complete beginner, it is unlikely to save you.
A good introduction to the basics of data science: brief, clear, and to the point. Please do not take this book as a comprehensive guide. The translation is excellent, with valuable comments from the translator. The quality of the printing and illustrations is quite acceptable; better is not needed. You always have to pay more for the best, which is not the case here.
It is also interesting as an introduction to Python, a kind of introduction for advanced, already trained readers: a kickstart by example. The author shows the best aspects of the language, which make it possible not to bother with trifles but to get directly to the matter at hand. The author seems to say all the time, "Look, what a difficult thing," and then immediately gives an illustrative implementation and sums up: "Cool, isn't it?"
Good afternoon! For me, this book turned out to be strange: what I already knew is covered superficially, with nothing new, and what I did not understand remained a mystery.
James2001, University of Ohio
The book begins with a basic course on the Python language: nothing superfluous, just the important features of the language. There are chapters on algebra, statistics, and probability theory. They contain the information you need to understand the rest of the material.
Frequently used and very useful methods of collecting, processing, and combining data are presented. Machine learning, which is very popular now, is covered. The book also provides some examples of retrieving and analyzing data from social networks.
Among all the books on data science, this is a great foundational one. It examines the material extensively and in detail. It really helps non-Python programmers.
Overall, this is a good book. It covers a wide range of topics and is detailed enough that the reader can easily implement the basic concepts, without being so detailed that they get stuck in little things.
However, it assumes that the reader is ready to tackle the subject. The reader is assumed to have a general functional knowledge of Python, statistics, and data concepts. I don't necessarily agree with the other reviewer that this is PhD-level material, but the reader should probably have a Bachelor of Science or CSci degree to fully appreciate this book.
R. Ray S.
A really good book. However, I wouldn't necessarily say from scratch. You will need a grounding in statistics, advanced algebra, and Python, and perhaps a computer science degree along with some time as a programming intern.
Robert P. Sedor
If you are not yet familiar with the principles of data science, then this is the book for you before you take on one of the tomes on scikit-learn or TensorFlow. There is some math, but if you have been through engineering school it won't be too difficult. Really good at concepts.
I teach to write
I was an academic data scientist decades before the term was coined, and I love this book. I started with SAS and S (the mother of R) over 25 years ago and picked up some Python along the way (as a general programming language), but never used it for data science before this book. As a statistician and a data management and visualization expert, I am impressed by both the breadth and depth of the coverage, especially since it fits in fewer than 400 pages. The text is clear from start to finish, but where it really shines is in its links to other resources. There are places where I would have liked to see more references to important statistics books, but what is suggested is good. If you want to get into data science and want to use Python instead of R, this is a great place to start.
At the beginning of this review, let me say that I respect the author and the simple ploy of attempting to tackle a PhD-level topic in one book. I've been a SW engineer for 15 years, with an engineering degree, and many years ago I did some research on GA and image segmentation ... and found this book challenging.
When I was in college, the Comp Sci joke was that it was really just "applied math." It wasn't very funny, but it's damn true, and doubly true when it comes to data science, AI, and ML, which are all fancy words for "applied statistics."
So the book offers a background on various topics such as statistics, probability theory, linear algebra, programming and touches on some topics of calculus. If you are not familiar with these things, you cannot learn them from one book or all at once. It's just too much.
The author leaves a lot of breadcrumbs to follow. If you're not particularly good at any of the topics I've mentioned, then you should read some of the books he recommends (most are online).
Beyond that, the book is a solid tour of data science, and surprisingly, there is quite a lot of material on AI / ML. Everything is well done. The only complaint I have is that I would have liked to have more equations alongside the code. I know equations scare people, but this is data science.
TLDR? This is a serious book for serious people. There are no easy answers, but it is a great education.
The book offers a good approach to covering the basic statistics found in an introductory data science course and includes Python exercises (and you should really know a little bit of Python to get started) that help you write functions to apply these methods. By dealing with statistics and providing coding exercises, it makes it easier for a reader to move on to more advanced statistics and programming.
This is a good book. It has been updated for Python 3 so you don't miss out on any of the new features.
It is definitely not for absolute beginners. If you have a basic understanding of Python, you will understand some of the concepts outlined.