Data science is an interdisciplinary field in which techniques from statistics, mathematics, and computer science are used to analyze data and solve problems more accurately and effectively. It is no wonder, then, that languages such as R and Python, with their extensive packages and libraries that support statistical methods and machine learning algorithms, are cornerstones of the data science revolution. Beginners often find it hard to decide which language to learn first. This guide will help you make that decision.
Let’s take a look at the languages.
R is an open-source statistical computing language created by Ross Ihaka and Robert Gentleman in the early 1990s and released as open-source software in 1995. It was created with the intention of making data analysis, statistical modelling, and graphics easier. R’s packages are distributed through CRAN (the Comprehensive R Archive Network), a large repository that users routinely contribute to. One of R’s main strengths is its very active community, which provides ample support to users via mailing lists, Stack Overflow, and extensive documentation of its packages. R has a slightly quirky syntax that can be hard for beginners to pick up, but it is especially suited to people from a statistical or research background who want to start building models quickly.
Python is a high-level, interpreted, general-purpose language first released in 1991 by Guido van Rossum, designed to improve programmer productivity and code readability. It is usually the preferred language for programmers and people with a computer science background looking to get into data analysis. It is a very flexible language, which makes it well suited to production-level work, and, like R, it has packages for statistics and machine learning, available from PyPI (the Python Package Index), the repository of Python packages. It has great community support, although, because it is a general-purpose language, that community is not concentrated solely on data science.
The biggest advantage of using Python is the availability of packages such as Theano, Keras, and scikit-learn, which are important machine learning and deep learning libraries used in both academic research and commercial work.
Choosing the right language
As professional problem solvers, data science practitioners need a versatile set of tools in their repertoire. Learning both R and Python is ideal, given that R makes data cleaning and manipulation very easy while Python is better for building models on larger data sets and at scale, but we all have to begin somewhere. The right choice for you depends on the following factors: previous programming experience, educational background, career aspirations, and interest in working with deep learning technology.
Previous Programming Experience
If you have any programming experience prior to learning data science, our recommendation is to learn Python. Its clear syntax will be easy for you to pick up, and because it is a general-purpose language, you will have the added flexibility to build things beyond data analysis. Even a complete novice is advised to learn Python, as it is one of the most beginner-friendly languages in computer science and the most popular introductory teaching language at the top U.S. universities (Communications of the ACM article, 2014). R code often gets to the point more quickly and is less verbose, but its quirky syntax can be difficult to learn for experienced programmers and beginners alike. We recommend this course for those interested in learning Python programming.
Educational Background
A background in statistics or mathematics makes R a better choice for you. R is a domain-specific language created for statistics, which makes it intuitive for people with a degree in the subject. It was created by statisticians with other statisticians in mind, so a grasp of statistical analysis makes the transition into the language all the easier.
Career Aspirations
As a data analyst, business analyst, or financial analyst, your focus will be on extracting the most information from your data, without needing to build a product around it. For this reason, learning R and a database language like SQL will serve you better: R is great for working with tabular data on a single system or server and has excellent libraries like ggplot2 for easy visualizations.
A data scientist has different requirements, as they are expected to carry out analysis and also create products, such as machine learning engines that work on the database of a website or a software application. This requires both software development and predictive modelling work, which is better accomplished with a general-purpose language like Python. These principles apply across all industries.
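To make the combination of predictive modelling and software development concrete, here is a minimal sketch using only the Python standard library. The function names (fit_line, make_predictor) and the spend/sales figures are hypothetical and chosen purely for illustration; a real project would more likely rely on a library such as scikit-learn.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def make_predictor(xs, ys):
    """Wrap the fitted model in a function that other code can call."""
    slope, intercept = fit_line(xs, ys)

    def predict(x):
        return slope * x + intercept

    return predict


# Hypothetical data (illustration only): the points lie exactly on y = 2x + 1.
spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

predict = make_predictor(spend, sales)
print(predict(5))  # prints 11.0
```

The modelling step (fit_line) and the piece a product would call (predict) sit in the same language, which is the kind of overlap the paragraph above describes.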
Interest in Deep Learning
Deep learning is the topic du jour, and anyone with an interest in contributing to the growth of artificial intelligence technology should be learning Python. Its overwhelming popularity for both machine learning and deep learning comes from the fact that Python acts as an interface between the programmer and lower-level languages like C/C++. This makes it very easy to experiment, create models, and debug without compromising on computational speed, as the heavy computation is carried out in C/C++ and CUDA code underneath. It also makes Python a very accessible language for mathematicians and statisticians who want to create neural network models, since existing Python frameworks spare them from building everything from scratch.
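The idea of Python acting as a front end to compiled code can be seen on a small scale without any deep learning framework. The sketch below assumes the standard CPython interpreter, where the built-in sum is implemented in C; python_level_sum is a hypothetical helper written only for comparison. Both produce the same answer, and the built-in is typically faster, although exact timings depend on the machine.

```python
import timeit


def python_level_sum(values):
    """Add the numbers one at a time in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(1_000_000))

# Same result either way; only where the loop runs differs.
assert python_level_sum(values) == sum(values)

# Time five runs of each approach (results vary by machine).
loop_time = timeit.timeit(lambda: python_level_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"Python loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

Deep learning libraries apply the same principle on a much larger scale: the model is described in Python, while the numerical work runs in compiled code.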
As you can see, the deeper you wish to get into data science and machine learning, the more it makes sense for you to opt for Python, though R has its own advantages as well. Ultimately, a thorough understanding of each language’s strengths and limitations is the best approach to learning these two unique languages. With that said, we suggest data science enthusiasts make the choice that suits their needs and aspirations.