R vs. Python for Data Science: Summary of Modern Advances (2024)

Recently, some of our readers have been asking us about the best programming language for data science. Immediately, R and Python both come to mind… but which of these two giants to choose?

We felt that this was a good time to address this question because we recently watched an excellent presentation on recent advances in both languages by Eduardo Ariño de la Rubia, the Chief Data Scientist at Domino Data Lab.

The main reason we liked the video is that it shows how far both Python and R have progressed. Both languages have become well rounded for data science.

Some people point to traditional weaknesses of each language (e.g. data visualization in Python or data wrangling in R), but thanks to recent packages like Altair for Python and dplyr for R, those weaknesses have been alleviated.

This post is a summary of the modern advances discussed in the video. We recommend watching the full video at their blog, but you can use this page to find links to each library mentioned.

We have 2 main goals for this post:

  1. For experienced data scientists, we hope to introduce you to a library or two that solves an annoying or painful problem you’re currently facing in your chosen language.
  2. For beginner data scientists, we want to introduce you to all the great work that’s going into both languages so you can feel at ease with the one you chose.

Finally, at the end of this post, we’ll provide our recommendations for the best language to start with depending on your background and your goals.

First, here is the summary from the presentation:

The Case for Python

Key quote: “I have this hope that there is a better way. Higher-level tools that actually let you see the structure of the software more clearly will be of tremendous value.” – Guido van Rossum

Guido van Rossum is the creator of the Python programming language.

Why Python is Great for Data Science

  • Python has been around for a long time (development began in 1989, and the first release came in 1991), and it has object-oriented programming baked in.
  • IPython / Jupyter’s notebook IDE is excellent.
  • There’s a large ecosystem. For example, Scikit-Learn’s page receives 150,000 – 160,000 unique visitors per month.
  • There’s Anaconda from Continuum Analytics, making package management very easy.
  • The Pandas library makes it simple to work with data frames and time series data.
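
To give a flavor of the time series support mentioned above, here is a minimal sketch using pandas. The column name `units` and the sample values are made up for illustration; `pd.date_range` and `DataFrame.resample` are standard pandas calls.

```python
import pandas as pd

# Six days of made-up daily sales figures, indexed by date.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.DataFrame({"units": [3, 5, 2, 8, 7, 1]}, index=idx)

# Downsample to 2-day bins and sum the units sold in each bin.
two_day = sales.resample("2D").sum()
print(two_day)
```

Because the index is a `DatetimeIndex`, the same data frame can also be sliced by date strings, e.g. `sales.loc["2024-01-02":"2024-01-04"]`.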

Advances in Modern Python for Data Science

1. Collecting Data

Feather (Fast reading and writing of data to disk)

  • Fast, lightweight, easy-to-use binary file format for data frames
  • Makes pushing data frames in and out of memory as simple as possible
  • Language agnostic (works across Python and R)
  • High read and write performance (600 MB/s vs 70 MB/s of CSVs)
  • Great for passing data from one language to another in your pipeline

Ibis (Pythonic way of accessing datasets)

  • Bridges the gap between local Python environments and remote storages like Hadoop or SQL
  • Integrates with the rest of the Python ecosystem

ParaText (Fastest way to get fixed records and delimited data off of disk and into RAM)

  • C++ library for reading text files in parallel on multi-core machines
  • Integrates with Pandas: paratext.load_csv_to_pandas("data.csv")
  • Enables CSV reading of up to 2.5GB a second
  • A bit difficult to install

bcolz (Helps you deal with data that’s larger than your RAM)

  • Compressed columnar storage
  • You have the ability to define a Pandas-like data structure, compress it, and store it in memory
  • Helps get around the performance bottleneck of querying from slower memory
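
The idea of compressed columnar storage can be sketched with nothing but the standard library. This is not bcolz's API; `CompressedColumn` is a hypothetical helper we made up, and zlib stands in for whatever compressor a real library would use.

```python
import zlib
from array import array


class CompressedColumn:
    """Hypothetical helper: one column kept in memory as a compressed buffer."""

    def __init__(self, typecode, values):
        self.typecode = typecode
        raw = array(typecode, values).tobytes()
        self.raw_size = len(raw)          # size before compression, in bytes
        self.blob = zlib.compress(raw)    # what actually stays in memory

    def values(self):
        """Decompress on demand and return the column as a typed array."""
        out = array(self.typecode)
        out.frombytes(zlib.decompress(self.blob))
        return out


# A "table" is a dict of columns; each column is compressed independently,
# so a query that needs one column never touches the others.
table = {
    "user_id": CompressedColumn("q", range(100_000)),
    "clicks": CompressedColumn("q", [i % 7 for i in range(100_000)]),
}

total_clicks = sum(table["clicks"].values())
print(total_clicks, table["clicks"].raw_size, len(table["clicks"].blob))
```

Storing each column separately is what makes the compression effective: values in one column tend to look alike, so they compress far better than mixed rows would.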

2. Data Visualization

Altair (Like a Matplotlib 2.0 that’s much more user friendly)

  • You can spend more time understanding your data and its meaning.
  • Altair’s API is simple, friendly and consistent.
  • Create beautiful and effective visualizations with a minimal amount of code.
  • Takes a tidy DataFrame as the data source.
  • Data is mapped to visual properties using the group-by operation of Pandas and SQL.
  • Primarily for creating static plots.
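
To make the "tidy DataFrame in, group-by mapping out" idea concrete, here is a rough sketch that avoids Altair itself. Altair charts compile to Vega-Lite JSON specifications; the spec below is hand-written in that general shape (not output captured from Altair), and the field names and values are made up.

```python
import json
from collections import defaultdict

# Tidy data: one row per observation (values are made up for illustration).
rows = [
    {"species": "cat", "weight_kg": 4.0},
    {"species": "cat", "weight_kg": 5.0},
    {"species": "dog", "weight_kg": 20.0},
    {"species": "dog", "weight_kg": 30.0},
]

# Declarative mapping from data fields to visual properties.
# Hand-written sketch of a Vega-Lite style spec, an assumption about its shape.
spec = {
    "data": {"values": rows},
    "mark": "bar",
    "encoding": {
        "x": {"field": "species", "type": "nominal"},
        "y": {"field": "weight_kg", "type": "quantitative", "aggregate": "mean"},
    },
}

# What the "aggregate" on y amounts to: group rows by x, then average y.
groups = defaultdict(list)
for row in rows:
    groups[row["species"]].append(row["weight_kg"])
means = {species: sum(vals) / len(vals) for species, vals in groups.items()}

print(json.dumps(spec["encoding"], indent=2))
print(means)
```

The point is that you describe which column drives which visual channel and let the library do the grouping, rather than computing the bar heights yourself.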

Bokeh (Reusable components for the web)

  • Interactive visualization library that targets modern web browsers for presentation.
  • Able to embed interactive visualizations.
  • D3.js for Python, except better.
  • Already has a big gallery that you can borrow (steal) from.

Geoplotlib (Interactive maps)

  • Extremely clean and simple way to create maps.
  • Can take a simple list of names, latitudes, and longitudes as input.

3. Cleaning & Transforming Data

Blaze (NumPy for big data)

  • Translates a NumPy / Pandas-like syntax to data computing systems.
  • The same Python code can query data across a variety of data storage systems.
  • Good way to future-proof your data transformations and manipulations.

xarray (Handles n-dimensional data)

  • N-dimensional arrays of core pandas data structures (e.g. if the data has a time component as well).
  • Multi-dimensional Pandas dataframes.

Dask (Parallel computing)

  • Dynamic task scheduling system.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.
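
To illustrate what a task scheduling system works with, here is a toy sketch. Dask represents work as a graph in which each key maps either to plain data or to a task tuple whose first element is a callable. The `get` function below is our own hypothetical, sequential stand-in and not Dask's scheduler, which also handles parallelism, memory, and distribution.

```python
from operator import add, mul


def get(graph, key, cache=None):
    """Toy scheduler: resolve `key` by running the tasks it depends on first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name other keys are computed recursively;
        # anything else is passed through as a literal value.
        resolved = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = task  # plain data, nothing to run
    cache[key] = result
    return result


# (x + y) * 10, expressed as a graph of small tasks.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "scaled": (mul, "sum", 10),
}

answer = get(graph, "scaled")
print(answer)
```

Because every dependency is explicit in the graph, a real scheduler can see which tasks are independent and run them at the same time.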

4. Modeling

Keras (Simple deep learning)

  • Higher level interface for Theano and Tensorflow
  • We wrote a complete Keras tutorial for beginners

PyMC3 (Probabilistic programming)

  • Incorporates cutting-edge research from academic labs
  • Powerful Bayesian statistical modeling

Do you want to see tutorials for any of these libraries? Leave a comment below to let us know which ones!

The Case for R

Key quote: “There should be an interface to the very best numerical algorithms available.” – John Chambers

John Chambers actually created S, the precursor to R, but the spirit of R is the same.

Why R is Great for Data Science

  • R was created in 1992, after Python, and was therefore able to learn from Python’s lessons.
  • Rcpp makes it very easy to extend R with C++.
  • RStudio is a mature and excellent IDE.
  • (Our note) CRAN is a candyland filled with machine learning algorithms and statistical tools.
  • (Our note) The Caret package makes it easy to use different algorithms from a single interface, much like what Scikit-Learn has done for Python.

Advances in Modern R for Data Science

1. Collecting Data

Feather (Fast reading and writing of data to disk)

  • Same as for Python

Haven (Interacts with SAS, Stata, SPSS data)

  • Reads SAS data into a dataframe

Readr (Reimplements read.csv into something better)

  • read.csv sucks because it converts strings into factors, it’s slow, etc.
  • Creates a contract for what the data features should be, making it more robust to use in production
  • Much faster than read.csv
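
Readr is an R package, but the "contract" idea carries over to any language, so here is a rough Python sketch for consistency with the examples above. `SCHEMA` and `read_with_schema` are hypothetical names of our own; the point is only that column names and types are declared up front and violations fail loudly instead of being silently coerced.

```python
import csv
import io
from datetime import date

# Hypothetical column contract: column name -> parser for that column.
SCHEMA = {"id": int, "price": float, "sold_on": date.fromisoformat}


def read_with_schema(text, schema):
    """Parse CSV text, enforcing that every declared column exists and parses."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, record in enumerate(reader, start=2):
        try:
            rows.append({name: parse(record[name]) for name, parse in schema.items()})
        except ValueError as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc
    return rows


sample = "id,price,sold_on\n1,9.99,2024-01-05\n2,12.50,2024-01-06\n"
rows = read_with_schema(sample, SCHEMA)
print(rows)
```

In a production pipeline, this kind of check means a malformed upstream file stops the job at the read step rather than producing wrong numbers later.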

JsonLite (Handles JSON data)

  • Intelligently turns JSON into matrices or dataframes

2. Data Visualization

ggplot2 (ggplot2 was recently massively upgraded)

  • Recently had a very significant upgrade (to the point where old code will break)
  • You can do faceting and zoom into facets

htmlwidgets (Reusable components)

  • Brings the best of JavaScript visualization to R
  • Has a fantastic gallery you can borrow (steal) from

Leaflet (Interactive maps for the web)

  • Nice JavaScript maps that you can embed in web applications

Tilegramsr (Proportional maps)

  • Create maps that are proportional to the population
  • Makes it possible to create more interesting maps than those that only highlight major cities due to population density

3. Cleaning & Transforming Data

Dplyr (Swiss army chainsaw)

  • The way R should’ve been in the first place
  • Has a bunch of amazing joins
  • Makes data wrangling much more humane

Broom (Tidy your models)

  • Fixes model outputs (gets around the weird incantations needed to see model coefficients)
  • tidy, augment, glance

Tidytext (Text as tidy data)

  • Text mining using dplyr, ggplot2, and other tidy tools
  • Makes natural language processing in R much easier

4. Modeling

MXNet (Simple deep learning)

  • Intuitive interface for building deep neural networks in R
  • Not quite as nice as Keras

TensorFlow

  • Now has an interface in R

Do you want to see tutorials for any of these libraries? Leave a comment below to let us know which ones!

Our Recommendation

As you can see, both languages are actively being developed and have an impressive suite of tools already. It sounds cliché to say this, but there’s really no one-size-fits-all answer.

If you’re just starting out, one simple way to choose would be based on your comfort zone. For example, if you come from a C.S./developer background, you’ll probably feel more comfortable with Python. On the other hand, if you come from a statistics/analyst background, R will likely be more intuitive.

At EliteDataScience, we do love R, but we more often prefer to use Python. Python is a general-purpose programming language, making it possible to do pretty much anything you want to do.

Python also has the wonderful Keras package, as mentioned above, making it a breeze to get started with deep learning.

If you’d like to learn Python for Data Science, we recommend checking out our free guide:

  • How to Learn Python for Data Science, The Self-Starter Way

FAQs

Is it better to learn Python or R for data science?

If your goal is to pick up computer programming more broadly, Python is the way to go. If your goal is to focus purely on statistics and data applications, R might have the edge. To decide whether to start learning Python or R first, ask yourself a few questions: What are your career goals?

Can I become a data scientist with R, or do I need Python?

Python and R are the two most popular programming languages for data science. Both languages are well suited for any data science tasks you may think of.

Is Python overtaking R?

Both languages have their strengths, but they differ in areas such as readability and performance. According to the KDnuggets data science survey, Python has overtaken R in popularity in recent years.

Is R enough for data analysis?

Python and R are both excellent languages for data. They're also both appropriate for beginners with no previous coding experience. Luckily, no matter which language you choose to pursue first, you'll find a wide range of resources and materials to help you along the way.

Can Python do everything R can?

R is used less often in production code because of its focus on research and statistics, while Python, a general-purpose language, is commonly used both for prototyping and in the product itself. Python is also often reported to run faster than R, despite its GIL limitations.

Is Python enough to become a data scientist?

As one of the most popular data science programming languages, Python is an incredibly helpful tool with a variety of applications in the field. To succeed in this field, devs have to understand not only Python as a language itself, but also its frameworks, tools, and other skills associated with the field.

Is R becoming obsolete?

The truth is, R is far from dead. While it's true that Python has gained significant traction in recent years, R remains a powerful language that offers unique benefits for data scientists. One of the critical advantages of R is its focus on statistics and data visualization.

Is R still relevant in 2024?

Performing statistical analysis in R is a valuable skill for aspiring data analysts to learn in 2024. R provides a wide range of functions and packages that make it easier to prepare data and perform complex analyses.

Which is more in demand, R or Python?

Python currently has about 15.7 million developers worldwide, while R has fewer than 1.4 million. This makes Python the more popular of the two languages. The only programming language that outpaces Python is JavaScript, which has 17.4 million developers.

What percent of data scientists use R?

Of the data professionals who identified as a data scientist, 93% used Python, 57% used SQL and 41% used R. Comparing program languages usage from 2018, we see that usage of Python has increased 4 percentage points (83% used in 2018) SQL usage remained the same (40% used in 2018).

What is the disadvantage of using R as a data analytics tool?

One of the main disadvantages of R is its steep learning curve. R has a unique and sometimes inconsistent syntax and logic that can be confusing and frustrating for beginners and even experienced users. R also requires a lot of coding and manual work that other software can do more easily and intuitively.

Is R or Python better for finance?

R: R is used mostly by data scientists because it focuses on data analysis, although Python has overtaken it in overall popularity. Because finance involves the calculation and analysis of data, R is a strong fit. Python: Python is used in almost all industries for data science, machine learning, and software development.

Is R programming necessary for data science?

R is heavily utilized in data science applications for ETL (Extract, Transform, Load). It provides an interface for many databases like SQL and even spreadsheets. R also provides various important packages for data wrangling.

Is Python or SQL better for data science?

SQL can be used for basic operations, but Python is generally preferred for data manipulation: libraries like NumPy or pandas contain most of the functions you need. Once you have cleaned and manipulated your data, you can visualize it!

Article information

Author: Carmelo Roob
