I loved his Statistics for Hackers talk: https://speakerdeck.com/pycon2016/jake-vanderplas-statistics...
Amazing! Thank you for sharing.
Reminds me of how thinking in frequencies rather than computing probabilities is easier and can avoid errors (e.g. a positive result from a 99% accurate test does not mean a 99% likelihood of having the disease when the disease has a 1/10,000 prevalence in the population).
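The same reasoning in code, as a quick sketch (assuming "99% accurate" means 99% sensitivity and 99% specificity, which the usual phrasing glosses over):

```python
# Base-rate check in whole-number frequencies rather than probabilities.
population = 1_000_000
prevalence = 1 / 10_000   # 100 true cases per million
sensitivity = 0.99        # P(positive | disease)
specificity = 0.99        # P(negative | no disease)

sick = population * prevalence                              # 100 people
true_positives = sick * sensitivity                         # ~99 people
false_positives = (population - sick) * (1 - specificity)   # ~9,999 people

p_disease_given_positive = true_positives / (true_positives + false_positives)
print(f"{p_disease_given_positive:.2%}")  # ~0.98%, nowhere near 99%
```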
These types of books are always interesting to me because they tackle so many different things. They cover a range of topics at a high level (data manipulation, visualization, machine learning), and each could have its own book. They balance teaching programming with introducing concepts (and sometimes theory).
In short, I think it's hard to strike an appropriate balance between these, but this seems to be a good intro-level book.
This book was absolute fire for getting started with data science in 2017-2018; Jake is a great teacher.
Interesting choice of Pandas in this day and age. Maybe he’s after imparting general concepts that you could apply to any tabular data manipulator rather than selecting for the latest shiny tool.
It was originally published in 2016, and I think this is still the first edition.
Looks like it. From https://jakevdp.github.io/PythonDataScienceHandbook/00.00-pr...:
> Copyright 2016
Why? It's the industry standard as far as my reach goes.
What other framework would you replace it with?
No, Polars or Spark is not a good answer; those are optimized for data engineering performance, not a holistic approach to data science.
You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.
Today all serious DS work will ultimately become data engineering work anyway. The time when a DS could just fiddle around in notebooks all day has passed.
Pandas is widely adopted and deeply integrated into the Python ecosystem. Meanwhile, Polars remains a small niche, and it's one of those hype technologies that will likely be dead in 3 years once most of its users realise that it offers them no actual practical advantages over Pandas.
If you are dealing with huge data sets, you are probably using Spark or something like Dask already where jobs can run in the cloud. If you need speed and efficiency on your local machine, you use NumPy outright. And if you really, really need speed, you rewrite it in C/C++.
Polars is trying to solve an issue that just doesn't exist for the vast majority of users.
Arguably Spark solves a problem that does not exist anymore: single node performance with tools like DuckDB and Polars is so good that there’s no need for more complex orchestration anymore, and these tools are sufficiently user-friendly that there is little point to switching to Pandas for smaller datasets.
> Pandas is widely adopted and deeply integrated into the Python ecosystem.
This is pretty laughable. Yes, there are DS-specific tools that make good use of Pandas, but `to_pandas` in Polars trivially solves this. The fact that Pandas always feels like injecting some weird DSL into existing Python code bases is one of the major reasons I really don't like it.
> If you are dealing with huge data sets, you are probably using Spark or something like Dask already where jobs can run in the cloud. If you need speed and efficiency on your local machine, you use NumPy outright. And if you really, really need speed, you rewrite it in C/C++.
Have you used Polars at all? Or, for that matter, written significant Pandas outside of a notebook? The number one benefit of Polars, imho, is that it works using Expressions that let you trivially compose and reuse fundamental logic when working with data, in a way that works well with other Python code. This solves the biggest problem with Pandas: it does not abstract well.
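A minimal sketch of what that composability looks like (hypothetical column names):

```python
import polars as pl

# Expressions are plain Python values: they can be built, named, and
# reused long before any actual DataFrame shows up.
revenue = (pl.col("price") * pl.col("qty")).alias("revenue")
big_order = pl.col("qty") >= 100

def summarize(frame: pl.DataFrame) -> pl.DataFrame:
    # Reuses the expressions above; the logic lives apart from the data.
    return frame.with_columns(revenue).filter(big_order).sort("revenue", descending=True)

df = pl.DataFrame({"price": [9.5, 3.0, 7.25], "qty": [120, 400, 50]})
print(summarize(df))
```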
Not to mention that Pandas is a really poor dataframe experience outside of its original use case, which was financial time series. The entire multi-index experience is awful, and I know that either you are calling 'reset_index' multiple times in your Pandas logic or you have bugs.
> once most of its users realise that it offers them no actual practical advantages over Pandas
What? Speed and better nested data support (arrays/JSON) alone are extremely useful to every data scientist.
My productivity skyrocketed after switching from pandas to polars.
>Today DS work will ultimately become data engineering work anyway.
Oh yeah? Well in my ivory tower the work stops being serious once it becomes engineering, how do you like that elitism?!
"Data Science" has never been related to academic research, it has always emerged in a business context. I wouldn't say that researchers at Deep Mind are "data scientists", they are academic researchers who focus on shipping papers. If you're in a pure research environment, nobody cares if you write everything in Matlab.
But the last startup I was at tried to take a similar approach to research, was unable to ship a functioning product, and will likely disappear a year from now. FAIR has been largely disbanded in favor of the far more shipping-centric MSL, and the people I know at DeepMind are increasingly finding themselves under pressure to actually produce things.
Since you've been hanging out in an ivory tower, you might be unaware that during the peak DS frenzy (2016-2019) there were companies where data scientists were allowed to live entirely in notebooks and it was someone else's problem to ship their notebooks. Today, if you have that expectation you won't last long at most companies, if you can even find a job in the first place.
On top of that, I know quite a few people at the major LLM teams and, based on my conversations, all of them are doing pretty serious data engineering work to get things shipped even if they were hired for their modeling expertise. It's honestly hard to even run serious experiments at the scale of modern-day LLMs without being pretty proficient at data engineering tasks.
> No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.
Can you expand on why Polars isn't optimised for a holistic approach to data science?
I have not worked with Polars, but I would imagine any incompatibility with existing libraries (e.g. plotting libraries like plotnine or bokeh) would quickly put me off.
It is a curse, I know. I would also choose a better interface. Performance is meh to me; I use SQL if I want to do something at scale that involves row/column data.
This is a non-issue thanks to the Polars dataframe's to_pandas() method. You get all the performance of Polars for cleaning large datasets, and to_pandas() gives you backwards compatibility with other libraries. Plotnine, in any case, is completely compatible with Polars dataframe objects.
You can always convert from Polars to Pandas. Plotnine will do it automatically for you, even.
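A rough sketch of that workflow (made-up data; to_pandas() needs pandas, and usually pyarrow, installed):

```python
import polars as pl

# Do the heavy lifting in Polars, then hand a pandas frame to a
# pandas-only library at the very last step.
df = pl.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.5, 8.0]})
cleaned = df.filter(pl.col("y") > 2.0)

pdf = cleaned.to_pandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>
```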
What can you do more easily in pandas than in polars?
The book is quite old actually, not sure if "this day and age" still applies to it
What's wrong with Pandas?
Pandas is generally awful unless you're just living in a notebook (and even then it's probably my least favorite implementation of the 'data frame' concept).
Since Pandas lacks Polars' concept of an Expression, it's actually quite challenging to programmatically interact with non-trivial Pandas queries. In Polars the query logic can be entirely independent of the data frame while still referencing specific columns of the data frame. This makes Polars data frames work much more naturally with typical programming abstractions.
Pandas' multi-index is a bad idea in nearly all contexts other than its original use case: financial time series (and I'll admit, if you're working purely with financial time series, Pandas feels much better). Sufficiently large Pandas code bases are littered with seemingly arbitrary uses of 'reset_index', there are many situations where a multi-index will create bugs, and, most importantly, I've never seen a non-financial scenario where anyone used a multi-index to their advantage.
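A small illustration of the reset_index dance (made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["us", "us", "eu", "eu"],
    "year":   [2023, 2024, 2023, 2024],
    "sales":  [10, 12, 7, 9],
})

# groupby silently moves the grouping columns into a MultiIndex...
grouped = df.groupby(["region", "year"]).sum()
print(grouped.index)          # MultiIndex of (region, year) pairs

# ...so code downstream sprinkles reset_index to get ordinary columns back.
flat = grouped.reset_index()
print(flat.columns.tolist())  # ['region', 'year', 'sales']
```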
Finally, Pandas is slow, which is honestly the lowest priority for me personally, but using Polars is so refreshing.
What other data frames have you used? Having used R's native dataframes extensively (the way they make use of indexing is so much nicer) in addition to Polars, I find both drastically preferable to Pandas. My experience is that most people use Pandas because it has been the only data frame implementation in Python. But personally I'd rather just not use data frames at all if I'm forced to use Pandas. Could you expand on what you like about Pandas over the other data frame models you've worked with?
I initially considered using Pandas to work with community collections of Elite: Dangerous game data, specifically those published first by EDDB (RIP) and now by Spansh. However, I quickly hit the maximum process memory limits, because my naïve attempts at manipulating even the smallest of those collections resulted in Pandas loading GB-scale JSON data files into RAM. I'm intrigued by Polars' stated support for data streaming. More professionally, I support the work of bioinformaticians, statisticians, and data scientists, so I like to stay informed.
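Something like this lazy/streaming sketch is what I'd want to try (assuming newline-delimited JSON and made-up field names; the actual Spansh dumps may need reshaping into that form first):

```python
import polars as pl

# Lazy scan + streaming collect, so the whole file never has to sit in RAM.
lazy = (
    pl.scan_ndjson("systems.jsonl")
      .filter(pl.col("distance") < 100.0)
      .select(["name", "distance"])
)
df = lazy.collect(streaming=True)  # newer Polars spells this collect(engine="streaming")
print(df.head())
```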
I like how in Pandas (and in R), I can quickly load data sets up in a manner that lets me do relational queries using familiar syntax. For my Elite: Dangerous project, because I couldn't get Pandas to work for me (which the reader should chalk up to my ignorance and not any deficiency of Pandas itself), I ended up using the SQLAlchemy ORM with Marshmallow to load the data into SQLite or PostgreSQL. Looking back at the work, I probably ought to have thrown it into a JSON-aware data warehouse somehow, which I think is how the guy behind Spansh does it, but I'm not a big data guy (yet) and have a lot to learn about what's possible.
I probably wouldn’t rewrite an entire data science stack that used pandas, but most people would use polars if starting a new project today.
R and Matlab workflows have been fairly stable for the past decade. Why is the Python ecosystem so... unstable? It puts me off investing any time in it.
The R ecosystem has had a similar evolution with the tidyverse; it was just a little further back. As for Matlab, I initially learned statistical programming with it a long time ago, but I'm not sure I've ever seen it in the wild. I don't know what's going on there.
I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of python because python can integrate into production systems whereas R was (and maybe still is) geared towards writing reports. One of the best things to happen recently in data science is the plotnine library, bringing the grammar of graphics to python imho.
The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in python.
Mostly what's going on with Matlab in the wild is that it costs at least $10k a seat as soon as you are no longer at an academic institution.
Yes, there is Octave, but often the toolboxes aren't available or compatible, so you're rewriting everything anyway. And when you start rewriting things for Octave, you learn (or remember) what trash Matlab actually is as a language, and how big a pain it is to do anything that isn't what Mathworks expects.
To be fair: Octave has extended Matlab's syntax with amazing improvements (many inspired by numpy and R). It really makes me angry that Mathworks hasn't stolen Octave's innovations, and whenever I have to touch actual Matlab I hate every minute of not being able to broadcast, and of having to manually create temp variables because you can't chain indexing. So, to be clear: Octave is somewhat pleasant, and for pure numerical syntax it's superior to numpy.
But the siren call of Python is significant. Python is not the perfect language (for anything really) but it is a better-than-good language for almost everything and it's old enough and used by so many people that someone has usually scratched what's itching already. Matlab's toolboxes can't compete with that.
The pandas workflows have also been stable for the last decade. That there is a new kid on the block (polars) does not make the existing stuff any less stable. And one can just continue writing pandas for the next decade too.
I love R, but how can you make that claim when R uses three distinct object-oriented systems all at the same time? R might seem stable only because it carries along with it 50 years of programming language history (part of its charm; where else can you see the generic-function approach to OOP in a language that's still evolving?)
Finally, as someone who wrote a lot of R pre-tidyverse, I've seen the entire ecosystem radically change over my career.
Outside bioconductor or the tidyverse, R can be just as unstable due to CRAN's package requirements.
Pandas turns 10x developers with a lust for life into 0.1x developers with grey hairs.
Ha, I think that happens regardless of the tech you use. Just blame time.
Nothing, it gets the job done for most people. If you don't like it, make a better tool. Polars is not it.
I used the Kernel Density Estimation (KDE) page/blog at my very first job. It was immensely useful and I've loved his work ever since.
https://learningds.org/intro.html (CC BY-NC-ND)
I honestly don't get why you'd hate pandas more than anything else in the Python ecosystem. It's probably not the best tool in the world, and sure, like everybody else I'd rewrite the universe in Rust if I could start over, and had infinite time to catch up.
But the code base I work on has thousands and THOUSANDS of lines of Pandas churning through big data, and I can't remember the last time it led to a bug or error in production.
We use pandas + static schema wrapper + type checker, so you'll have to get exotic to break things.
Custom schema wrapper or some package you'd recommend from pypi?
He's a great writer and I miss his blog. He had an awesome post on pivot table that I think is now a part of this book.
He is also the creator of the Altair visualization library (Vega-Lite in Python https://altair-viz.github.io/). I really like using it.
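For anyone who hasn't seen it, a minimal Altair example (made-up data):

```python
import altair as alt
import pandas as pd

# Declarative encodings that compile down to a Vega-Lite spec.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 1, 4, 2], "cat": list("aabb")})

chart = alt.Chart(df).mark_point().encode(x="x", y="y", color="cat")
chart.save("chart.html")  # renders via Vega-Lite in the browser
```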
Thanks for that. I've used Altair sometimes and really admire its simplicity; I didn't know it was written by Jake.
very cool!
It was written 8 years ago though; there is a 2nd edition of the book by the same author.
The linked GitHub seems to have the 2nd edition in the form of notebooks (https://github.com/jakevdp/PythonDataScienceHandbook/blob/ma...). Under the Using Code Examples section, "attribution usually includes the title, author, publisher, and ISBN. For example: 'Python Data Science Handbook, 2nd edition, by Jake VanderPlas (O’Reilly). Copyright 2023...'", compared to the OP's link, which has "The Python Data Science Handbook by Jake VanderPlas (O’Reilly). Copyright 2016..."
This is one of the few books that I read cover-to-cover when I was starting out learning Data Science in 2020/21. Would recommend.
I wouldn't say it's a handbook because it's more like an introduction. But it's pretty well written.
Pandas is cancer. Please stop teaching it to people.
Everything it does can be done reasonably well with list comprehensions and objects that support type annotations and runtime type checking (if needed).
Pandas code is untestable, unreadable, hard to refactor and impossible to reuse.
Trillions of dollars are wasted every year by people having to rewrite pandas code.
Code using pandas is testable and reusable in much the same way as any other code: write functions that take and return data.
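A tiny sketch of that function-in, function-out style (hypothetical columns):

```python
import pandas as pd
import pandas.testing as pdt

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Pure function: takes a frame, returns a new frame."""
    return df.assign(total=df["price"] * df["qty"])

def test_add_total():
    inp = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4]})
    expected = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4], "total": [10.0, 12.0]})
    pdt.assert_frame_equal(add_total(inp), expected)
```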
That said, the polars/narwhals-style API is better than the pandas API for sure. More readable and composable, simpler (no index), and a bit less weird overall.
Polars made the mistake of not maintaining row order for all operations, via the False-by-default argument of maintain_order. This is basically the billion-dollar null mistake for data frames.
Yeah, that really should have been the default. Very big footgun, especially when preserving ordering is the default in pandas, numpy, etc. And especially since there is no ingrained index concept in polars, people might very well forget that one needs some natural keys and shouldn't rely on ordering. One needs to bring more of an SQL mindset.
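To illustrate (made-up data):

```python
import polars as pl

df = pl.DataFrame({"key": ["b", "a", "b", "a"], "val": [1, 2, 3, 4]})

# Group order in the result is an implementation detail here and can
# differ between runs...
fast = df.group_by("key").agg(pl.col("val").sum())

# ...unless you opt in to stable ordering (at some cost) or sort afterwards.
stable = df.group_by("key", maintain_order=True).agg(pl.col("val").sum())
print(stable)
```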
> Everything it does can be done reasonable well with list comprehensions and objects that support type annotations and runtime type checking (if needed).
I see this take somewhat often, and usually with a similar lack of nuance. How do you come to this? In other cases where I've seen it, it's from people who haven't worked in any context where performance or scientific-computing ecosystem interoperability matters, missing a massive part of the picture. I've struggled to get through to them before. Genuine question.
> Pandas code is untestable
The thousand-plus data integrity tests I've written in pandas tell a different story...
I've recently had to migrate over to Python from Matlab. Pandas has been doing my head in. The syntax is so unintuitive. In Matlab, everything begins with a `for` loop. Inelegant and slow, yes, but easy to reason about. Easy to see the scope and domain of the problem, to visualise the data wrangling.
Pandas insists you never use a for loop. So I feel guilty if I ever need a throwaway variable on the way to creating a new column. Sometimes methods are attached to objects; other times they aren't. And if you need to use a function that isn't vectorised, you've got to use df.apply anyway, and you have to remember to change the 'axis' too. Plotting is another thing I can't get my head around. Am I supposed to use Pandas' helpers like df.plot() all the time? Or ditch them and use low-level matplotlib directly? What is idiomatic? I cannot find answers to much of this, even with ChatGPT. Worse, I can't seem to build a mental model of what Pandas expects me to do in a given situation.
Pandas has disabused me of the notion that Python syntax is self-explanatory, executable pseudocode. I find it terrible to look at. Matlab was infinitely more enjoyable.
Yeah, pandas is truly awful. After working with things like R, ggplot, data.table, you soon realize pandas is the worst dataframe analysis and plotting library out there.
I pretty much consider anyone who likes it to have Stockholm syndrome.
Polars has a much more consistent API, give it a shot.
Regarding your plotting question: use seaborn when you can, but you’ll still need to know matplotlib.
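Roughly the division of labour (uses seaborn's bundled demo dataset, which downloads on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# seaborn for the high-level statistical plot, matplotlib for the tweaks.
df = sns.load_dataset("penguins")
ax = sns.scatterplot(data=df, x="flipper_length_mm", y="body_mass_g", hue="species")
ax.set_title("Penguins")  # drop down to matplotlib for fine control
plt.tight_layout()
plt.show()
```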
A lot of people appreciate the declarative approach.
A for loop is a lot about the "how" but apply, join etc are much closer to the "what".
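For example (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# The "how": walk the rows and accumulate state by hand.
totals = {}
for _, row in df.iterrows():
    totals[row["group"]] = totals.get(row["group"], 0) + row["val"]

# The "what": say you want per-group sums and let pandas do the walking.
totals_series = df.groupby("group")["val"].sum()
```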
Maybe you are just bad at pandas.
I found Pandera quite good for wrapping input/output expectations over Pandas. At the end of the day the vectorisation of operations in it and other table based formats mean they’re not easy to replace performantly.
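A minimal Pandera sketch of that wrapping (hypothetical schema and column names):

```python
import pandas as pd
import pandera as pa

# Input contract, validated at the function boundary.
schema = pa.DataFrameSchema({
    "price": pa.Column(float, pa.Check.ge(0)),
    "qty": pa.Column(int, pa.Check.gt(0)),
})

def load_orders(raw: pd.DataFrame) -> pd.DataFrame:
    return schema.validate(raw)  # raises SchemaError on bad data

orders = load_orders(pd.DataFrame({"price": [2.5], "qty": [3]}))
```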
Can you write more about this? A lot of people use pandas where I work, whereas I'm completely fluent in list comprehensions and dataclasses etc. I had the impression it was doing something "more" like using numpy arrays/matrices for columns.