Summary
Python has become one of the dominant languages for data science and data analysis. Wes McKinney has been working for a decade to make tools that are easy and powerful, starting with the creation of Pandas, and eventually leading to his current work on Apache Arrow. In this episode he discusses his motivation for this work, what he sees as the current challenges to be overcome, and his hopes for the future of the industry.
Announcements
Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Your host as usual is Tobias Macey and today I’m interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone
Interview
Introductions
How did you get introduced to Python?
You have spent a large portion of your career on building tools for data science and analytics in the Python ecosystem. What is your motivation for focusing on this problem domain?
Having been an open source author and contributor for many years now, what are your current thoughts on paths to sustainability?
What are some of the common challenges pertaining to data analysis that you have experienced in the various work environments and software projects that you have been involved in?
What area(s) of data science and analytics do you find are not receiving the attention that they deserve?
Recently there has been a lot of focus and excitement around the capabilities of neural networks and deep learning. In your experience, what are some of the shortcomings or blind spots to that class of approach that would be better served by other classes of solution?
Your most recent work is focused on the Arrow project for improving interoperability across languages. What are some of the cases where a Python developer would want to incorporate capabilities from other runtimes?
Do you think that we should be working to replicate some of those capabilities into the Python language and ecosystem, or is that wasted effort that would be better spent elsewhere?
Now that Pandas has been in active use for over a decade and you have had the opportunity to get some space from it, what are your thoughts on its success?
With the perspective that you have gained in that time, what would you do differently if you were starting over today?
You are best known for being the creator of Pandas, but can you list some of the other achievements that you are most proud of?
What projects are you most excited to be working on in the near to medium future?
What are your grand ambitions for the future of the data science community, both in and outside of the Python ecosystem?
Do you have any parting advice for active or aspiring data scientists, or resources that you would like to recommend?
Keep In Touch
wesm on GitHub
Website
@wesmckinn on Twitter
Picks
Tobias
Roald Dahl
Wes
The Soul Of A New Machine by Tracy Kidder
Links
Ursa Labs
Pandas
Podcast Interview with Jeff Reback
Pandas Extension Arrays Interview with Tom Augspurger
AQR Capital Management
Distributed Computing
SQL
Excel
Duke University
AppNexus
Chang She
Ibis
Open Source Governance
Apache Software Foundation
Paul Graham
Schlep Blindness
Big Data File Formats
Avro
Parquet
ORC
Data Engineering Podcast Episode
Apache Arrow
Hadoop
Spark
Data Engineering Podcast Episode
Apache Impala
R Language
Ruby
Rust
Pandas 2.0 Design Docs
Apache Arrow and the 10 Things I Hate About Pandas
GeoPandas
Statsmodels
Python For Data Analysis by Wes McKinney
Two Sigma
RStudio
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA