
2024 Landscape Analysis

Python is widely adopted in data science, and its use for statistics is expanding rapidly, particularly in education and applied research. The statistical ecosystem in Python is currently anchored by four major libraries:

These core libraries are generally well-tested, reliable, and uphold high software engineering standards, making them trusted foundations for research and application. Libraries like scikit-learn are especially valued for their clean, consistent interfaces and their integration with the broader Python data stack, which streamlines workflows and enhances usability for both new and experienced users.

While there are many smaller, specialized packages available, the ecosystem remains dominated by these large, general-purpose libraries. This concentration of resources ensures stability and quality but can also limit the visibility and adoption of innovative or niche statistical tools. As Python’s role in statistics continues to grow, fostering a more diverse and accessible ecosystem will be key to meeting the evolving needs of educators, researchers, and practitioners.

Relationship to Other Languages

R remains the gold standard for statistics, with better branding, a more cohesive ecosystem, and more teaching resources. R’s tidyverse and RStudio provide a smoother user experience for statistics, and CRAN offers a vast repository of statistical packages.

Table 1: Python vs. R for Statistics

| Aspect | Python (Scientific Python) | R (CRAN, tidyverse) |
| --- | --- | --- |
| Core Libraries | scipy.stats, statsmodels, scikit-learn | base R, tidyverse, many CRAN packages |
| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio |
| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly |
| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming |
| Package Development | High barriers, less modularity | Easy, many small packages, dev tools |
| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio |
| Branding | Data science/machine learning focus | Statistics-focused |

Interoperability: Some users switch between Python and R within a workflow, but true interoperability is limited. Most projects use one language at a time; when both appear, it is usually R for data manipulation and Python for modeling, or vice versa.

Other Platforms: Tools like GraphPad Prism remain popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates all applied domains.

Weaknesses and Needs

Despite Python’s strengths, several challenges remain.

Conclusion

Python’s statistics ecosystem is powerful but fragmented, with significant opportunities for improvement in usability, interoperability, teaching resources, and community cohesion. While R remains the default for statistics, Python is gaining ground, especially as data science and machine learning continue to grow in influence. Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain. In particular, Python needs:

The Statistical Python project seeks to address these needs by fostering collaboration, sharing best practices, and building a sustainable, inclusive community. As a domain stack within the Scientific Python project, and with support from the NSF POSE Phase I grant, we are committed to making Python a premier platform for statistical computing, education, and research.