Refolk

Top Python Data engineering repositories on GitHub

Pipelines, orchestrators, and ELT/ETL tooling. Filtered to projects whose primary language is Python.

Ranked by stars across 251 Python repositories tagged data-engineering. Refreshed daily.

  1. apache/airflow · ★ 45,902 · ⑂ 17,283

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    • airflow
    • apache
    • apache-airflow
    • python
    • scheduler
    • workflow
  2. PrefectHQ/prefect · ★ 22,673 · ⑂ 2,348

    Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

    • python
    • workflow
    • data-engineering
    • data-science
    • workflow-engine
    • prefect
  3. airbytehq/airbyte · ★ 21,513 · ⑂ 5,236

    Open-source data movement for ELT pipelines and AI agents — from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.

    • data
    • pipeline
    • data-analysis
    • data-engineering
    • java
    • python
  4. Avaiga/taipy · ★ 19,246 · ⑂ 1,985

    Turns Data and AI algorithms into production-ready web applications in no time.

    • automation
    • data-engineering
    • data-ops
    • data-visualization
    • datascience
    • developer-tools
  5. dagster-io/dagster · ★ 15,736 · ⑂ 2,168

    An orchestration platform for the development, production, and observation of data assets.

    • data-pipelines
    • dagster
    • workflow
    • data-science
    • workflow-automation
    • python
  6. andkret/Cookbook · ★ 15,155 · ⑂ 2,719

    The Data Engineering Cookbook

    • data-engineer
    • data-engineering
    • big-data
    • best-practices
    • cookbook
  7. fivetran/great_expectations · ★ 11,598 · ⑂ 1,768

    Always know what to expect from your data.

    • pipeline-tests
    • dataquality
    • datacleaning
    • datacleaner
    • data-science
    • data-profiling
  8. xonsh/xonsh · ★ 9,529 · ⑂ 730

    🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.

    • xonsh
    • devops
    • iterm2
    • data-engineering
    • security-automation
    • raspberry-pi
  9. mage-ai/mage-ai · ★ 8,757 · ⑂ 971

    🧙 Build, run, and manage data pipelines for integrating and transforming data.

    • machine-learning
    • artificial-intelligence
    • data
    • data-engineering
    • data-science
    • python
  10. feast-dev/feast · ★ 7,102 · ⑂ 1,349

    The Open Source Feature Store for AI/ML

    • machine-learning
    • features
    • ml
    • big-data
    • feature-store
    • python
  11. Zipstack/unstract · ★ 6,669 · ⑂ 633

    LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

    • ai-agents
    • data-engineering
    • document-ai
    • generative-ai
    • idp
    • json-extraction
  12. dlt-hub/dlt · ★ 5,509 · ⑂ 530

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    • data
    • python
    • data-engineering
    • data-lake
    • data-loading
    • data-warehouse
  13. ruc-datalab/DeepAnalyze · ★ 4,278 · ⑂ 686

    DeepAnalyze is the first agentic LLM for autonomous data science. 🎈 Your AI data analyst: automatically analyzes large volumes of data and generates professional analysis reports in one click!

    • agent
    • agentic
    • agentic-ai
    • chatbot
    • data
    • data-analysis
  14. aws/aws-sdk-pandas · ★ 4,109 · ⑂ 733

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    • python
    • aws
    • pandas
    • apache-arrow
    • apache-parquet
    • data-engineering
  15. ploomber/ploomber · ★ 3,622 · ⑂ 241

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    • workflow
    • machine-learning
    • data-science
    • data-engineering
    • mlops
    • papermill
  16. datafold/data-diff · ★ 2,989 · ⑂ 308

    Compare tables within or across databases

    • database
    • mysql
    • postgresql
    • snowflake
    • rdbms
    • trino
  17. meltano/meltano · ★ 2,540 · ⑂ 246

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

    • dataops
    • dataops-platform
    • elt
    • open-source
    • opensource
    • data
  18. sodadata/soda-core · ★ 2,376 · ⑂ 276

    Data Contracts engine for the modern data stack. https://www.soda.io

    • python
    • data-engineering
    • data-governance
    • data-monitoring
    • data-observability
    • data-profiling
  19.

    Implementing best practices for PySpark ETL jobs and applications.

    • pyspark
    • etl-job
    • python
    • data-engineering
    • spark
    • data-science
  20. bytewax/bytewax · ★ 2,023 · ⑂ 110

    Python Stream Processing

    • python
    • stream-processing
    • rust
    • data-engineering
    • data-processing
    • data-science
  21.

    A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

    • data
    • data-engineering
    • data-engineering-pipeline
    • etl-pipeline
    • cassandra-database
    • postgresql-database
  22. mlrun/mlrun · ★ 1,674 · ⑂ 308

    MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

    • mlops
    • python
    • data-science
    • machine-learning
    • data-engineering
    • experiment-tracking
  23.

    More than 2,000 Data engineer interview questions.

    • data-engineering
    • interview-questions
    • interview
    • hadoop
    • hadoop-hdfs
    • spark
  24. quixio/quix-streams · ★ 1,554 · ⑂ 107

    Python Streaming DataFrames for Kafka

    • kafka
    • python
    • stream-processing
    • data-engineering
    • data-science
    • machine-learning
  25. pyper-dev/pyper · ★ 1,521 · ⑂ 31

    Concurrent Python made simple

    • asyncio
    • concurrency
    • python
    • threading
    • data-pipelines
    • data-processing

Find Python engineers shipping Data engineering

The list above ranks the most-starred public Python repositories tagged with the Data engineering topic, drawn from the public GitHub graph. Across 251 matching repositories, the contributors are a tight cluster of engineers with both Python chops and real Data engineering experience.

That overlap is rare. Most Python engineers haven’t shipped data engineering work, and most data engineering maintainers don’t write Python. The people on this list’s contributor graph do both.

Refolk turns this list into a search. Ask for “Python Data engineering maintainers hiring” or “Python engineers shipping Data engineering in 2025” and Refolk returns a ranked shortlist with the commits, profiles, and projects behind each name.

How this list is built

Refolk searched GitHub for public Python repositories tagged with the Data engineering topic, ranked them by stargazer count, and kept those with at least 25 stars. The list refreshes once a day.

Last refreshed: Tue, 23 Jun 2026 20:26:27 GMT
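
The build steps described above (search by topic and language, rank by stars, keep repositories with at least 25 stars) can be sketched roughly as follows. This is a minimal illustration, not Refolk's actual pipeline: the function names, the `data-engineering` topic slug, and the page size are assumptions, and the sample entry `example/tiny-etl` is hypothetical. The query qualifiers (`topic:`, `language:`, `stars:>=`) and the `stargazers_count` field follow the public GitHub repository search API.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(topic="data-engineering", language="python",
                     min_stars=25, per_page=25):
    """Build a GitHub repository search URL sorted by stars, descending.

    The defaults mirror the rules stated on this page; the exact query
    Refolk sends is not documented here, so treat this as an assumption.
    """
    query = f"topic:{topic} language:{language} stars:>={min_stars}"
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def rank_repositories(items, min_stars=25, limit=25):
    """Drop repositories below the star threshold, then rank by stars."""
    kept = [repo for repo in items if repo.get("stargazers_count", 0) >= min_stars]
    kept.sort(key=lambda repo: repo["stargazers_count"], reverse=True)
    return kept[:limit]


if __name__ == "__main__":
    # Star counts for the first three entries are taken from the list above;
    # "example/tiny-etl" is a made-up repository that falls below the threshold.
    sample = [
        {"full_name": "pyper-dev/pyper", "stargazers_count": 1521},
        {"full_name": "apache/airflow", "stargazers_count": 45902},
        {"full_name": "example/tiny-etl", "stargazers_count": 12},
        {"full_name": "PrefectHQ/prefect", "stargazers_count": 22673},
    ]
    print(build_search_url())
    for rank, repo in enumerate(rank_repositories(sample), start=1):
        print(f"{rank}. {repo['full_name']} ({repo['stargazers_count']:,} stars)")
```

Keeping the ranking step separate from the HTTP request means the filter and sort can be checked offline against saved search results; fetching the URL itself (with authentication and pagination) is left out of this sketch.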
