Refolk

Top Data engineering repositories on GitHub

Pipelines, orchestrators, and ELT/ETL tooling.

Ranked by stars across 344 repositories tagged data-engineering. Refreshed daily.

  1. apache/superset · ★ 72,716 · ⑂ 17,207

    Apache Superset is a Data Visualization and Data Exploration Platform

    • superset
    • apache
    • apache-superset
    • data-visualization
    • data-viz
    • analytics
  2. GokuMohandas/Made-With-ML · ★ 47,510 · ⑂ 7,480

    Learn how to develop, deploy and iterate on production-grade ML applications.

    • machine-learning
    • deep-learning
    • pytorch
    • natural-language-processing
    • data-science
    • python
  3. apache/airflow · ★ 45,307 · ⑂ 17,010

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    • airflow
    • apache
    • apache-airflow
    • python
    • scheduler
    • workflow
  4.

    Data Engineering Zoomcamp is a free 9-week course on building production-ready data pipelines. The next cohort starts in January 2026. Join the course here 👇🏼

    • data-engineering
    • kafka
    • spark
    • dbt
    • docker
    • kestra
  5. eugeneyan/applied-ml · ★ 28,799 · ⑂ 3,841

    📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

    • applied-machine-learning
    • production
    • applied-data-science
    • machine-learning
    • data-science
    • reinforcement-learning
  6. PrefectHQ/prefect · ★ 22,318 · ⑂ 2,294

    Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

    • python
    • workflow
    • data-engineering
    • data-science
    • workflow-engine
    • prefect
  7. airbytehq/airbyte · ★ 21,210 · ⑂ 5,165

    Open-source data movement for ELT pipelines and AI agents — from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.

    • data
    • pipeline
    • data-analysis
    • data-engineering
    • java
    • python
  8. Avaiga/taipy · ★ 19,174 · ⑂ 1,971

    Turns Data and AI algorithms into production-ready web applications in no time.

    • automation
    • data-engineering
    • data-ops
    • data-visualization
    • datascience
    • developer-tools
  9. argoproj/argo-workflows · ★ 16,668 · ⑂ 3,520

    Workflow Engine for Kubernetes

    • workflow
    • kubernetes
    • argo
    • dag
    • knative
    • airflow
  10. dagster-io/dagster · ★ 15,439 · ⑂ 2,112

    An orchestration platform for the development, production, and observation of data assets.

    • data-pipelines
    • dagster
    • workflow
    • data-science
    • workflow-automation
    • python
  11. andkret/Cookbook · ★ 15,078 · ⑂ 2,705

    The Data Engineering Cookbook

    • data-engineer
    • data-engineering
    • big-data
    • best-practices
    • cookbook
  12. datastacktv/data-engineer-roadmap · ★ 12,753 · ⑂ 1,344

    Roadmap to becoming a data engineer in 2021

    • data-engineer-roadmap
    • data-engineering
    • cloud
    • roadmap
  13.

    Always know what to expect from your data.

    • pipeline-tests
    • dataquality
    • datacleaning
    • datacleaner
    • data-science
    • data-profiling
  14. xonsh/xonsh · ★ 9,323 · ⑂ 721

    🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.

    • xonsh
    • devops
    • iterm2
    • data-engineering
    • security-automation
    • raspberry-pi
  15. risingwavelabs/risingwave · ★ 8,986 · ⑂ 764

    Event streaming platform for agentic AI. Continuously ingest, transform, and serve event streams in real time, at scale.

    • database
    • stream-processing
    • rust
    • postgresql
    • kafka
    • materialized-view
  16. mage-ai/mage-ai · ★ 8,714 · ⑂ 962

    🧙 Build, run, and manage data pipelines for integrating and transforming data.

    • machine-learning
    • artificial-intelligence
    • data
    • data-engineering
    • data-science
    • python
  17. cocoindex-io/cocoindex · ★ 8,661 · ⑂ 640

    Incremental engine for long horizon agents 🌟 Star if you like it!

    • ai
    • change-data-capture
    • data-indexing
    • etl
    • indexing
    • python
  18. redpanda-data/connect · ★ 8,658 · ⑂ 939

    Fancy stream processing made operationally mundane

    • message-queue
    • stream-processing
    • streaming-data
    • message-bus
    • logs
    • stream-processor
  19. growthbook/growthbook · ★ 7,731 · ⑂ 741

    Open Source Feature Flags, Experimentation, and Product Analytics

    • abtesting
    • statistics
    • abtest
    • experimentation
    • split-testing
    • snowflake
  20. feast-dev/feast · ★ 7,008 · ⑂ 1,315

    The Open Source Feature Store for AI/ML

    • machine-learning
    • features
    • ml
    • big-data
    • feature-store
    • python
  21. cloudquery/cloudquery · ★ 6,397 · ⑂ 544

    Data pipelines for cloud config and security data. Build cloud asset inventory, CSPM, FinOps, and vulnerability management solutions. Extract from AWS, Azure, GCP, and 70+ cloud and SaaS sources.

    • aws
    • gcp
    • azure
    • sql
    • data-integration
    • elt
  22. evidence-dev/evidence · ★ 6,296 · ⑂ 348

    Business intelligence as code: build fast, interactive data visualizations in SQL and markdown

    • analytics
    • sql
    • business-intelligence
    • data-visualization
    • dbt
    • duckdb
  23. Eventual-Inc/Daft · ★ 5,454 · ⑂ 462

    High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale

    • machine-learning
    • python
    • data-engineering
    • distributed-computing
    • rust
    • big-data
  24. treeverse/lakeFS · ★ 5,330 · ⑂ 446

    lakeFS - Data version control for your data lake | Git for data

    • data-engineering
    • data-versioning
    • go
    • object-storage
    • data-lake
    • aws-s3
  25. dlt-hub/dlt · ★ 5,297 · ⑂ 500

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    • data
    • python
    • data-engineering
    • data-lake
    • data-loading
    • data-warehouse

Find engineers shipping Data engineering

The list above ranks the most-starred public repositories tagged with the Data engineering topic, drawn from the public GitHub graph. Across the 344 repositories tagged this way, the maintainers and top contributors form a tight cluster of the people actually building data engineering tools.

Looking for engineers who’ve worked on Data engineering for real, not just listed it on LinkedIn? The fastest path is the contributor list of these repos. Their commits, issues, and READMEs are public proof of depth.

Refolk turns this list into a search. Ask for “maintainers of top Data engineering repos who are hiring”, “Data engineering engineers in San Francisco”, or “founders shipping Data engineering” and Refolk returns a ranked shortlist with sources.

How this list is built

Refolk searched GitHub for public repositories tagged with the Data engineering topic, ranked them by stargazer count, and kept those with at least 50 stars. The list refreshes once a day.
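The steps described here (search by topic, rank by stargazer count, keep repositories with at least 50 stars) can be approximated with GitHub's public repository search endpoint. The sketch below is an illustration only, not Refolk's actual pipeline: the function names `build_search_url` and `rank_repos` are hypothetical, the topic slug `data-engineering` and the limit of 25 are taken from the page above, and the tiny sample entry is invented to show the star cutoff. It builds the request URL and ranks already-fetched results without making a network call.

```python
from urllib.parse import urlencode

# Public GitHub REST endpoint for repository search.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(topic: str, min_stars: int = 50, per_page: int = 25) -> str:
    """Build a search URL for one topic, sorted by stars, highest first.

    Hypothetical helper: the 50-star cutoff comes from the page text,
    the rest is an assumption about how such a query could be phrased.
    """
    query = f"topic:{topic} stars:>={min_stars}"
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def rank_repos(repos: list[dict], min_stars: int = 50, limit: int = 25) -> list[dict]:
    """Drop repositories below the star cutoff and rank the rest by stars."""
    kept = [r for r in repos if r["stargazers_count"] >= min_stars]
    kept.sort(key=lambda r: r["stargazers_count"], reverse=True)
    return kept[:limit]


if __name__ == "__main__":
    # Two entries use the counts shown in the list above; the third is a
    # made-up repository that falls below the cutoff.
    sample = [
        {"full_name": "apache/airflow", "stargazers_count": 45307, "forks_count": 17010},
        {"full_name": "apache/superset", "stargazers_count": 72716, "forks_count": 17207},
        {"full_name": "example/tiny-repo", "stargazers_count": 12, "forks_count": 1},
    ]
    print(build_search_url("data-engineering"))
    for rank, repo in enumerate(rank_repos(sample), start=1):
        print(rank, repo["full_name"], repo["stargazers_count"], repo["forks_count"])
```

The field names `full_name`, `stargazers_count`, and `forks_count` match the repository objects GitHub returns from this endpoint. A real daily refresh would also need authentication, pagination, and rate-limit handling, which are omitted here.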

Last refreshed: Thu, 07 May 2026 05:55:02 GMT
