Top data engineering repositories on GitHub
Pipelines, orchestrators, and ELT/ETL tooling.
Ranked by stars across 344 repositories tagged data-engineering. Refreshed daily.
- 1. apache/superset · ★ 72,716 · ⑂ 17,207
  Apache Superset is a Data Visualization and Data Exploration Platform
  - superset
  - apache
  - apache-superset
  - data-visualization
  - data-viz
  - analytics
- 2. GokuMohandas/Made-With-ML · ★ 47,510 · ⑂ 7,480
  Learn how to develop, deploy and iterate on production-grade ML applications.
  - machine-learning
  - deep-learning
  - pytorch
  - natural-language-processing
  - data-science
  - python
- 3. apache/airflow · ★ 45,307 · ⑂ 17,010
  Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
  - airflow
  - apache
  - apache-airflow
  - python
  - scheduler
  - workflow
- 4. DataTalksClub/data-engineering-zoomcamp · ★ 40,688 · ⑂ 8,126
  Data Engineering Zoomcamp is a free 9-week course on building production-ready data pipelines. The next cohort starts in January 2026. Join the course here 👇🏼
  - data-engineering
  - kafka
  - spark
  - dbt
  - docker
  - kestra
- 5. eugeneyan/applied-ml · ★ 28,799 · ⑂ 3,841
  📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
  - applied-machine-learning
  - production
  - applied-data-science
  - machine-learning
  - data-science
  - reinforcement-learning
- 6. PrefectHQ/prefect · ★ 22,318 · ⑂ 2,294
  Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
  - python
  - workflow
  - data-engineering
  - data-science
  - workflow-engine
  - prefect
- 7. airbytehq/airbyte · ★ 21,210 · ⑂ 5,165
  Open-source data movement for ELT pipelines and AI agents, from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.
  - data
  - pipeline
  - data-analysis
  - data-engineering
  - java
  - python
- 8. Avaiga/taipy · ★ 19,174 · ⑂ 1,971
  Turns Data and AI algorithms into production-ready web applications in no time.
  - automation
  - data-engineering
  - data-ops
  - data-visualization
  - datascience
  - developer-tools
- 9. argoproj/argo-workflows · ★ 16,668 · ⑂ 3,520
  Workflow Engine for Kubernetes
  - workflow
  - kubernetes
  - argo
  - dag
  - knative
  - airflow
- 10. dagster-io/dagster · ★ 15,439 · ⑂ 2,112
  An orchestration platform for the development, production, and observation of data assets.
  - data-pipelines
  - dagster
  - workflow
  - data-science
  - workflow-automation
  - python
- 11. andkret/Cookbook · ★ 15,078 · ⑂ 2,705
  The Data Engineering Cookbook
  - data-engineer
  - data-engineering
  - big-data
  - best-practices
  - cookbook
- 12. datastacktv/data-engineer-roadmap · ★ 12,753 · ⑂ 1,344
  Roadmap to becoming a data engineer in 2021
  - data-engineer-roadmap
  - data-engineering
  - cloud
  - roadmap
- 13. great-expectations/great_expectations · ★ 11,462 · ⑂ 1,744
  Always know what to expect from your data.
  - pipeline-tests
  - dataquality
  - datacleaning
  - datacleaner
  - data-science
  - data-profiling
- 14. xonsh/xonsh · ★ 9,323 · ⑂ 721
  🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.
  - xonsh
  - devops
  - iterm2
  - data-engineering
  - security-automation
  - raspberry-pi
- 15. risingwavelabs/risingwave · ★ 8,986 · ⑂ 764
  Event streaming platform for agentic AI. Continuously ingest, transform, and serve event streams in real time, at scale.
  - database
  - stream-processing
  - rust
  - postgresql
  - kafka
  - materialized-view
- 16. mage-ai/mage-ai · ★ 8,714 · ⑂ 962
  🧙 Build, run, and manage data pipelines for integrating and transforming data.
  - machine-learning
  - artificial-intelligence
  - data
  - data-engineering
  - data-science
  - python
- 17. cocoindex-io/cocoindex · ★ 8,661 · ⑂ 640
  Incremental engine for long horizon agents 🌟 Star if you like it!
  - ai
  - change-data-capture
  - data-indexing
  - etl
  - indexing
  - python
- 18. redpanda-data/connect · ★ 8,658 · ⑂ 939
  Fancy stream processing made operationally mundane
  - message-queue
  - stream-processing
  - streaming-data
  - message-bus
  - logs
  - stream-processor
- 19. growthbook/growthbook · ★ 7,731 · ⑂ 741
  Open Source Feature Flags, Experimentation, and Product Analytics
  - abtesting
  - statistics
  - abtest
  - experimentation
  - split-testing
  - snowflake
- 20. feast-dev/feast · ★ 7,008 · ⑂ 1,315
  The Open Source Feature Store for AI/ML
  - machine-learning
  - features
  - ml
  - big-data
  - feature-store
  - python
- 21. cloudquery/cloudquery · ★ 6,397 · ⑂ 544
  Data pipelines for cloud config and security data. Build cloud asset inventory, CSPM, FinOps, and vulnerability management solutions. Extract from AWS, Azure, GCP, and 70+ cloud and SaaS sources.
  - aws
  - gcp
  - azure
  - sql
  - data-integration
  - elt
- 22. evidence-dev/evidence · ★ 6,296 · ⑂ 348
  Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
  - analytics
  - sql
  - business-intelligence
  - data-visualization
  - dbt
  - duckdb
- 23. Eventual-Inc/Daft · ★ 5,454 · ⑂ 462
  High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
  - machine-learning
  - python
  - data-engineering
  - distributed-computing
  - rust
  - big-data
- 24. treeverse/lakeFS · ★ 5,330 · ⑂ 446
  lakeFS - Data version control for your data lake | Git for data
  - data-engineering
  - data-versioning
  - go
  - object-storage
  - data-lake
  - aws-s3
- 25. dlt-hub/dlt · ★ 5,297 · ⑂ 500
  data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
  - data
  - python
  - data-engineering
  - data-lake
  - data-loading
  - data-warehouse
Find engineers shipping data engineering tools
The list above ranks the most-starred public repositories tagged with the data-engineering topic, drawn from the public GitHub graph. Across the 344 repositories tagged this way, the maintainers and top contributors form a small, tightly connected group of people who actually build data engineering tooling.
Looking for engineers who have done real data engineering work, not just listed it on LinkedIn? The fastest path is the contributor lists of these repositories: their commits, issues, and READMEs are public proof of depth.
Refolk turns this list into a search. Ask for “maintainers of top data engineering repos who are hiring”, “data engineers in San Francisco”, or “founders shipping data engineering tools”, and Refolk returns a ranked shortlist with sources.
How this list is built
Last refreshed: Thu, 07 May 2026 05:55:02 GMT
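The page says only that repositories are "ranked by stars" within the data-engineering topic, so the following is a minimal sketch of that ordering, not the site's actual pipeline. The `Repo` type, the `rank_by_stars` helper, the name-based tie-break, and the `example/untagged-repo` entry are all hypothetical; the other star counts and topics are copied from entries in the list above. One plausible data source would be GitHub's repository search API queried with `topic:data-engineering` and sorted by stars, but that is an assumption.

```python
from typing import TypedDict


class Repo(TypedDict):
    full_name: str
    stars: int
    topics: list[str]


def rank_by_stars(repos: list[Repo], topic: str) -> list[tuple[int, Repo]]:
    """Keep repos tagged with `topic` and number them by descending star count."""
    tagged = [r for r in repos if topic in r["topics"]]
    # Highest star count first; the name tie-break is an assumption made here
    # so that equal counts still produce a deterministic order.
    ordered = sorted(tagged, key=lambda r: (-r["stars"], r["full_name"].lower()))
    return list(enumerate(ordered, start=1))


# Four entries copied from the list above, deliberately out of order,
# plus one hypothetical repo that lacks the topic and should be filtered out.
sample: list[Repo] = [
    {"full_name": "Avaiga/taipy", "stars": 19_174,
     "topics": ["automation", "data-engineering", "data-ops"]},
    {"full_name": "example/untagged-repo", "stars": 99_999,  # hypothetical
     "topics": ["python"]},
    {"full_name": "PrefectHQ/prefect", "stars": 22_318,
     "topics": ["python", "workflow", "data-engineering"]},
    {"full_name": "DataTalksClub/data-engineering-zoomcamp", "stars": 40_688,
     "topics": ["data-engineering", "kafka", "spark"]},
    {"full_name": "airbytehq/airbyte", "stars": 21_210,
     "topics": ["data", "pipeline", "data-engineering"]},
]

ranking = rank_by_stars(sample, "data-engineering")
for position, repo in ranking:
    print(f"{position}. {repo['full_name']} · ★ {repo['stars']:,}")
```

Filtering before sorting keeps the rank numbers contiguous within the topic, which matches how the list above numbers its entries from 1 to 25.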