Top data engineering repositories on GitHub
Pipelines, orchestrators, and ELT/ETL tooling.
Ranked by stars across 356 repositories tagged data-engineering. Refreshed daily.
- 1. apache/superset ★ 73,409 · ⑂ 17,674
Apache Superset is a Data Visualization and Data Exploration Platform
- superset
- apache
- apache-superset
- data-visualization
- data-viz
- analytics
- 2. GokuMohandas/Made-With-ML ★ 48,292 · ⑂ 7,594
Learn how to develop, deploy and iterate on production-grade ML applications.
- machine-learning
- deep-learning
- pytorch
- natural-language-processing
- data-science
- python
- 3. apache/airflow ★ 45,883 · ⑂ 17,264
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
- airflow
- apache
- apache-airflow
- python
- scheduler
- workflow
- 4. DataTalksClub/data-engineering-zoomcamp ★ 42,641 · ⑂ 8,441
Data Engineering Zoomcamp is a free 9-week course on building production-ready data pipelines. The next cohort starts in January 2026. Join the course here 👇🏼
- data-engineering
- kafka
- spark
- dbt
- docker
- kestra
- 5. eugeneyan/applied-ml ★ 29,811 · ⑂ 3,952
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
- applied-machine-learning
- production
- applied-data-science
- machine-learning
- data-science
- reinforcement-learning
- 6. kestra-io/kestra ★ 27,111 · ⑂ 2,627
Event Driven Orchestration & Scheduling Platform for Mission Critical Applications
- orchestration
- data-orchestration
- high-availability
- infrastructure-as-code
- automation
- devops
- 7. PrefectHQ/prefect ★ 22,653 · ⑂ 2,346
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
- python
- workflow
- data-engineering
- data-science
- workflow-engine
- prefect
- 8. airbytehq/airbyte ★ 21,501 · ⑂ 5,231
Open-source data movement for ELT pipelines and AI agents — from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.
- data
- pipeline
- data-analysis
- data-engineering
- java
- python
- 9. Avaiga/taipy ★ 19,244 · ⑂ 1,987
Turns Data and AI algorithms into production-ready web applications in no time.
- automation
- data-engineering
- data-ops
- data-visualization
- datascience
- developer-tools
- 10. argoproj/argo-workflows ★ 16,774 · ⑂ 3,559
Workflow Engine for Kubernetes
- workflow
- kubernetes
- argo
- dag
- knative
- airflow
- 11. dagster-io/dagster ★ 15,728 · ⑂ 2,165
An orchestration platform for the development, production, and observation of data assets.
- data-pipelines
- dagster
- workflow
- data-science
- workflow-automation
- python
- 12. andkret/Cookbook ★ 15,148 · ⑂ 2,718
The Data Engineering Cookbook
- data-engineer
- data-engineering
- big-data
- best-practices
- cookbook
- 13. datastacktv/data-engineer-roadmap ★ 12,745 · ⑂ 1,338
Roadmap to becoming a data engineer in 2021
- data-engineer-roadmap
- data-engineering
- cloud
- roadmap
- 14. fivetran/great_expectations ★ 11,592 · ⑂ 1,767
Always know what to expect from your data.
- pipeline-tests
- dataquality
- datacleaning
- datacleaner
- data-science
- data-profiling
- 15. cocoindex-io/cocoindex ★ 10,436 · ⑂ 812
Incremental engine for long horizon agents 🌟 Star if you like it!
- ai
- change-data-capture
- data-indexing
- etl
- indexing
- python
- 16. xonsh/xonsh ★ 9,527 · ⑂ 729
🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.
- xonsh
- devops
- iterm2
- data-engineering
- security-automation
- raspberry-pi
- 17. risingwavelabs/risingwave ★ 9,092 · ⑂ 778
Event streaming platform for agentic AI. Continuously ingest, transform, and serve event streams in real time, at scale.
- database
- stream-processing
- rust
- postgresql
- kafka
- materialized-view
- 18. mage-ai/mage-ai ★ 8,757 · ⑂ 971
🧙 Build, run, and manage data pipelines for integrating and transforming data.
- machine-learning
- artificial-intelligence
- data
- data-engineering
- data-science
- python
- 19. redpanda-data/connect ★ 8,684 · ⑂ 945
Fancy stream processing made operationally mundane
- message-queue
- stream-processing
- streaming-data
- message-bus
- logs
- stream-processor
- 20. growthbook/growthbook ★ 7,897 · ⑂ 769
Open Source Feature Flags, Experimentation, and Product Analytics
- abtesting
- statistics
- abtest
- experimentation
- split-testing
- snowflake
- 21. feast-dev/feast ★ 7,100 · ⑂ 1,346
The Open Source Feature Store for AI/ML
- machine-learning
- features
- ml
- big-data
- feature-store
- python
- 22. Zipstack/unstract ★ 6,666 · ⑂ 633
LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows
- ai-agents
- data-engineering
- document-ai
- generative-ai
- idp
- json-extraction
- 23. evidence-dev/evidence ★ 6,484 · ⑂ 364
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
- analytics
- sql
- business-intelligence
- data-visualization
- dbt
- duckdb
- 24. cloudquery/cloudquery ★ 6,441 · ⑂ 549
Data pipelines for cloud config and security data. Build cloud asset inventory, CSPM, FinOps, and vulnerability management solutions. Extract from AWS, Azure, GCP, and 70+ cloud and SaaS sources.
- aws
- gcp
- azure
- sql
- data-integration
- elt
- 25. Eventual-Inc/Daft ★ 5,571 · ⑂ 493
High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
- machine-learning
- python
- data-engineering
- distributed-computing
- rust
- big-data
Find engineers shipping data engineering software
The list above ranks the most-starred public repositories tagged with the data-engineering topic, drawn from the public GitHub graph. Across the 356 repositories tagged this way, the maintainers and top contributors form a tight cluster of the people actually building data engineering tools.
Looking for engineers who have done real data engineering work, not just listed it on LinkedIn? The fastest path is the contributor lists of these repos. Their commits, issues, and READMEs are public proof of depth.
Refolk turns this list into a search. Ask for “maintainers of top data engineering repos who are hiring”, “data engineering engineers in San Francisco”, or “founders shipping data engineering” and Refolk returns a ranked shortlist with sources.
How this list is built
Last refreshed: Sun, 21 Jun 2026 08:14:30 GMT
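The page does not describe its pipeline beyond "ranked by stars", so the following is only a minimal sketch of that one step: ordering repositories by star count and assigning ranks. The names `Repo`, `parse_count`, and `rank_by_stars` are hypothetical and introduced for illustration; the sample rows are copied from the list above.

```python
from typing import NamedTuple


class Repo(NamedTuple):
    """One row of the listing (hypothetical structure, for illustration)."""
    name: str
    stars: int
    forks: int


def parse_count(text: str) -> int:
    """Convert a display count such as '73,409' to an integer."""
    return int(text.replace(",", ""))


def rank_by_stars(repos):
    """Return (rank, repo) pairs, most-starred first, ranks starting at 1."""
    ordered = sorted(repos, key=lambda r: r.stars, reverse=True)
    return list(enumerate(ordered, start=1))


# Sample rows taken from the list above, deliberately out of order.
sample = [
    Repo("apache/airflow", parse_count("45,883"), parse_count("17,264")),
    Repo("apache/superset", parse_count("73,409"), parse_count("17,674")),
    Repo("GokuMohandas/Made-With-ML", parse_count("48,292"), parse_count("7,594")),
]

ranking = rank_by_stars(sample)
for rank, repo in ranking:
    print(f"{rank}. {repo.name} ★ {repo.stars:,} · ⑂ {repo.forks:,}")
```

Because Python's `sorted` is stable, repositories with equal star counts keep their input order; a real pipeline would likely need an explicit tie-breaker, which the page does not specify.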