Top Scala Data engineering repositories on GitHub
Pipelines, orchestrators, and ELT/ETL tooling. Filtered to projects whose primary language is Scala.
Ranked by stars across 14 Scala repositories tagged data-engineering. Refreshed daily.
- 1. metarank/metarank ★ 2,432 · ⑂ 109
A low-code machine learning personalized ranking service for articles, listings, search results, and recommendations that boosts user engagement. A friendly Learn-to-Rank engine.
- ranking
- scala
- search
- personalization
- machine-learning
- deep-learning
- 2. feathr-ai/feathr ★ 1,929 · ⑂ 244
Feathr – A scalable, unified data and AI engineering platform for enterprise
- feature-engineering
- feature-store
- artificial-intelligence
- mlops
- data-engineering
- data-quality
- 3. starlake-ai/starlake ★ 201 · ⑂ 28
Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
- spark
- bigquery
- hdfs
- redshift
- snowflake
- synapse
- 4. swoop-inc/spark-alchemy ★ 191 · ⑂ 33
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
- spark
- data-science
- scala
- data-engineering
- 5. SETL-Framework/setl ★ 186 · ⑂ 33
A simple Spark-powered ETL framework that just works 🍺
- spark
- etl
- framework
- scala
- setl
- pipeline
- 6. dimajix/flowman ★ 97 · ⑂ 19
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.
- hadoop
- spark
- scala
- etl
- flowman
- data-engineering
- 7. galliaproject/gallia-core ★ 89 · ⑂ 4
A schema-aware Scala library for data transformation
- scala
- data-transformation
- json
- spark
- etl
- nesting
- 8. StabRise/spark-pdf ★ 81 · ⑂ 4
PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and runs OCR on them
- ocr
- ocr-recognition
- pdf-document
- pdf-document-processor
- spark
- 9. CoxAutomotiveDataSolutions/waimak ★ 76 · ⑂ 17
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
- spark
- hadoop
- data-engineering
- scala
- 10. mattlianje/etl4s ★ 73 · ⑂ 5
Powerful, whiteboard-style ETL
- etl
- functional-programming
- streaming
- big-data
- data-engineering
- 11. opensnowcat/opensnowcat-collector ★ 72 · ⑂ 7
OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)
- snowplow
- analytics
- event-pipeline
- data-engineering
- data-pipeline
- 12. CoxAutomotiveDataSolutions/spark-distcp ★ 47 · ⑂ 33
A re-implementation of Hadoop DistCP in Apache Spark
- spark
- apache-spark
- hadoop
- distcp
- data-engineering
- 13. vitaliihonta/scala-ql ★ 38 · ⑂ 1
Data manipulation and reporting for Scala.
- data-engineering
- dsl
- functional
- scala
- csv
- excel
- 14. opensnowcat/opensnowcat-enrich ★ 30 · ⑂ 8
OpenSnowcat Enricher (Apache 2.0 License)
- analytics
- event-pipeline
- snowplow
- data-engineering
- data-pipeline
- ai-data-collection
Find Scala engineers shipping Data engineering
The list above ranks the most-starred public Scala repositories tagged with the Data engineering topic, drawn from the public GitHub graph. Across 14 matching repositories, the contributors are a tight cluster of engineers with both Scala chops and real Data engineering experience.
That overlap is rare. Most Scala engineers haven’t shipped Data engineering, and most Data engineering maintainers don’t write Scala. The people on this list’s contributor graph are the ones who do both.
Refolk turns this list into a search. Ask for “Scala Data engineering maintainers hiring” or “Scala engineers shipping Data engineering in 2025” and Refolk returns a ranked shortlist with the commits, profiles, and projects behind each name.
How this list is built
Last refreshed: Thu, 07 May 2026 06:52:03 GMT
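The page states only that repositories are filtered to Scala as the primary language, tagged data-engineering, and ranked by stars. A minimal sketch of that filter-and-sort step might look like the following. It assumes repository metadata has already been fetched into memory; the `Repo` record, the `rank_repos` function, and the two `example/...` entries are hypothetical names used for illustration, not part of any documented pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    """Hypothetical in-memory record for one repository."""
    full_name: str
    language: str   # primary language as reported by the host
    topics: tuple   # topic tags attached to the repository
    stars: int


def rank_repos(repos, language="Scala", topic="data-engineering"):
    """Keep repos matching the language and topic, most-starred first."""
    matching = [
        r for r in repos
        if r.language == language and topic in r.topics
    ]
    return sorted(matching, key=lambda r: r.stars, reverse=True)


# Star counts for the three real entries are copied from the list above;
# the two "example/..." rows are invented to show what gets filtered out.
SAMPLE = [
    Repo("dimajix/flowman", "Scala", ("spark", "etl", "data-engineering"), 97),
    Repo("feathr-ai/feathr", "Scala", ("feature-store", "data-engineering"), 1929),
    Repo("example/python-pipeline", "Python", ("data-engineering",), 5000),
    Repo("swoop-inc/spark-alchemy", "Scala", ("spark", "data-engineering"), 191),
    Repo("example/scala-web", "Scala", ("http",), 3000),
]

for position, repo in enumerate(rank_repos(SAMPLE), start=1):
    print(f"{position}. {repo.full_name} ★ {repo.stars:,}")
```

The non-Scala and untagged rows drop out regardless of their star counts, so a high-star project only appears if it satisfies both filters.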
Need a more specific search?
Refolk runs natural-language searches across GitHub, LinkedIn, and the open web.
Related lists
- Python · Data engineering
- Clojure · Data engineering
- TypeScript · React
- TypeScript · Next.js
- TypeScript · Vue
- TypeScript · Svelte
- TypeScript · Tailwind CSS
- TypeScript · GraphQL