Data Engineering
11 repositories · AI, LLMs & Data
Repositories — sorted by stars
| Repository | Stars | Language | Description |
|---|---|---|---|
| pandas-dev/pandas | ⭐ 48.7K | Python | Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more |
| apache/airflow | ⭐ 45.3K | Python | Apache Airflow - A platform to programmatically author, schedule, and monitor workflows |
| pola-rs/polars | ⭐ 38.4K | Rust | Extremely fast Query Engine for DataFrames, written in Rust |
| Unstructured-IO/unstructured | ⭐ 14.6K | HTML | Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding. |
| jupyter/notebook | ⭐ 13.1K | Jupyter Notebook | Jupyter Interactive Notebook |
| kedro-org/kedro | ⭐ 10.9K | Python | Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular. |
| treeverse/lakeFS | ⭐ 5.3K | Go | lakeFS - Data version control for your data lake \| Git for data |
| togethercomputer/RedPajama-Data | ⭐ 4.9K | Python | The RedPajama-Data repository contains code for preparing large datasets for training large language models. |
| n-riesco/ijavascript | ⭐ 2.3K | JavaScript | IJavascript is a JavaScript kernel for the Jupyter notebook |
| xavctn/img2table | ⭐ 865 | Python | img2table is a table identification and extraction Python library for PDF and images, based on OpenCV image processing |
| ricklupton/floweaver | ⭐ 464 | Python | View flow data as Sankey diagrams |