Data Engineering

11 repositories · AI, LLMs & Data


Repositories — sorted by stars

| Repository | Stars | Language | Description |
| --- | --- | --- | --- |
| pandas-dev/pandas | ⭐ 48.7K | Python | Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more |
| apache/airflow | ⭐ 45.3K | Python | Apache Airflow - A platform to programmatically author, schedule, and monitor workflows |
| pola-rs/polars | ⭐ 38.4K | Rust | Extremely fast Query Engine for DataFrames, written in Rust |
| Unstructured-IO/unstructured | ⭐ 14.6K | HTML | Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding. |
| jupyter/notebook | ⭐ 13.1K | Jupyter Notebook | Jupyter Interactive Notebook |
| kedro-org/kedro | ⭐ 10.9K | Python | Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular. |
| treeverse/lakeFS | ⭐ 5.3K | Go | lakeFS - Data version control for your data lake (Git for data) |
| togethercomputer/RedPajama-Data | ⭐ 4.9K | Python | The RedPajama-Data repository contains code for preparing large datasets for training large language models. |
| n-riesco/ijavascript | ⭐ 2.3K | JavaScript | IJavascript is a JavaScript kernel for the Jupyter notebook |
| xavctn/img2table | ⭐ 865 | Python | img2table is a table identification and extraction Python library for PDFs and images, based on OpenCV image processing |
| ricklupton/floweaver | ⭐ 464 | Python | View flow data as Sankey diagrams |
