Learn how to benchmark embedding models on your own data in this course for beginners.
In this course, you will learn:
– The limitations of extracting text from PDF files with Python libraries, and how to work around them with VLMs (Vision Language Models).
– How to divide the extracted text into chunks that preserve context.
– How to generate questions for each chunk using LLMs (Large Language Models).
– How to use embedding models to create vector representations of the chunks and questions.
– How to use both open source and proprietary embedding models.
– How to use llama.cpp to run models in the GGUF format locally on your machine.
– How to benchmark different embedding models using various metrics and statistical tests with the help of ranx.
– How to plot the vector representations to see whether clusters are forming.
– How to interpret the p-value that a statistical test provides.
– And much more!
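To give a feel for the benchmarking step, here is a minimal sketch in plain Python of how retrieval quality could be scored once chunks and questions have been embedded. It is not the course's code and it does not use ranx: the function names (`cosine`, `rank_chunks`, `evaluate`), the two-dimensional toy vectors, and the choice of recall@k and MRR as metrics are all assumptions made for illustration. Real embeddings would come from an embedding model and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_chunks(question_vec, chunk_vecs):
    """Return chunk ids sorted from most to least similar to the question."""
    return sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine(question_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )

def evaluate(question_vecs, chunk_vecs, gold, k=1):
    """Compute recall@k and mean reciprocal rank (MRR).

    `gold` maps each question id to the id of the chunk it was generated from.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for question_id, question_vec in question_vecs.items():
        ranking = rank_chunks(question_vec, chunk_vecs)
        position = ranking.index(gold[question_id]) + 1  # 1-based rank
        hits += position <= k
        reciprocal_rank_sum += 1.0 / position
    n = len(question_vecs)
    return {"recall@k": hits / n, "mrr": reciprocal_rank_sum / n}

# Toy data (made up): three chunks and three questions in 2D.
chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
question_vecs = {"q1": [0.9, 0.1], "q2": [0.2, 0.8], "q3": [1.0, 0.1]}
gold = {"q1": "c1", "q2": "c2", "q3": "c3"}

scores = evaluate(question_vecs, chunk_vecs, gold, k=1)
print(scores)
```

In this toy setup, two of the three questions retrieve their source chunk in first place and the third finds it in second place, so recall@1 is 2/3 and MRR is (1 + 1 + 1/2) / 3. Running the same evaluation once per embedding model gives the per-model scores that a statistical test can then compare.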
You can find the slides, notebook, and scripts in this GitHub repository:
https://github.com/ImadSaddik/Benchmark_Embedding_Models
The dataset is available here:
https://huggingface.co/datasets/ImadSaddik/BenchmarkEmbeddingModelsCourse
To connect with Imad Saddik, check out his social accounts:
LinkedIn: https://www.linkedin.com/in/imadsaddik/
YouTube: https://www.youtube.com/@3CodeCampers
Website: https://imadsaddik.com/
⭐️ Course Contents ⭐️
(0:00:00) About the course
(0:06:05) Introduction
(0:17:58) Extracting text from PDF documents
(1:01:08) Divide text into coherent chunks
(1:23:10) Generate question-answer pairs from text chunks
(1:38:48) Embed text chunks and questions
(2:17:06) Statistical tests and metrics
(3:12:01) Expanding the dataset and adding more languages
(3:45:24) Conclusion