Learn how to benchmark embedding models on your own data in this course for beginners.
In this course, you will learn:
- The limitations of extracting text from PDF files with Python libraries, and how to overcome them with the help of VLMs (Vision Language Models).
- How to divide the extracted text into chunks that preserve context.
- How to generate questions for each chunk using LLMs (Large Language Models).
- How to use embedding models to create vector representations of the chunks and questions.
- How to work with both open-source and proprietary embedding models.
- How to use llama.cpp to run models in the GGUF format locally on your machine.
- How to benchmark different embedding models using various metrics and statistical tests with the help of ranx.
- How to plot the vector representations to see whether clusters form.
- How to interpret the p-value that a statistical test provides.
- And much more!
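To make the benchmarking idea in the list above concrete, here is a minimal sketch in plain Python. It is not the course's code: the course works with real embedding models and ranx, while this example assumes made-up two-dimensional vectors and hand-written metrics. The function names (`rank_chunks`, `evaluate`) and the toy ids are hypothetical, and the sketch assumes that each question has exactly one relevant chunk (the chunk it was generated from).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(question_vec, chunk_vecs):
    """Return chunk ids sorted from most to least similar to the question."""
    scores = {cid: cosine(question_vec, vec) for cid, vec in chunk_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def evaluate(question_vecs, chunk_vecs, relevant, k=1):
    """Return (MRR, Recall@k), assuming one relevant chunk per question."""
    reciprocal_ranks = []
    hits = 0
    for qid, qvec in question_vecs.items():
        ranking = rank_chunks(qvec, chunk_vecs)
        position = ranking.index(relevant[qid]) + 1  # 1-based rank
        reciprocal_ranks.append(1.0 / position)
        if position <= k:
            hits += 1
    n = len(question_vecs)
    return sum(reciprocal_ranks) / n, hits / n

# Toy embeddings (assumption: a real model would return vectors
# with hundreds of dimensions, not two).
chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
question_vecs = {"q1": [0.9, 0.1], "q2": [0.2, 0.8], "q3": [1.0, 0.2]}
# Which chunk each question was generated from.
relevant = {"q1": "c1", "q2": "c2", "q3": "c3"}

mrr, recall_at_1 = evaluate(question_vecs, chunk_vecs, relevant, k=1)
print(f"MRR: {mrr:.3f}, Recall@1: {recall_at_1:.3f}")
```

Running the same evaluation with vectors produced by different embedding models, on the same chunks and questions, gives one score per model that can then be compared.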
You can find the slides, notebook, and scripts in this GitHub repository:
https://github.com/ImadSaddik/Benchmark_Embedding_Models
The dataset is available here:
https://huggingface.co/datasets/ImadSaddik/BenchmarkEmbeddingModelsCourse
To connect with Imad Saddik, check out his social accounts:
LinkedIn: https://www.linkedin.com/in/imadsaddik/
YouTube: https://www.youtube.com/@3CodeCampers
Website: https://imadsaddik.com/
⭐️ Course Contents ⭐️
(0:00:00) About the course
(0:06:05) Introduction
(0:17:58) Extracting text from PDF documents
(1:01:08) Divide text into coherent chunks
(1:23:10) Generate question-answer pairs from text chunks
(1:38:48) Embed text chunks and questions
(2:17:06) Statistical tests and metrics
(3:12:01) Expanding the dataset and adding more languages
(3:45:24) Conclusion
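As a rough illustration of what the "Statistical tests and metrics" chapter is about, the sketch below computes a p-value for the difference between two models with an exact paired sign-flip permutation test. This is a stand-in written for this page, not the course's method (the course relies on ranx for its tests). The function name and the per-question scores are hypothetical; the scores are assumed to be reciprocal ranks obtained on the same eight questions.

```python
from itertools import product

def paired_permutation_p_value(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired per-question scores.

    Under the null hypothesis that both models are equally good, the
    sign of each per-question difference is arbitrary, so every sign
    pattern is enumerated and the extreme ones are counted.
    Only practical for small samples (2**n patterns).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical reciprocal ranks for the same 8 questions.
model_a = [1.0, 1.0, 0.5, 1.0, 1.0, 0.5, 1.0, 1.0]
model_b = [0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 0.5, 0.5]

p_value = paired_permutation_p_value(model_a, model_b)
print(f"p-value: {p_value}")  # prints "p-value: 0.125"
```

Here model A scores higher on four questions and ties on the rest, yet the p-value is 0.125: if the two models were truly equivalent, a gap at least this large would still appear in 12.5% of the sign patterns. At the common 0.05 threshold, that is not enough evidence to say one model is better, which is why a larger question set helps.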
Description and video by freeCodeCamp.org. This page is an independent companion view; the video is embedded from YouTube.