Hi, I am Mehdi
Welcome to my personal webpage! As a Principal Data Scientist, my expertise encompasses diverse areas such as Advanced Machine Learning, Predictive Modeling, Customer Analytics, and Segmentation. I specialize in developing and deploying cutting-edge Large Language Models (LLMs) and Generative AI solutions. Holding a PhD in Geostatistics from the University of Alberta, I bring extensive experience across various industries, including engineering (oil & gas), telecommunications, finance (banking), and academia. I have shared some of my notable projects, blogs, lectures, and publications. Feel free to reach out if you have any questions or would like to connect!
Selected Projects
Over the years as a data scientist, I have encountered and addressed a variety of real-world data science challenges. Below, you’ll find a curated selection of projects that demonstrate my approach to solving these problems. Each project includes both code and documentation to provide a clear understanding of the issue and its solution. Links to Python code, notebooks, and the public datasets used are provided, enabling you to explore and run the notebooks on your own.
LangChain in Action: LLM Applications with RAG and Agents
Demonstrating LangChain's capabilities through chat models, prompt templates, memory management, and chains. Highlighting advanced implementations like Retrieval-Augmented Generation (RAG) and intelligent agents for multi-document chatbots. Showcasing versatile, LLM-powered solutions.
Fine-Tune LLaMA Model with Customer Support Chatbot Dataset
Showcasing fine-tuning of the LLaMA model using a customer support chatbot dataset. Highlighting improvements in response accuracy and domain-specific understanding. Demonstrating practical applications for tailored, conversational AI solutions.
Survival Analysis for Customer Churn
Applying survival analysis to Telco Customer Churn data to predict which customers are likely to churn. Utilizing non-parametric, parametric, and semi-parametric techniques such as Kaplan-Meier, Weibull, and Cox proportional hazards. Highlighting insights into churn timing and influencing factors for better business strategies.
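As a rough illustration of the non-parametric step, the sketch below computes a Kaplan-Meier survival curve in plain Python. The function name `kaplan_meier` and the toy tenure data are assumptions made for this example; they are not taken from the project notebook or the Telco dataset.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct churn time.

    durations: observed time per customer (e.g. tenure in months)
    events:    1 if the customer churned at that time, 0 if censored
    """
    # Distinct times at which at least one churn was observed.
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Customers who churned exactly at time t.
        churned = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1 - churned / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: six customers, two of them censored.
durations = [2, 3, 3, 5, 8, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
```

Each entry of `curve` pairs a churn time with the estimated probability of a customer staying beyond it. Censored customers count toward the at-risk group until they leave observation but never count as churn events.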
Customer Segmentation for Online Retail
Applying clustering to group customers based on behavior and interests. Utilizing mini-batch k-means for optimal cluster centroids and defining segments based on feature averages. Visualizing customer groups using PCA for actionable insights in targeted marketing strategies.
Streamlit App to Train Model for Customer Churn Prediction
Building a Streamlit app to train and deploy a customer churn prediction model. Enabling users to input data and visualize churn risk predictions. Providing an interactive interface for model training, evaluation, and insights into customer retention strategies.
Classification With Imbalanced Class
Addressing imbalanced classification by applying resampling techniques (oversampling, undersampling, and combinations) to adjust class distribution in training data. Ensuring more balanced data for predictive models to improve minority class performance. Utilizing K-fold cross-validation to evaluate model accuracy, while noting challenges of overfitting due to class distribution differences.
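A minimal sketch of the simplest of these techniques, random oversampling, is shown below in plain Python. The helper `random_oversample` and the toy rows are hypothetical and only illustrate the idea; the project itself may rely on a dedicated resampling library.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class that can be duplicated.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy training data: five majority rows, one minority row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = [0, 0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(rows, labels)
balanced_counts = Counter(y_bal)
```

When combined with K-fold cross-validation, resampling would normally be applied to each training fold only, so that duplicated rows never leak into the fold used for evaluation.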
Fine-Tuning BERT for Fake News Detection
Fine-tuning a pre-trained BERT model for fake news detection by adding a classification layer and training on a labeled dataset. Leveraging BERT's ability to capture contextual information for improved text classification. Exploring the process of adapting BERT for the specific task of detecting fake news.
Time Series Prediction with Real-World Datasets
Applying time series prediction to real-world datasets, including Sunspots and Total Energy Consumption in Alberta. Utilizing supervised learning techniques (linear regression, neural networks, decision trees, random forest) and deep learning models (RNN, LSTM, GRU) to forecast future values. Calculating uncertainty for future predictions.
Fine-tune BART for Tweet Classification
Leveraging Facebook's BART model for sentiment classification on Coronavirus-related tweets. Initially applying zero-shot classification to predict tweet sentiments, then fine-tuning the model using labeled data for improved accuracy. The fine-tuning process enables BART to adapt to the specific task, enhancing its ability to predict tweet sentiments with higher precision.
Prediction of Serious Fluid Leakage for Alberta Well Energy
Using machine learning to predict serious fluid leakage in hydrocarbon wells in Alberta, Canada. Analyzing well properties like age, depth, and production history to assess the risk of leakage. Addressing class imbalance in the dataset with sampling techniques to improve model accuracy. The study focuses on identifying reliable predictive models to forecast fluid leakage and assess potential environmental risks.
Majority Vote Technique for Feature Importance
Applying various predictive models (Linear Regression, Decision Tree, Random Forest, Gradient Boosting) to assess feature importance in both regression and classification problems. Using techniques like Coefficient of Determination and Predictive Power Score to quantify the relevance of input features. Integrating these models with a majority vote technique to identify the most and least important features for predicting targets.
Fine-Tuning GPT for Question Answering and Style Completion
Fine-tuning GPT-2 for specific tasks like sentiment analysis, question answering, and text summarization using few-shot learning. Further, adjusting the model for "style completion," enabling it to generate text in a specific style. This study showcases the versatility of GPT-2 in handling various natural language processing tasks, with applications in both research and industry.
Abstractive Semantic Search by OpenAI
Using OpenAI embeddings for semantic search to improve accuracy by understanding query meaning rather than just keywords. The model generates embeddings for both queries and documents, ranking search results based on similarity. This approach enhances context-aware retrieval, as demonstrated by querying a book to find the most relevant answers.
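The ranking step can be sketched as follows, assuming cosine similarity as the similarity measure. The three-dimensional vectors and passage names are made up for illustration; real embeddings would come from an embedding model and have many more dimensions, and no API call is shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings standing in for real model output.
doc_vecs = {
    "passage_a": [0.9, 0.1, 0.0],
    "passage_b": [0.1, 0.8, 0.3],
    "passage_c": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]
ranking = rank_documents(query_vec, doc_vecs)
```

The first entry of `ranking` is the passage whose embedding points in the direction closest to the query, which is what lets the search match on meaning rather than shared keywords.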
Fine-Tuning GPT for Movie Genre Prediction
Fine-tuning the distilgpt2 model (a more efficient version of GPT-2) for movie genre prediction. After pre-training on a large text corpus, the model is adjusted for the specific task of classifying movie genres based on plot summaries. This approach leverages the GPT architecture to enhance text classification, demonstrating its capability to predict movie genres with fine-tuned accuracy.
Selected Blogs
The blogs featured below cover a wide array of topics in data science, machine learning, and related technologies. They range from practical tutorials, such as creating data science web apps with Streamlit, to in-depth discussions on advanced topics like statistical analysis, clustering, multicollinearity, and text similarity in NLP. Drawing from my hands-on experience, these posts aim to share knowledge and foster learning. Many include Python code, datasets, and notebooks for you to explore and replicate the analyses, while others focus on demystifying methodologies and tools. Dive in to gain practical insights and discover innovative solutions to data science challenges.
Data Science Web App with Streamlit
Building interactive web apps with Streamlit for data visualization and machine learning model deployment. Enabling quick development of web interfaces to showcase insights and predictions. Featuring two customer churn prediction apps: one using a pre-trained model and the other for training a model from scratch.
Introduction to Hugging Face: Pre-trained Model and Tools
Exploring Hugging Face's contributions to natural language processing (NLP) through its open-source transformers library. Providing access to pre-trained models like BERT, GPT, and T5, ready to be fine-tuned for tasks like sentiment analysis, text classification, and summarization. Simplifying the fine-tuning process to enable researchers and developers to focus on specific applications without the complexity of model training.
Exploring Clustering Techniques
Examining unsupervised machine learning techniques for grouping unlabeled data through clustering. Covering methods like K-means, DBSCAN, Spectral Clustering, Agglomerative Clustering, and Gaussian Mixture models, applied to both synthetic and real-world data. Discussing the advantages and drawbacks of each technique for effective data segmentation.
Statistical Analysis in Python
Implementing various statistical analysis techniques in Python, including bootstrapping, confidence intervals, hypothesis testing (z-test, t-test, F-test, chi-square), A/B testing, and effect size. Applying these methods to two real-world datasets: the 2008 US swing state election results and finch beak dimensions.
Text Similarity in NLP
Exploring text similarity techniques in natural language processing (NLP) to measure the closeness between two text chunks based on surface form or meaning. Discussing methods like Jaccard similarity, Cosine similarity, Inverse Document Frequency, GloVe pre-trained models, and Word Mover's Distance. The focus is on both lexical (word-level) and semantic (phrase/paragraph-level) similarity, with an introduction to data processing for NLP.
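For the lexical side, a minimal sketch of Jaccard similarity over word sets might look like the following. Lowercasing and whitespace tokenization are simplifying assumptions for this example, and the two sentences are invented.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct words that two texts have in common."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # Treat two empty texts as identical.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical example sentences.
score = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
```

Because the measure only counts overlapping words, two sentences with the same meaning but different vocabulary would score low, which is the gap that semantic methods based on word embeddings aim to close.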
Detecting and Addressing Multicollinearity
Examining multicollinearity in multiple regression, where independent variables are highly correlated, affecting the precision and interpretability of model coefficients. Applying methods like p-values of regression coefficients and Variance Inflation Factor (VIF) to detect and address multicollinearity. The analysis includes both real-world and synthetic examples.
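As a small worked example of VIF, the sketch below covers the special case of exactly two predictors, where regressing one on the other gives an R² equal to their squared Pearson correlation, so VIF = 1 / (1 - r²). The function names and data are hypothetical; with more predictors, each VIF requires a full regression of that predictor on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF for either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical toy predictors with correlation 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
```

A VIF of 1 means a predictor is uncorrelated with the others. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic multicollinearity.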
Exploring LangChain for LLM-powered Applications
Exploring the use of LangChain to develop and deploy AI-driven applications using large language models (LLMs). LangChain simplifies the integration of LLMs with other systems for data retrieval and performance monitoring, addressing the complexities of working with diverse LLM architectures, training data, and use cases.
Dimensionality Reduction Techniques
Exploring unsupervised dimensionality reduction techniques to reduce the number of input variables while retaining key information. These methods, such as PCA, t-SNE, LLE, and LDA, help manage high-dimensional data, improve visualization, reduce multicollinearity, and mitigate overfitting. The notebook covers the implementation of these techniques to improve model efficiency and reduce training time.
Exploratory Data Analysis and Statistical Visualization
Exploring different approaches for exploratory data analysis and statistical visualization, including histograms, CDF, swarm plots, box plots, and kernel density estimation. Discussing probability distributions like binomial, Poisson, normal, and exponential. Focusing on data manipulation and analysis using Pandas to uncover insights and trends in the data.
Object Oriented Programming in Python
Introducing Object-Oriented Programming (OOP) in Python, focusing on the use of classes and methods. Highlighting key OOP concepts such as inheritance and polymorphism, which provide flexibility and enhance code reusability.
Copyright 2024. All rights reserved!