Context

Since the era of Generative AI began, I’ve tried several times to learn AI deeply. This is probably the first time I have managed it. I’m sharing this blog for anyone it could help.

Why Now?

AI as a skill is still relevant. There are plenty of problems where AI can be applied, Gen AI is not going anywhere, and Vibe Coded applications are making a lot of noise (yes, noise).

Since 2022, AI has moved from being niche to foundational. Data is increasingly used for making decisions, so it’s also a great idea to get into Data Mining slowly.

Personally, I do NOT want to JUST use the tools; I want to build a few of them myself. That’s why now is the time.

Introduction to Machine Learning

At its core, Machine Learning is a subset of Artificial Intelligence (AI) that allows computers to learn from data and make predictions or decisions without being explicitly programmed. The important part to note here is “explicitly programmed”.

Instead of writing a fixed set of rules for a task, ML algorithms identify patterns in historical data and use those patterns to handle new, unseen data. A human doesn’t need to be there to make each decision.

A Simple Analogy:

As a kid, I remember playing hours of Contra on an old Atari game station, with one of those 9999999 in 1 game cassettes. No one taught us how to play any of the games. We’d just insert the cassette, press a few buttons, and inch forward, backward, or jump, one frame at a time, slowly discovering that we could crouch to dodge bullets or take down enemies. We just learnt it on our own.

Think of how we teach a kid to recognize a dog or a bird. Instead of saying things like “dogs have four legs and bark,” we show them many pictures of dogs. Over time, the kid learns to recognize a dog by identifying patterns.

Machine Learning also works in a similar way. Simple!
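To make the "show examples instead of writing rules" idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the animals, the two features (weight and leg count), and the numbers are all made up, and the nearest-centroid approach is just one simple way to learn from examples.

```python
from math import dist

# Made-up training examples: (weight_kg, leg_count) -> label
examples = [
    ((30.0, 4), "dog"),
    ((25.0, 4), "dog"),
    ((8.0, 4), "dog"),
    ((0.5, 2), "bird"),
    ((1.2, 2), "bird"),
    ((0.1, 2), "bird"),
]


def fit_centroids(examples):
    """Average the feature vectors of each label (the 'learning' step)."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict(centroids, features):
    """Pick the label whose average example is closest to the new one."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(examples)
print(predict(centroids, (12.0, 4)))  # a new, unseen animal
print(predict(centroids, (0.3, 2)))
```

Notice that nowhere in the code is there a rule like "if it has four legs, it is a dog". The labels come entirely from the examples.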

High level steps involved with ML

Usually when we start with ML, we start with data and move towards making predictions or decisions based on the problem being solved. But before that, we need a problem statement.

Step 1: Define the Problem Statement

Every successful project starts with a clear, well-defined problem statement. ML is no different. Without a good problem statement, even the best infrastructure, well-defined data, skilled developers, and the most advanced algorithms won’t deliver meaningful results.

Problem statements ML could help us with

Who should get the next promotion?
Which movie will the user love to watch next?
Will this person be able to make the next EMI payment?
Is this a spam email?
Will India win the next Test in the Anderson-Tendulkar Series?

Clearly understanding the problem to be solved is half the battle won.

A problem statement should answer:

What exactly are we solving?
Why does this matter? Why now?
What does success look like?
What are we learning from this?

Important: Just because we have a problem, it does not automatically qualify to be solved with ML. Many problems can be solved with traditional programming, using rules and conditions that are easy to manage. For example, I would apply ML to identify whether a picture shows a cat or a dog. On the other hand, to build a calculator or a tax app, I would reach for traditional programming methods (no need for ML here).

Step 2: Prepare & Understand the Data

Data is the backbone of Machine Learning. For a Machine Learning model to work correctly, it needs data that is clean, well prepared, and as free of bias as possible. Every aspect of the data is important: its quality, its quantity, and how broadly it covers the problem.

Data is a prerequisite.

Data determines how good the outcome of Machine Learning is. If the data is not good, even the best algorithms cannot help. That is how critical data is to ML.

The data we work with can be structured, unstructured, or at times semi-structured. Either way, it needs to be processed and cleaned before it is used.

2.1 Collect & Clean the Data

Collection: Gather data from multiple sources (databases, APIs, web scraping, surveys, or logs).
Cleaning: Scrub the data.
1. Handle missing values (impute with mean/median or remove rows).
2. Remove duplicates and fix inconsistencies.
3. Standardize formats (dates, currency, text).
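
The cleaning steps above can be sketched in plain Python as follows. This is a minimal sketch, not a complete pipeline: the records, the field names ("id", "age", "joined"), and the two date formats are hypothetical. Duplicates are removed before imputing so that repeated rows do not skew the median. In a real project a library such as pandas would usually handle this.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records with a missing value, a duplicate, and mixed date formats.
raw = [
    {"id": 1, "age": 34, "joined": "2023-01-15"},
    {"id": 2, "age": None, "joined": "15/02/2023"},
    {"id": 3, "age": 52, "joined": "2023-03-01"},
    {"id": 3, "age": 52, "joined": "2023-03-01"},  # exact duplicate
    {"id": 4, "age": 41, "joined": "20/04/2023"},
]


def parse_date(text):
    """Try each known format and return an ISO date string (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def clean(records):
    # Remove exact duplicates while keeping the original order.
    seen, unique = set(), []
    for row in records:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    # Impute missing ages with the median of the known ages.
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill
        # Standardize every date to one format.
        r["joined"] = parse_date(r["joined"])
    return unique


cleaned = clean(raw)
for row in cleaned:
    print(row)
```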

2.2 Transform & Structure the Data

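One-Hot Encoding: Convert each category of a non-numerical feature into its own 0/1 column (a “Color” feature with values Red and Green becomes an “is Red” column and an “is Green” column, with 1 marking the category that applies).
Binning/Bucketing: Group numeric values into ranges (scores between 80 and 100 are grouped as “Distinction”).
Normalization: Scale all values into a fixed range (typically 0 to 1).
Standardization: Rescale the data so that it has a mean of 0 and a standard deviation of 1.
Feature Creation: Derive new features from existing ones (e.g., “Age” from “Date of Birth”, or a “Purchase Frequency” feature).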
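
Here is a small sketch of these transformations using only the Python standard library. The helper names, the sample values, and every bucket other than “Distinction” are assumptions made for illustration. Note that `normalize` and `standardize` as written would divide by zero if all values were identical.

```python
from datetime import date
from statistics import mean, pstdev


def one_hot(values):
    """One 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def bin_score(score):
    """Binning: map a numeric score to a named bucket (cut-offs are illustrative)."""
    if score >= 80:
        return "Distinction"
    if score >= 60:
        return "First Class"
    return "Other"


def normalize(values):
    """Min-max scaling into the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: the result has mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def age_on(dob, today):
    """Feature creation: derive age in whole years from a date of birth."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


colors = ["red", "green", "red"]
scores = [50, 75, 100]

print(one_hot(colors))
print([bin_score(s) for s in scores])
print(normalize(scores))
print(standardize(scores))
print(age_on(date(1990, 6, 15), date(2024, 6, 14)))
```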

2.3 Explore & Understand the Data (Exploratory Data Analysis)

Once the data is prepared and processed, the next important step is to get a proper sense of the data before we start modelling. Skipping this step usually leads to poor results. It helps us uncover trends, outliers, missing data, and correlations.

The steps involved are:

Visualize data: Histograms, scatter plots, and others.
Check distributions: Are features normally distributed?
Identify outliers & anomalies: Remove or transform them.
Understand relationships: Which features influence the target variable?
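
A few of these checks can be sketched without any plotting library. The sketch below covers summary statistics, outlier detection with the common 1.5 x IQR rule, and a Pearson correlation between a feature and a target. The data and the variable names are invented for the example, and the IQR rule is only one of several ways to flag outliers.

```python
from statistics import mean, median, quantiles


def summarize(values):
    """Basic summary statistics for one numeric feature."""
    return {"min": min(values), "median": median(values),
            "mean": mean(values), "max": max(values)}


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


def pearson(xs, ys):
    """Pearson correlation: +1 is a perfect positive linear relationship."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one suspicious income, and study hours vs. exam scores.
incomes = [30, 32, 35, 31, 33, 34, 36, 250]
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 73, 79, 82]

print(summarize(incomes))
print(iqr_outliers(incomes))
print(round(pearson(hours, scores), 3))
```

A large gap between the mean and the median, as in the income example, is often the first hint that a feature is skewed or contains outliers.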

2.4 Split the Data

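Split Validation: Split the data into training, validation, and test sets as 70:20:10, or into training and test sets as 80:20.
Cross-Validation: Rotate which subset is held out for validation to check that the model generalizes well.
Random Shuffling: Reorder the data before splitting to remove sequence bias.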
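
A minimal sketch of shuffling, a 70:20:10 split, and k-fold cross-validation is shown below. The function names, the fixed seed, and the choice of 5 folds are assumptions for the example. Libraries such as scikit-learn ship ready-made helpers for this.

```python
import random


def train_val_test_split(rows, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle a copy of the rows, then cut it into three parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def k_fold(rows, k=5):
    """Yield (train, validation) pairs; each fold serves as validation once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


data = list(range(100))

train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))

for fold_train, fold_val in k_fold(train, k=5):
    print(len(fold_train), len(fold_val))
```

Here the cross-validation runs on the training portion only, so the test set stays untouched until the final evaluation.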

Step 3: Choose the Right Algorithm

Once the data is ready, the next step is to pick an ML algorithm that matches your data type, problem type, and constraints. Machine learning incorporates many mathematical and statistical algorithms, and choosing the right one(s) for the job is not an easy task.

Algorithms can be picked based on what needs to be done:

Classification -> Logistic Regression, Random Forest, SVM, Neural Networks.
Regression -> Linear Regression, Decision Trees, Gradient Boosting.
Clustering -> K-Means, DBSCAN, Hierarchical Clustering.
Sequential Data Analysis -> RNNs, LSTMs.

Depending on the input you have and the outcome you expect, choose the algorithm(s).

Step 4: Train the Model

The next step is to let the algorithm learn patterns from data.

Optimize for minimal error, as measured by a loss function.

Use training curves (loss vs epochs) to detect overfitting or underfitting.
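
To show what "minimizing a loss" looks like, here is a minimal sketch that trains a one-feature linear model with gradient descent and records the loss at every epoch. The data (generated from y = 2x + 1), the learning rate, and the epoch count are assumptions chosen so that the example converges; real training setups are usually more involved.

```python
def train_linear(xs, ys, lr=0.05, epochs=1000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []  # loss per epoch: the training curve
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b, history = train_linear(xs, ys)
print(round(w, 3), round(b, 3))
print(history[0], history[-1])
```

The `history` list is the "loss vs epochs" curve for the training data. Tracking the same loss on a validation set alongside it is what reveals overfitting: the training loss keeps falling while the validation loss starts to rise.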

Step 5: Evaluate the Model

Check how well the model generalizes to unseen data.

Metrics for Classification: Accuracy, Precision, Recall, F1-score, AUC-ROC.
Metrics for Regression: RMSE, MAE, R².
Confusion Matrix: Break down true positives, false positives, etc.
Cross Validation: Ensure model stability across different data subsets.
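
The metrics above can be computed by hand in a few lines, which helps in understanding what they mean. This is a sketch with made-up labels and predictions; it covers the confusion matrix counts, accuracy, precision, recall, F1-score, MAE, and RMSE, and leaves out AUC-ROC and R².

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


# Made-up labels (1 = spam, 0 = not spam) and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))

# Made-up regression targets and predictions.
print(regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```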

More about this in another blog.

Step 6: Fine-Tuning (Optimization)

Now that we understand how the model is performing, we can improve it by tuning hyperparameters or improving features.

Hyperparameter tuning: Grid Search, Random Search, Bayesian Optimization.
Regularization: L1/L2 to prevent overfitting.
Ensemble learning: Combine models (e.g., bagging, boosting) for higher accuracy.
Feature selection: Remove irrelevant or redundant features.
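
As a small illustration of two of these ideas together, the sketch below runs a grid search over the strength of an L2 penalty. To keep it short it uses a deliberately tiny model: one feature, no intercept, and a closed-form solution. The data, the candidate grid, and the function names are all assumptions for the example.

```python
def fit_ridge(xs, ys, l2):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + l2)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, val, grid):
    """Try every candidate penalty; keep the one with the lowest validation error."""
    results = []
    for l2 in grid:
        w = fit_ridge(train[0], train[1], l2)
        results.append((mse(w, val[0], val[1]), l2, w))
    return min(results)  # tuples compare by validation error first


# Made-up noisy training data and a separate validation set.
train = ([1.0, 2.0, 3.0], [2.2, 3.9, 6.3])
val = ([4.0, 5.0], [8.0, 10.0])

best_error, best_l2, best_w = grid_search(train, val, grid=[0.0, 0.1, 1.0, 10.0])
print(best_l2, round(best_w, 4), round(best_error, 4))
```

The important design choice is that candidates are compared on the validation set, not on the training set. On the training data alone, the unregularized model (a penalty of 0) would always look best.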

Again, this needs another blog to go deeper into these terms.

Wrap Up

Broadly, these are the steps involved in building a Machine Learning model. After Steps 1 through 6, we deploy the solution to different environments (Development, Test, Pre-Production, and Production) and continuously improve it.

Let's Connect

Hi, I’m Sandeep Gokhale. I’m passionate about building high-performing teams at my company, Techvito, and I write about Technology, People, Processes, and some more fun stuff.

One of my life’s missions is to do whatever it takes to build world-class products and deliver exceptional client outcomes.

In case you’re looking for a technology partner to accelerate your business goals with clarity, speed, quality, and security, my team and I are here and more than ready to help you make it happen.

Feel free to connect with me on LinkedIn and Twitter.

Until Next time!