لَآ إِلَـٰهَ إِلَّا هُوَ
LA ILAHA ILLA HU
Allah, Your Lord There Is No Deity Except Him.

وَأَنَّا لَمَسْنَا ٱلسَّمَآءَ فَوَجَدْنَـٰهَا مُلِئَتْ حَرَسًۭا شَدِيدًۭا وَشُهُبًۭا
˹Earlier˺ we tried to reach heaven ˹for news˺, only to find it filled with stern guards and shooting stars. (Al-Quran Surah Al-Jinn Aya No 8)
إِلَّا مَنِ ٱسْتَرَقَ ٱلسَّمْعَ فَأَتْبَعَهُۥ شِهَابٌۭ مُّبِينٌۭ
except the one eavesdropping, who is then pursued by a visible flare. (Al-Quran Surah Al-Hijr Aya No 18)

# Python Data Science Machine Learning Asteroid Diameter Prediction Case Study 1

1. Problem Overview
2. Dataset Source And Description
3. Performance Metrics Used
4. Approach For Problem Solving
5. Basic Exploratory Data Analysis
6. Code Snippets And Outputs
7. Future Work
8. Project Code
9. References

Problem Overview
Asteroids are small rocky objects that orbit the Sun.
Although they orbit the Sun just as the planets do, they are much smaller than the planets.
There are millions of asteroids, and most of them are found in the asteroid belt, which lies between Mars and Jupiter.
The problem we deal with here is a regression problem: predicting an estimate of an asteroid’s diameter.
We do this because knowing an asteroid’s size helps in assessing and preventing the damage it could cause.

Why Does an Asteroid Hit Planet Earth?
Due to Earth’s escape velocity, the minimum impact velocity is 11 km/s, and asteroid impacts on Earth average around 17 km/s.
The most probable impact angle is 45 degrees.
Impact conditions such as asteroid size, speed, density and impact angle determine the kinetic energy released in an impact event.

Dataset Source And Description
Dataset Source:
The Source of the dataset can be found here

Dataset Description
For this problem statement, we are using Asteroid.csv to predict the asteroid diameter.
This dataset has the following attributes:

1.full_name : Asteroid object with full name and designation.
2.a : Semi-major axis of an Asteroid in AU(Astronomical Units).
3.e : Eccentricity of an Asteroid.
4.i : Inclination of an Asteroid in degrees.
5.om : Longitude of the ascending node of an Asteroid in degrees.
6.w : Argument of the perihelion in degrees.
7.q : Perihelion distance of an Asteroid in AU(Astronomical Units).
8.ad : Aphelion distance of an Asteroid in AU(Astronomical Units).
9.per_y : Orbital period in years.
10.data_arc : Number of days spanned by the observation data arc.
11.condition_code : MPC(Minor Planet Center) ‘U’(Uncertainty) parameter for any Asteroid.
12.G : Magnitude slope parameter of an Asteroid.
13.n_obs_used : Number of observations (of all types) used.
14.H : Absolute Magnitude Parameter in mag.
15.diameter : Asteroid diameter in km.
16.extent : Asteroid’s bi/tri axial ellipsoid dimensions in km.
17.albedo : Geometric albedo.
18.rot_per : Rotational Period of an Asteroid measured in h.
19.GM : Standard gravitational parameter, Product of mass and gravitational constant.
20.BV : Color index B-V magnitude difference in mag.
21.UB : Color index U-B magnitude difference in mag.
22.IR : Color index I-R magnitude difference in mag.
23.spec_B : Spectral taxonomic type(SMASSII).
24.spec_T : Spectral taxonomic type(Tholen).
25.neo : Near Earth Object(flag Y/N).
26.pha : Potentially Hazardous Asteroid(flag Y/N).
27.moid : Earth Minimum orbit Intersection Distance in AU.

Performance Metrics Used
We are using two performance metrics for this problem statement. They are as follows:

1.R-Squared:
R-Squared (R²), also known as the coefficient of determination, is the proportion of the variation in Y (the dependent or target variable) that is explained by the independent variables X.
It is a measure of the goodness of fit of the model.
If R² is 0.8, it means 80% of the variation in the output can be explained by the input variables.
So, in simple terms, the higher the R², the more variation is explained by your input variables and hence the better your model.
R² is defined as one minus the ratio of the residual sum of squares to the total sum of squares: R² = 1 − SSR/SST.

SSR (Sum of Squares of Residuals) is the sum of the squares of the differences between the actual observed values (y) and the predicted values (ŷ).
SST (Total Sum of Squares) is the sum of the squares of the differences between the actual observed values (y) and the average of the observed values (yavg).
R² typically lies in the range 0 to 1, where 0 indicates a poor fit (the model explains no more variation than always predicting yavg) and 1 indicates a perfect fit. It can be negative when the model fits worse than always predicting yavg.
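As a minimal sketch of the definition R² = 1 − SSR/SST, the function below computes R² in plain Python. The function name `r_squared` and the sample diameters are made up for illustration; they are not taken from the project code.

```python
def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SSR / SST for two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    y_avg = sum(y_true) / len(y_true)
    # SSR: squared differences between actual and predicted values.
    ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    # SST: squared differences between actual values and their mean.
    sst = sum((y - y_avg) ** 2 for y in y_true)
    return 1.0 - ssr / sst


# Hypothetical diameters in km, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
score = r_squared(actual, predicted)
print(f"R^2 = {score:.2f}")
```

Predicting the mean of the actual values for every observation gives SSR equal to SST, and therefore an R² of 0.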

2.Negative Mean Absolute Error(NMAE)
This is the negation of the Mean Absolute Error (MAE). MAE measures the average size of the errors in a set of predictions.
It is the mean of the absolute differences between the predicted values and the actual values.
Computing MAE is an essential step when developing a regression model.
NMAE lies in the range (-∞,0], where 0 means there is no difference between the actual and predicted values.

The Need for MAE
Consider we are predicting the age of a dog.
Actual age of that dog is 6 years.
After applying a Machine Learning model we predicted the age of that same dog as 9 years.
Now we can see the clear difference of 3 years between the ages.
This difference is called absolute error.
The mean of the absolute errors over all observations is calculated by the following formula:

MAE = (1/n) Σ |yi − y|

Where,
yi = predicted value
y = actual value
n = number of observations

The factor 1/n takes the mean of the absolute errors over all observations.
Negative Mean Absolute Error can be written as

NMAE = −MAE = −(1/n) Σ |yi − y|
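The dog-age example can be turned into a small plain-Python sketch of MAE and its negation. The function names and the two extra dogs are hypothetical; only the first pair (actual 6, predicted 9) comes from the example above.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of |predicted - actual| over all observations."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(p - a) for a, p in zip(y_true, y_pred)) / len(y_true)


def negative_mean_absolute_error(y_true, y_pred):
    """Negation of MAE, so that larger (closer to 0) is better."""
    return -mean_absolute_error(y_true, y_pred)


# First dog: actual age 6, predicted age 9, absolute error 3.
# The other two dogs are made up for illustration.
actual_ages = [6, 4, 10]
predicted_ages = [9, 4, 8]

mae = mean_absolute_error(actual_ages, predicted_ages)
nmae = negative_mean_absolute_error(actual_ages, predicted_ages)
print(f"MAE = {mae:.3f}, NMAE = {nmae:.3f}")
```

Negating the error is convenient when a library expects a score to maximize: a model with NMAE closer to 0 is the better one.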

Approach For Problem Solving
We will solve the problem statement by the following approach

1.Perform EDA to uncover any imbalance in the categorical variables and find the missing values in the dataset.
2.Check the correlation of each categorical and numerical feature with the target variable.
3.Bin the target variable and check the PDFs and CDFs of the binned target variable.
4.Remove the useless features in which almost all values are missing.
5.Perform train-test-validation splitting and feature engineering, and encode all the numerical as well as categorical features.
6.Train all candidate regression models and select the model that gives the best performance values.

Basic Exploratory Data Analysis
The Exploratory Data Analysis involved the following steps

a)Checking the data
The dataset has a total of 839736 data points and 27 features. Given below is a basic snapshot of our data.

Our dataset also has missing values.

We delete the features that have more than 90% missing values.
There are 9 columns in which more than 90% of the values are missing; before deletion, the shape of our dataset is (839736,27).
After deleting those features, the shape changes to (839736,18).
Our dataset now has 3 categorical and 15 numerical features.
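A minimal pandas sketch of this step is shown below. The helper name `drop_sparse_columns` and the toy frame are assumptions for illustration; the real dataset would be loaded from Asteroid.csv instead.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.90):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isna().mean()
    sparse_cols = missing_fraction[missing_fraction > threshold].index
    return df.drop(columns=sparse_cols)


# Toy frame for illustration only; values are made up.
n = 20
toy = pd.DataFrame({
    "a": np.linspace(2.0, 3.0, n),
    "GM": [62.6] + [np.nan] * (n - 1),       # 95% missing, gets dropped
    "albedo": [0.1] * 10 + [np.nan] * 10,    # 50% missing, is kept
})

reduced = drop_sparse_columns(toy)
print(toy.shape, "->", reduced.shape)
```

Computing the missing fraction per column first also makes it easy to inspect which features sit just below the cutoff before deciding how to impute them.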
Even after removing those features, some features in our dataset still have missing values.
We will use two possible Approaches here:

1.Median and Mode Based Imputation: Fill the missing values with the median or mode of the particular feature.
This is done if a feature has less than one percent missing values.
However, we have also chosen median based imputation for the albedo feature.
2.Build an ML model to predict the missing values and fill those missing values with the model’s predictions.

Using Median Based Imputation
We have used median based imputation to fill the missing values of semi-major axis (a), aphelion distance (ad) and orbital period (per_y).
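Median based imputation for these three columns can be sketched with pandas as follows. The helper name `impute_median` and the toy values are assumptions; only the column names a, ad and per_y come from the dataset description.

```python
import numpy as np
import pandas as pd


def impute_median(df, columns):
    """Return a copy of `df` with NaNs in `columns` replaced by each column's median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy values for illustration only.
toy = pd.DataFrame({
    "a": [2.0, 3.0, np.nan, 5.0],
    "ad": [2.5, np.nan, 3.5, 4.5],
    "per_y": [3.0, 4.0, 5.0, np.nan],
})

filled = impute_median(toy, ["a", "ad", "per_y"])
print(filled)
```

The median is preferred over the mean here because orbital features tend to have long tails, and the median is not pulled toward extreme values.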

Using Mode Based Imputation
First of all, we count the occurrences of each value in the condition_code feature.
The snapshot for that is given below.

It can be observed that this feature has mismatched values and that its dtype is object.
So, we handle all the mismatched values by replacing them with a condition_code based on the corresponding data_arc.
We consider data_arc because it is an important feature that helps determine the condition_code (the orbit’s uncertainty parameter).

For the rows with condition_code E, most of the data_arc values were found to be less than 30. So, here we replace all of those values with the integer 9.

Now, for the remaining imputations, use the mode, 0, to fill all the remaining missing values and change the dtype to int.

After imputation, count the values
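The clean-up of condition_code can be sketched roughly as below. This is a simplified illustration, not the project code: it only maps E to 9, unifies values stored as strings and as numbers, and fills the rest with the mode. It does not reproduce the data_arc based replacement of other mismatched values, and any other non-numeric code would simply be treated as missing here. The helper name and the toy series are assumptions.

```python
import numpy as np
import pandas as pd


def clean_condition_code(series):
    """Map 'E' to 9, unify string and numeric codes, fill NaN with the mode, cast to int."""
    # Replace the non-numeric code "E" with 9.
    replaced = pd.Series(
        [9 if value == "E" else value for value in series],
        index=series.index,
        dtype=object,
    )
    # Values such as "0" and 0 become the same number; anything else becomes NaN.
    numeric = pd.to_numeric(replaced, errors="coerce")
    # Fill what is still missing with the most frequent code.
    mode_value = numeric.mode().iloc[0]
    return numeric.fillna(mode_value).astype(int)


# Toy series mixing strings, numbers, "E" and a missing value.
raw = pd.Series(["0", 0, "E", 1.0, np.nan, "0", 2], dtype=object)
codes = clean_condition_code(raw)
print(codes.value_counts().sort_index())
```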

Using ML Models to impute missing values
In order to impute the missing values of data_arc, H and moid, we build an ML model and fill those missing values with the model’s predictions.
First, we find the correlation of the feature that has missing values with the other features and select the features that are more highly correlated with it.
Then we split the data such that all the rows with missing values fall in the test data.
Finally, we train the model and fill the missing values with the predicted outputs.
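The steps above can be sketched with scikit-learn as follows. The text does not say which model was used for imputation, so the random forest here is an assumption, as are the helper name `impute_with_model` and the synthetic data (the relation between H and a is made up purely to give the model something to learn).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_model(df, target, predictors, model=None):
    """Fill missing `target` values with predictions made from `predictors`."""
    if model is None:
        model = RandomForestRegressor(n_estimators=50, random_state=0)
    out = df.copy()
    missing = out[target].isna()
    if not missing.any():
        return out
    # Rows with a known target are the training data;
    # rows with a missing target play the role of the test data.
    model.fit(out.loc[~missing, predictors], out.loc[~missing, target])
    out.loc[missing, target] = model.predict(out.loc[missing, predictors])
    return out


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.uniform(2.0, 3.5, 200)})
toy["per_y"] = toy["a"] ** 1.5
toy["H"] = 20.0 - 2.0 * toy["a"] + rng.normal(0.0, 0.1, 200)
toy.loc[toy.index[:20], "H"] = np.nan

filled = impute_with_model(toy, "H", ["a", "per_y"])
print(filled["H"].isna().sum())
```

The predictor columns must themselves be free of missing values, which is why the simpler median and mode imputations are applied first.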

For data_arc

For absolute magnitude(H)

For moid

The SimpleImputer class of the sklearn library is also used for imputing the albedo feature.

In order to impute missing values in the neo and pha features, we use conditional rules.

1.If q(Perihelion distance)<=1.3 AU, the asteroid is an NEO(Near Earth Object).
2.If moid(Earth Minimum orbit Intersection Distance)<=0.05 AU and H(Absolute Magnitude)<=22, the asteroid is a PHA(Potentially Hazardous Asteroid).
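The two rules can be sketched in pandas as below. The helper name `fill_neo_pha` and the toy rows are assumptions; the thresholds and the Y/N flags come from the rules and the dataset description. Only missing flags are filled, and existing values are left untouched.

```python
import numpy as np
import pandas as pd


def fill_neo_pha(df):
    """Fill missing neo/pha flags from q, moid and H using the two rules."""
    out = df.copy()
    neo_rule = pd.Series(np.where(out["q"] <= 1.3, "Y", "N"), index=out.index)
    pha_rule = pd.Series(
        np.where((out["moid"] <= 0.05) & (out["H"] <= 22), "Y", "N"),
        index=out.index,
    )
    # fillna aligns on the index, so only the missing entries are replaced.
    out["neo"] = out["neo"].fillna(neo_rule)
    out["pha"] = out["pha"].fillna(pha_rule)
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "q": [0.9, 2.1, 1.2, 2.5],
    "moid": [0.02, 1.10, 0.30, 0.04],
    "H": [19.0, 15.0, 21.0, 25.0],
    "neo": [None, None, "Y", "N"],
    "pha": [None, "N", None, None],
})

filled = fill_neo_pha(toy)
print(filled[["neo", "pha"]])
```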

Now, all of our missing value imputations are completed.
It is now time to visualize and plot the features.

1.Univariate :
For this, take a single feature and plot it.
If the feature is numerical, draw a boxplot, a histogram and a distplot, and get a brief description of that feature.
Similarly, this same methodology can be applied to the remaining numerical features.
If the feature is categorical, count how many times each category occurs and use countplots to plot it.

Boxplot

Distplot(Histograms are included too)

Refer to the snapshots below for univariate EDA of neo and condition_code features.

NEO

Orbit Condition Code

2.Bivariate Analysis
For bivariate analysis, we plot and visualize only numerical features.
First, draw the correlation heatmap. Refer to the snapshot below for the heatmap.

If the correlation value is greater than or equal to 0.5, visualize the relationship between the two numerical features using a scatterplot.
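Selecting the feature pairs worth a scatterplot can be sketched as follows. The helper name `correlated_pairs` and the toy columns are assumptions, and so is the use of the absolute correlation, which also catches strong negative relationships; the text only states the 0.5 cutoff.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.5):
    """Return (left, right, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        # Only look at columns after `left` so each pair is reported once.
        for right in cols[i + 1:]:
            r = float(corr.loc[left, right])
            if abs(r) >= threshold:
                pairs.append((left, right, r))
    return pairs


# Toy frame for illustration only.
a_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy = pd.DataFrame({
    "a": a_values,
    "per_y": [a ** 1.5 for a in a_values],               # strongly related to a
    "om": [200.0, 100.0, 200.0, 100.0, 200.0, 100.0],    # weakly related to both
})

pairs = correlated_pairs(toy)
for left, right, r in pairs:
    print(f"{left} vs {right}: r = {r:.3f}")
```

Each returned pair can then be passed to a scatterplot call to inspect the shape of the relationship.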

data_arc vs condition_code

diameter vs Absolute Magnitude(H)

3.Multivariate Analysis
For multivariate analysis, we have used pairplots with the categorical features as hue.
Refer to the snapshot below for the pairplots.

Binning the target variable for PDFs and CDFs, and to find its correlation with categorical variables.
Binning is based on the following conditions:
1.If the diameter is less than or equal to 10 km, then the diameter is binned as small.
2.If the diameter is greater than 10 km and less than or equal to 100 km, then the diameter is binned as medium.
3.If the diameter is greater than 100 km and less than or equal to 500 km, then the diameter is binned as large.
4.If the diameter is greater than 500 km, then the diameter is binned as very large.
5.Otherwise, the diameter is binned as missing if it does not have a numerical value.
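The binning conditions can be written as a small plain-Python function. The function name `bin_diameter` is an assumption, and the 10 to 100 km bin is labeled medium here, which is also an assumed label; the other labels and all cutoffs follow the conditions above.

```python
import math


def bin_diameter(diameter_km):
    """Map a diameter in km to a size label; non-numeric input maps to 'missing'."""
    try:
        value = float(diameter_km)
    except (TypeError, ValueError):
        return "missing"
    if math.isnan(value):
        return "missing"
    if value <= 10:
        return "small"
    if value <= 100:
        return "medium"
    if value <= 500:
        return "large"
    return "very large"


# Sample inputs for illustration, including a missing and a malformed value.
for sample in [0.8, 10, 42.5, 250, 939.4, None, "abc"]:
    print(sample, "->", bin_diameter(sample))
```

With pandas, the same function can be applied to the diameter column through `Series.apply` to produce the binned target.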

Refer to the snapshot below.

Finding the correlation of categorical variables w.r.t binned target variable
We have used the Chi-squared test to check the association between the binned target variable and the categorical features.
The Chi-squared test evaluates a null hypothesis (H0).

1.Assumption(H0): The two columns are NOT related to each other.
2.Result of the Chi-squared test: the p-value, i.e. the probability of seeing an association at least this strong if H0 were true.
3.It helps us understand whether two categorical variables are related to each other or not.

If the p-value is greater than or equal to the significance level (α = 0.005 here), we fail to reject H0 and treat the two categorical variables as unrelated.
Refer to the Snapshots given below.
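A minimal sketch of the test with SciPy is shown below. The helper name `chi_squared_p_value` and the toy frame are assumptions; `pd.crosstab` builds the contingency table and `scipy.stats.chi2_contingency` returns the p-value. The 0.005 significance level is taken from the text.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_squared_p_value(df, feature, target):
    """p-value of the Chi-squared test of independence between two categorical columns."""
    table = pd.crosstab(df[feature], df[target])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value


# Toy frame for illustration only: neo depends on the size bin, pha does not.
toy = pd.DataFrame({
    "neo": ["Y"] * 20 + ["N"] * 20,
    "diameter_bin": ["small"] * 18 + ["medium"] * 2 + ["small"] * 2 + ["medium"] * 18,
    "pha": ["Y", "N"] * 20,
})

alpha = 0.005
for feature in ["neo", "pha"]:
    p = chi_squared_p_value(toy, feature, "diameter_bin")
    verdict = "related" if p < alpha else "not shown to be related"
    print(f"{feature}: p = {p:.4g} ({verdict})")
```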

Plotting PDFs and CDFs
Based on the binned diameter, the PDFs and CDFs of the features are plotted.

PDF

CDF

Code Snippets And Outputs
After imputing missing values and performing EDA, it’s time to build ML models to predict the diameter feature.
First, change the datatype of the diameter feature to float.
Now, build a correlation matrix of all features with respect to the diameter feature.
Refer to the snapshot below.

Now, split the independent and dependent features and perform train-test-validation splitting, stratified on the binned diameter feature.
Refer to the snapshot below.
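One way to obtain a stratified train, validation and test split is to call scikit-learn’s `train_test_split` twice, as sketched below. The helper name, the 60/20/20 proportions, the seed and the toy data are all assumptions; the text only states that the split is stratified on the binned diameter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(X, y, strata, test_size=0.2, val_size=0.2, seed=42):
    """Split into train/validation/test while keeping the `strata` proportions."""
    # First carve out the test set.
    X_rest, X_test, y_rest, y_test, s_rest, _ = train_test_split(
        X, y, strata, test_size=test_size, stratify=strata, random_state=seed
    )
    # val_size is a fraction of the full data, so rescale it for the remainder.
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=relative_val, stratify=s_rest, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data for illustration only.
n = 100
X = pd.DataFrame({"H": np.linspace(10.0, 20.0, n), "albedo": np.linspace(0.05, 0.30, n)})
y = pd.Series(np.linspace(1.0, 50.0, n), name="diameter")
strata = pd.Series(["small"] * 80 + ["medium"] * 20, name="diameter_bin")

X_train, X_val, X_test, y_train, y_val, y_test = stratified_three_way_split(X, y, strata)
print(len(X_train), len(X_val), len(X_test))
```

Stratifying on the binned target keeps the rare large-diameter bins represented in every split, which a purely random split of a skewed target may not do.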

After that, encode the train, test and validation data.

Now, build a Linear Regression model. Before training this model, we need to check its assumptions. In this case, we check for multicollinearity using VIF (Variance Inflation Factor).

1.It is computed by taking a variable and regressing it against every other variable.
2.VIF starts at 1 and has no upper limit.
3.VIF = 1 means there is no correlation between the independent variable and the other variables.
4.VIF between 1 and 5 indicates low multicollinearity between this independent variable and the other variables.
5.VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others.

Remove the features that have a high VIF and check the VIF again.
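The definition in the list above can be sketched directly with NumPy: regress each column on the others, take the resulting R², and compute VIF = 1 / (1 − R²). The function name and the synthetic columns are assumptions, and this is not necessarily how the project computed VIF (libraries such as statsmodels provide an equivalent function).

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit with an intercept term.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        ss_res = np.sum((target - design @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        # 1 / (1 - R^2) with R^2 = 1 - ss_res / ss_tot simplifies to ss_tot / ss_res.
        vifs[j] = ss_tot / ss_res
    return vifs


# Synthetic columns for illustration only.
rng = np.random.default_rng(0)
a = rng.uniform(2.0, 3.5, 200)
per_y = a ** 1.5                    # almost a linear function of a
e = rng.uniform(0.0, 0.3, 200)      # independent of the other two
vifs = variance_inflation_factors(np.column_stack([a, per_y, e]))
print(dict(zip(["a", "per_y", "e"], np.round(vifs, 2))))
```

In this toy data a and per_y get very large VIFs, so one of them would be dropped before refitting, while e stays close to 1.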

Now train the regression models and calculate their performance metrics.
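A rough sketch of comparing several regressors on both metrics with scikit-learn’s `cross_validate` is given below. The helper name `compare_models`, the use of cross-validation, the default hyperparameters and the synthetic data are assumptions. The toy target is generated from the standard relation D = 1329 / sqrt(albedo) × 10^(−0.2 H) only to have something realistic to fit. XGBoost is left out so that the sketch depends on scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_validate


def compare_models(models, X, y, cv=5):
    """Return mean cross-validated R^2 and negative MAE for each model."""
    results = {}
    for name, model in models.items():
        scores = cross_validate(
            model, X, y, cv=cv, scoring=("r2", "neg_mean_absolute_error")
        )
        results[name] = {
            "r2": float(scores["test_r2"].mean()),
            "nmae": float(scores["test_neg_mean_absolute_error"].mean()),
        }
    return results


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
H = rng.uniform(10.0, 18.0, 300)
albedo = rng.uniform(0.05, 0.30, 300)
X = np.column_stack([H, albedo])
y = 1329.0 / np.sqrt(albedo) * 10.0 ** (-0.2 * H)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "elasticnet": ElasticNet(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = compare_models(models, X, y)
for name, metrics in results.items():
    print(f"{name:14s} R2 = {metrics['r2']:.3f}  NMAE = {metrics['nmae']:.3f}")
```

Both scorers follow the convention that higher is better, which is why scikit-learn exposes MAE in its negated form.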

Linear Regression

Ridge Regression

Lasso Regression

ElasticNet Regression

Now, train the ensemble models.
For the ensembles, the original data is used, i.e. the data with all features present.

Random Forest Regression

XGBoost Regression

Checking the Performance of all Models
R Squared

Negative Mean Absolute Error

Conclusion
After looking at the performance of the various models above, it can be observed that XGBoost performs well, so we have used hyperparameter tuning to fine-tune the model further.
The final conclusion is that XGBoost is the best fit for estimating the asteroid diameter.
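Hyperparameter tuning of a boosted model can be sketched with a grid search scored by negative MAE. The text does not say which search method or parameter grid was used, so everything here is an assumption, including the stand-in model: scikit-learn’s `GradientBoostingRegressor` replaces XGBoost to keep the example dependent on scikit-learn alone. The synthetic data is generated from the standard relation D = 1329 / sqrt(albedo) × 10^(−0.2 H).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


def tune_boosted_model(X, y, param_grid, cv=3):
    """Grid-search a gradient boosting regressor, scored by negative MAE."""
    search = GridSearchCV(
        GradientBoostingRegressor(n_estimators=50, random_state=0),
        param_grid=param_grid,
        scoring="neg_mean_absolute_error",
        cv=cv,
    )
    search.fit(X, y)
    return search


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
H = rng.uniform(10.0, 18.0, 200)
albedo = rng.uniform(0.05, 0.30, 200)
X = np.column_stack([H, albedo])
y = 1329.0 / np.sqrt(albedo) * 10.0 ** (-0.2 * H)

param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = tune_boosted_model(X, y, param_grid)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to an XGBoost regressor, since it follows the scikit-learn estimator interface; only the estimator and the parameter names in the grid would change.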

Future Work
1.To get a better estimate of the asteroid diameter, we can also use Artificial Neural Networks.
2.In the case of regression, Artificial Neural Networks predict an output variable as a function of the inputs.
3.The input features (independent variables) can be categorical or numeric; however, for regression with Artificial Neural Networks, we require a numeric dependent variable.

Project Code
You can view all of the project code in my GitHub repository.