Andreas Nikolaidis
February 2022 (Edited: September 2023)
- Introduction
- Exploratory Analysis
- Correlations & Descriptive Statistics
- Principal Component Analysis (PCA)
- Cross Validation & Regression Analysis
- Conclusion
In this project, I will aim to analyze the stats of all Pokemon in Generations 1 - 9, and calculate some statistics based on a number of factors.
In the following sections, I will walk through my process of extracting and analyzing the information using pandas, creating some visualizations, and modeling using scikit-learn.
Start by importing all the necessary packages into Python:
import numpy as np
import pandas as pd
# Viz
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set_style('whitegrid')
%matplotlib inline
# Import for Linear Regression
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
Read Data File:
df = pd.read_excel("pokemon.xlsx")
df.head()
Create a separate dataframe including just the necessary stats:
df_stats = df[["Name","HP","Attack","Defense","SP_Attack","SP_Defense","Speed"]]
Although each stat is important in its own right, the total value of all stats is what determines the category of a pokemon, so let's add a column to the df that sums up the total values:
df['Total'] = df.HP + df.Attack + df.Defense + df.SP_Attack + df.SP_Defense + df.Speed
df.head(3).style.bar(subset=['Total', 'HP', 'Attack', 'Defense', 'SP_Attack', 'SP_Defense', 'Speed'])
#Create a dataframe of just the main stats excluding other 'non important' variables
df_stats = df[["Name","HP","Attack","Defense","SP_Attack","SP_Defense","Speed"]]
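As a small aside, the total can also be computed as a row-wise sum over a list of columns, which avoids repeating each column by hand. This is only a sketch on made-up rows: the `toy` frame and its values are illustrative assumptions, and only the column names are taken from the post.

```python
import pandas as pd

# Toy rows reusing the stat column names from this post (values are made up).
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "HP": [45, 60],
    "Attack": [49, 62],
    "Defense": [49, 63],
    "SP_Attack": [65, 80],
    "SP_Defense": [65, 80],
    "Speed": [45, 60],
})

stat_cols = ["HP", "Attack", "Defense", "SP_Attack", "SP_Defense", "Speed"]

# Row-wise sum over the six stat columns.
toy["Total"] = toy[stat_cols].sum(axis=1)
print(toy[["Name", "Total"]])
```

Keeping the column names in one list also makes it easy to reuse the same selection later, for example for PCA or regression.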
Visuals
Now let's view the range of total stats by each generation:
plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(df, x='Gen', y='Total', scale='width', inner='quartile', palette='pastel')
#palette: https://seaborn.pydata.org/tutorial/color_palettes.html?highlight=color
plt.title('Violin Plot of Total Stats by Generation', fontsize=22)
plt.show()
In the above violinplot we can see that each generation has quite a different range of total stats with Gens IV, VII, & VIII having the longest range, while Gen V had a relatively tight range of stats. All Generations from IV onwards had higher medians than the first 3 generations.
Looking at individual stats, Speed is one of the most important stats (if not THE most important) in competitive play, so let's examine which generations had the best overall speed stats.
plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(df, x='Gen', y='Speed', scale='width', inner='quartile', palette='pastel')
plt.title('Violin Plot of Speed Stats by Generation', fontsize=22)
plt.show()
Here we can clearly see Generation VIII has some of the fastest pokemon ever seen in games. Let's create a function to return the top 10 fastest pokemon in Gen VIII and their respective speed stat values:
def top_n(df, category, n):
    return (df.loc[df['Gen'] == 'VIII'].sort_values(category, ascending=False)[['Name','Gen',category]].head(n))
print('Top 10 Pokemon Speed')
top_n(df, 'Speed', 10)
Those are definitely some fast pokemon!
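The function above is fixed to Gen VIII. A possible generalization, sketched here on made-up rows, takes the generation as a parameter and uses `nlargest`. The name `top_n_by_gen` and the `toy` data are hypothetical and not part of the original analysis.

```python
import pandas as pd

def top_n_by_gen(df, category, n, gen):
    """Return the n rows of generation `gen` with the largest `category` values."""
    subset = df.loc[df["Gen"] == gen, ["Name", "Gen", category]]
    # nlargest returns the rows sorted in descending order of `category`.
    return subset.nlargest(n, category)

# Made-up rows, only to show the call.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Gen": ["VIII", "VIII", "I", "VIII"],
    "Speed": [200, 148, 150, 95],
})

fastest = top_n_by_gen(toy, "Speed", 2, "VIII")
print(fastest)
```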
Let's now see if we can get any indication of whether a particular pokemon's type has an advantage over others in total stats.
#custom colors based on color of types from games
types_color_dict = {
'grass':'#8ED752', 'fire':'#F95643', 'water':'#53AFFE', 'bug':"#C3D221", 'normal':"#BBBDAF", \
'poison': "#AD5CA2", 'electric':"#F8E64E", 'ground':"#F0CA42", 'fairy':"#F9AEFE", \
'fighting':"#A35449", 'psychic':"#FB61B4", 'rock':"#CDBD72", 'ghost':"#7673DA", \
'ice':"#66EBFF", 'dragon':"#8B76FF", 'dark':"#1A1A1A", 'steel':"#C3C1D7", 'flying':"#75A4F9" }
plt.figure(figsize=(30,12), dpi=80)
sns.violinplot(df, x='Primary', y='Total', scale='width', inner='quartile', palette=types_color_dict)
plt.title('Violin Plot of Total Stats by Type', fontsize=15)
plt.show()
The Dragon type definitely has quite a high upper quartile compared to other types, which makes sense as many legendaries are Dragon type. Meanwhile, Water & Fairy types seem to have quite a large variance in total stats.
Let's see what the most common type of pokemon is:
Type1 = df['Primary'].value_counts()
sns.set()
dims = (25,8)
fig, ax=plt.subplots(figsize=dims)
BarT = sns.barplot(x=Type1.index, y=Type1.values, palette=types_color_dict, ax=ax)
BarT.set_xticklabels(BarT.get_xticklabels(), rotation= 90, fontsize=12)
BarT.set(ylabel = 'Frequency')
BarT.set_title('Distribution of Primary Pokemon Types')
##Annotate values
for bar in BarT.patches:
    BarT.annotate(format(bar.get_height(), '.0f'),
                  (bar.get_x() + bar.get_width() / 2,
                   bar.get_height()), ha='center', va='center',
                  size=10, xytext=(0, 8),
                  textcoords='offset points')
FigBar = BarT.get_figure()
We can see that Water and Normal are the most frequently appearing 'primary' types in the game. It is interesting to see Flying as the lowest, but it makes sense when we only look at primary types: the majority of dual-type pokemon with Flying have it as their 'secondary' type. In other words, a pokemon is normally not "Flying/Normal"; it's most commonly "Normal/Flying", for example.
Let's see how many pokemon are mono types vs dual-types so we can get a better sense of whether primary is sufficient.
A simple count would be enough, but let's create a chart:
labels = ['Mono type pokemon', 'Dual type pokemon']
sizes = [monotype, dualtype]
colors = ['lightskyblue', 'lightcoral']
patches, texts, _ = plt.pie(sizes, colors=colors, autopct='%1.1f%%', startangle=90, explode=(0,0.1))
plt.legend(patches, labels, loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.axis('equal')
plt.title('Dual-Type Ratio', fontsize=12)
plt.tight_layout()
plt.show()
Looks like there are actually more dual-type than mono-type pokemon.
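The pie chart uses `monotype` and `dualtype`, whose definitions are not shown in the code above. One way they could be computed is sketched below, under the assumption that a mono-type pokemon has a missing (NaN) value in the `Secondary` column. If the real file encodes "no second type" differently (for example as an empty string), the test would need to change. The `toy` rows are made up.

```python
import pandas as pd

# Made-up rows; a missing Secondary value is assumed to mean "mono type".
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Primary": ["fire", "normal", "water"],
    "Secondary": [None, "flying", None],
})

# Count rows without and with a secondary type.
monotype = int(toy["Secondary"].isna().sum())
dualtype = int(toy["Secondary"].notna().sum())
print(monotype, dualtype)
```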
Aside from types, there are also 5 categories of pokemon: Regular, Pseudo-Legendary, Sub-Legendary, Legendary and Mythical. (There are of course also pre-evolutions, final evolutions, mega-evolutions, etc., but for the purposes of this analysis we will bundle those together under 'regular', along with Pseudo-Legendaries, which are regular pokemon with generally higher stats totaling 600.) As for Sub-Legendaries, Legendaries and Mythicals, these pokemon typically exhibit 2 traits:
- Rarity: There is usually only 1 of those pokemon available in every game (some may not even be obtainable in certain games)
- Stats: These pokemon generally have much higher stats than the average 'regular' pokemon.
Let's create a diverging bar to determine the rate at which legendary pokemon appear in each generation:
#Sub-Legendary, Legendary or Mythical:
df.loc[df["is_sllm"]==False,"sllmid"] = 0
df.loc[df["is_sllm"]==True,"sllmid"] = 1
# calculate proportion of SL, L, M #
sllm_ratio = df.groupby("Gen")["sllmid"].mean()
sllm_ratio.round(4)*100
sns.set_style('darkgrid')
df_plot = pd.DataFrame(columns=["Gen", "Rate", "colors"])
x = sllm_ratio.values
df_plot["Gen"] = sllm_ratio.index
df_plot['Rate'] = (x - x.mean()) / x.std()
df_plot['colors'] = ['red' if x < 0 else 'green' for x in df_plot['Rate']]
df_plot.sort_values('Rate', inplace=True)
df_plot.reset_index(inplace=True)
plt.figure(figsize=(10, 10))
plt.hlines(
    y=df_plot.index, xmin=0, xmax=df_plot.Rate,
    color=df_plot.colors,
    alpha=.4,
    linewidth=5)
plt.gca().set(xlabel='Rate', ylabel='Gen')
plt.yticks(df_plot.index, df_plot.Gen, fontsize=12)
plt.title('Diverging Bars Rate', fontdict={'size': 20})
plt.show()
Seems like Gen 7's Alola region has a huge volume of these 'legendary & mythical' pokemon, which after digging further makes perfect sense given the introduction of a plethora of legendaries called Ultra Beasts, which were only introduced in that generation.
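The 'Rate' plotted above is a z-score of the per-generation proportion. The sketch below reproduces that calculation on made-up rows and points out one detail: because the code standardizes a NumPy array, `x.std()` uses the population formula (`ddof=0`), whereas the pandas `Series.std()` default is the sample formula (`ddof=1`). The `toy` data is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Made-up flags: 1 of 4 in Gen I, 0 of 2 in Gen II, 2 of 4 in Gen III.
toy = pd.DataFrame({
    "Gen": ["I", "I", "I", "I", "II", "II", "III", "III", "III", "III"],
    "is_sllm": [True, False, False, False, False, False, True, True, False, False],
})

# The mean of a boolean column is the proportion of True values.
ratio = toy.groupby("Gen")["is_sllm"].mean()

x = ratio.to_numpy()
z_population = (x - x.mean()) / x.std()          # NumPy default: ddof=0
z_sample = (ratio - ratio.mean()) / ratio.std()  # pandas default: ddof=1

print(ratio)
print(z_population)
print(z_sample)
```

The ordering of the bars is the same either way; only the scale of the x-axis differs slightly.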
Let's move to explore some correlations between stats.
from pandas import plotting
plotting.scatter_matrix(df_stats, figsize=(10, 10))
plt.show()
corrcoef = np.corrcoef(df_stats.iloc[:, 1:7].T.values.tolist())
plt.imshow(corrcoef, interpolation='nearest', cmap=plt.cm.magma)
plt.colorbar(label='correlation coefficient')
tick_marks = np.arange(len(corrcoef))
plt.xticks(tick_marks, df_stats.iloc[:, 1:7].columns, rotation=90)
plt.yticks(tick_marks, df_stats.iloc[:, 1:7].columns)
plt.tight_layout() #clean
###----Correlations
Base_stats = ['Primary','Secondary','Height','Weight','HP','Attack','Defense',
'SP_Attack','SP_Defense','Speed','is_sllm']
df_BS = df[Base_stats]
plt.figure(figsize=(14,12))
heatmap = sns.heatmap(df_BS.corr(numeric_only=True), vmin=-1,vmax=1, annot=True, cmap='Blues')
heatmap.set_title('Correlation Base Stats Heatmap', fontdict={'fontsize':15}, pad=12)
plt.show()
Some other charts showing stat correlations:
Let's take a look at diverging bars based on Attack to see what pokemon type stands out in that specific stat:
attack_byType = df.groupby("Primary")["Attack"].mean()
df_plot2 = pd.DataFrame(columns=["Type","Attack","colors"])
x = attack_byType.values
df_plot2["Type"] = attack_byType.index
df_plot2['Attack'] = (x - x.mean())/x.std()
df_plot2['colors'] = ['red' if x < 0 else 'green' for x in df_plot2['Attack']]
df_plot2.sort_values('Attack', inplace=True)
plt.figure(figsize=(14,14), dpi=80)
plt.hlines(y=df_plot2.Type, xmin=0, xmax=df_plot2.Attack)
for x, y, tex in zip(df_plot2.Attack, df_plot2.Type, df_plot2.Attack):
    t = plt.text(
        x, y, round(tex, 2),
        horizontalalignment='right' if x < 0 else 'left',
        verticalalignment='center',
        fontdict={'color':'red' if x < 0 else 'green', 'size':15})
plt.yticks(df_plot2.Type, df_plot2.Type, fontsize=12)
plt.title('Diverging Text Bars of Attack by Type', fontdict={'size':20})
plt.xlim(-3, 3)
plt.show()
Seems like the FIGHTING type has the highest average Attack
Let's take a look at distribution of stats:
df.describe()
labels = ["Defense", "Attack"]
dims = (11.7, 8.27) #a4
fig, ax = plt.subplots(figsize=dims)
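# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
Defhist = sns.histplot(df['Defense'], color='g', kde=True, stat='density', label='Defense', ax=ax)
Atthist = sns.histplot(df['Attack'], color='r', kde=True, stat='density', label='Attack', ax=ax)
Atthist.set(title='Distribution of Defense & Attack')
plt.legend(loc="best")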
FigHist = Atthist.get_figure()
By calling on the summary statistics, we can see that the assumption about the variance and skewness of both plots was correct. The 'std' metric (standard deviation) of SP_Attack is larger than that of SP_Defense. Skewness can be gauged from the positions of the median (50%) and the mean. Since in all instances (Attack, Defense, SP_Attack and SP_Defense) the mean is greater than the median, the distributions appear to be right-skewed (positively skewed).
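The mean-versus-median comparison is a quick heuristic rather than a strict rule, so it can be worth checking it against a direct skewness estimate. A minimal sketch on made-up values (not the Pokemon data) using `Series.skew()`:

```python
import pandas as pd

# Made-up values with a long right tail.
values = pd.Series([10, 20, 20, 30, 30, 30, 40, 50, 120, 200])

mean = values.mean()
median = values.median()
skew = values.skew()  # sample skewness; positive means right-skewed

print(f"mean={mean}, median={median}, skew={skew:.2f}")
```

On the real data, the same check could be applied to each stat column, for example `df[["Attack", "Defense"]].skew()`.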
Let's take a look at PCA and plot Pokemon in a two-dimensional plane using the first and second principal components. PCA is a type of multivariate analysis method that is often used as a dimensionality reduction method and sometimes regarded as a type of unsupervised ML, revealing the structure of the data itself.
In this data, the characteristics of all Pokemon total stats are represented by 6 types of "observed variables" (x1, x2, x3, x4, x5, x6). (As explained earlier - HP, Attack, Defense, SP Attack, SP Defense & Speed) - these 6 variables are used as explanatory variables. On the other hand, the synthetic variable synthesized by PCA is called "principal component score" and is given by a linear combination as shown in the following equation:
$y_{PC1} = a_{1,1} x_1 + a_{1,2} x_2 + a_{1,3} x_3 + a_{1,4} x_4 + a_{1,5} x_5 + a_{1,6} x_6$
In principal component analysis, the larger the eigenvalue (= variance of the principal component score), the more important the principal component score is.
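To make the two statements above concrete, the sketch below checks them numerically with scikit-learn on synthetic data (random numbers, not the Pokemon stats): the scores returned by PCA equal the centered data multiplied by the weight vectors, and the eigenvalues reported as `explained_variance_` equal the variances of the score columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 6 variables with different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA()
scores = pca.fit_transform(X)

# Each score is a linear combination of the centered variables,
# with the rows of components_ as the weight vectors.
manual_scores = (X - pca.mean_) @ pca.components_.T

# The eigenvalues are the sample variances (ddof=1) of the score columns.
score_variances = scores.var(axis=0, ddof=1)

print(np.allclose(scores, manual_scores))
print(np.allclose(score_variances, pca.explained_variance_))
```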
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df.iloc[:, 7:13])
feature = pca.transform(df.iloc[:, 7:13])
plt.figure(figsize=(8,8))
plt.scatter(feature[:, 0], feature[:, 1], alpha=0.8)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid()
#plt.show()
import matplotlib.ticker as ticker
plt.gca().get_xaxis().set_major_locator(ticker.MaxNLocator(integer=True))
plt.plot([0] + list( np.cumsum(pca.explained_variance_ratio_)), "-o")
plt.xlabel("# PC")
plt.ylabel("Cumulative contribution ratio")
plt.grid()
#plt.show()
Let's see if we can determine what makes a pokemon 'LEGENDARY'
pca = PCA()
pca.fit(df.iloc[:, 7:13])
feature = pca.transform(df.iloc[:, 7:13])
plt.figure(figsize=(8,8))
for binary in [True, False]:
    plt.scatter(feature[df['is_sllm'] == binary, 0], feature[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(loc = 'best')
plt.grid()
plt.show()
Although it's not 'perfect' we can clearly see that when the first principal component (PC1) reaches 50, we start to see a significantly higher concentration of legendary pokemon. Now, let's illustrate how much PC1 actually contributes to the explanatory variable (parameter) with a loading plot.
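The code for the loading plot is not shown here, so the following is only a sketch of how the PC1 weights could be extracted, run on synthetic data rather than the real stats. The `toy` frame is an assumption, and the values it prints say nothing about actual Pokemon. Note that the sign of a principal component is arbitrary, so it is the relative magnitudes that matter.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

stat_cols = ["HP", "Attack", "Defense", "SP_Attack", "SP_Defense", "Speed"]

# Synthetic stats: a shared "overall strength" term plus per-stat noise.
rng = np.random.default_rng(1)
base = rng.normal(70, 20, size=(300, 1))
noise = rng.normal(0, 10, size=(300, 6))
toy = pd.DataFrame(base + noise, columns=stat_cols)

pca = PCA().fit(toy)

# Weights (eigenvector) of the first principal component, one per stat.
pc1_weights = pd.Series(pca.components_[0], index=stat_cols)

# Loadings in the stricter sense: eigenvectors scaled by sqrt(eigenvalue).
loadings = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    index=stat_cols,
    columns=[f"PC{i + 1}" for i in range(len(stat_cols))],
)

print(pc1_weights.abs().sort_values(ascending=False))
```

A bar chart of `pc1_weights`, or a scatter of the `PC1` and `PC2` columns of `loadings`, would give the usual loading plot.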
Assuming that PC1 is actually a strong indicator of whether or not a pokemon is classified as legendary, sub-legendary or mythical, it seems like Attack is one of the strongest indicators out of all stats (followed by Special Attack)
In PCA, we synthesized the "principal component" yPC1 which is a linear combination of the weight matrix (eigenvector) a for the explanatory variables. Here, define as many principal components as there are explanatory variables.
from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=2, max_iter=500)
factors = fa.fit_transform(df.iloc[:, 7:13])
plt.figure(figsize=(12, 12))
for binary in [True, False]:
    plt.scatter(factors[df['is_sllm'] == binary, 0], factors[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.legend(loc = 'best')
plt.grid()
plt.show()
In this instance, the determining factor of a 'legendary' is whether or not the sum of factor 1 and factor 2 exceeds a certain level, but it seems that it is slightly biased toward the larger factor 2. So which parameters do factor 2 and factor 1 allude to?
plt.figure(figsize=(12, 12))
for x, y, name in zip(fa.components_[0], fa.components_[1], df.columns[7:13]):
    plt.text(x, y, name)
plt.scatter(fa.components_[0], fa.components_[1])
plt.grid()
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.show()
Interestingly, it seems that BOTH factor 1 & 2 allude to DEFENSE
X = df.iloc[:, 7:13]
y = df['Total']
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)
print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(X, y))
Let's see if we can predict the "Defense" stat with these 4 variables
X = df.iloc[:, [7, 10, 11, 12]]
y = df['Defense']
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)
print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(X, y))
Defense = 0.16 * HP - 0.04 * SP_Attack + 0.54 * SP_Defense - 0.11 * Speed + 33.4 (intercept)
That's the relationship to predict Defense, which makes sense for the most part, as Special Defense typically correlates with Defense in the games. HP being low/neutral is a bit surprising. However, the coefficient of determination is so small that this may not be a reliable model.
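For reference, the coefficient of determination printed by `regr.score` is R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data with a deliberately weak signal (random numbers, not the Pokemon stats) and compares it with scikit-learn's value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: only one of four predictors carries a weak signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 0.5 * X[:, 2] + rng.normal(scale=2.0, size=100)

regr = LinearRegression().fit(X, y)
pred = regr.predict(X)

ss_res = np.sum((y - pred) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, regr.score(X, y))
```

A value close to 0 means the model explains little more than simply predicting the mean would.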
Since we saw earlier that Defense is a huge contributing factor to determining whether a pokemon is classified as 'legendary', let's use the rest of the stats to see if we can predict Defense and which stats are best contributors for determining this stat.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination(train)= ", regr.score(X_train, y_train))
print("Coefficient of Determination(test)= ", regr.score(X_test, y_test))
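A single train/test split gives one estimate that depends on how the rows happened to be divided. As a complement, k-fold cross validation averages over several splits. The sketch below uses `KFold` and `cross_val_score` from scikit-learn on synthetic data. The data, the 5 folds, and the fixed `random_state` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for four predictors and one target.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ np.array([0.2, 0.0, 0.6, -0.1]) + rng.normal(scale=0.5, size=120)

# 5 shuffled folds; each fold is used once as the held-out test set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))
```

A large gap between folds, or between train and test scores, would suggest the model is unstable or overfitting.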
Xs = X.apply(lambda x: (x-x.mean())/x.std(), axis=0)
ys = (y - y.mean()) / y.std()
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(Xs, ys)
print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(Xs, ys))
pd.DataFrame(regr.coef_, index=list(df.columns[[7, 10, 11, 12]])).sort_values(0, ascending=False).style.bar(subset=[0])
It seems that Special Defense is the best predictor of "Defense". This makes sense, as a pokemon with high Defense is usually also focused on Special Defense. These pokemon are also typically much slower, which explains why the 'Speed' coefficient is negative. Additionally, most 'non-defensive' pokemon can be knocked out in one hit (OHKO'd) by a strong move, which is why Speed plays a huge role for those pokemon: the faster you are, the higher the chance that you will attack first and survive. (Pretty basic for most Pokemon players >.>)
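Standardizing before fitting is what makes the coefficients comparable across stats. A standardized coefficient equals the raw coefficient multiplied by the predictor's standard deviation and divided by the target's standard deviation. The sketch below checks that relationship on synthetic data. The numbers are made up and only the column names are borrowed from the post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors and target (not the real Pokemon data).
rng = np.random.default_rng(4)
X = pd.DataFrame({
    "HP": rng.normal(70, 25, 200),
    "SP_Defense": rng.normal(70, 28, 200),
    "Speed": rng.normal(65, 30, 200),
})
y = 0.1 * X["HP"] + 0.5 * X["SP_Defense"] - 0.1 * X["Speed"] + rng.normal(0, 20, 200)

# Fit on raw values, then on z-scored values.
raw = LinearRegression().fit(X, y)
Xs = (X - X.mean()) / X.std()
ys = (y - y.mean()) / y.std()
std = LinearRegression().fit(Xs, ys)

# Standardized coefficient = raw coefficient * sd(x) / sd(y).
expected = raw.coef_ * X.std().to_numpy() / y.std()

print(pd.Series(std.coef_, index=X.columns).sort_values(ascending=False))
```

After standardization the intercept is effectively zero, and each coefficient reads as "standard deviations of the target per standard deviation of the predictor".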
Overall this was a way of exploring different pokemon traits and taking into account multiple factors. There's PLENTY more we can look into, such as 'strengths', 'weaknesses', etc., which I may do at some point, but I hope you all enjoyed this, and thanks for reading all the way through! If you spot any errors or have any suggestions for visuals or further analysis, please feel free to drop a message!