Data Analysis 101: My beginner project.

What we'll be doing today.

We will follow these steps on the given dataset:

  1. Making sense of the data.
  2. Checking for any dirty data.
  3. Visualizing the clean data.
  4. Drawing a final conclusion.

Step 1: Understanding the Data

  • First, we will load the data
  • Then we will sketch a rough flowchart of the insights we can get from the data
  • Finally, we will move on to the next step

# First of all we'll load all the libraries

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

Then we'll go ahead and read the source file from which we'll be getting the data.

You can download the .csv file from here - github.com/realharshbhatt/Pythonproject/blo..

df=pd.read_csv('/Projects/Stores/Stores.csv')
df.head()

(Note - change the path to where you've stored the .csv file.)

Store ID     Store_Area     Items_Available     Daily_Customer_Count     Store_Sales
0     1     1659     1961     530     66490
1     2     1461     1752     210     39820
2     3     1340     1609     720     54010
3     4     1451     1748     620     53730
4     5     1770     2111     450     46620
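
Along with head(), a quick statistical summary helps with making sense of the data. Here's a small optional sketch (not part of the original steps):

# How many rows and columns are we working with?
df.shape

# Summary statistics (count, mean, std, min, quartiles, max) for each numeric column
df.describe()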

Step 2: Data Cleaning

To find missing and incorrect values, we generally use three functions:

  • df.isnull().sum(): gives you the total number of NULL values in each column
  • df.info(): very useful; gives you information such as the non-null count and data type of each column
  • df['column'].unique(): gives you all the unique values of a column

df.isnull().sum()

The output we get looks something like this:

Store ID                0
Store_Area              0
Items_Available         0
Daily_Customer_Count    0
Store_Sales             0
dtype: int64
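
Here every column shows 0 missing values, so nothing needs fixing. If there were missing values, a couple of common ways of handling them look like this (a minimal sketch; 'Some_Column' is a hypothetical placeholder, not a column in this dataset):

# Option 1: drop any row that contains a missing value
df = df.dropna()

# Option 2: fill missing values in a specific column with that column's mean
# ('Some_Column' is a placeholder name, not part of this dataset)
df['Some_Column'] = df['Some_Column'].fillna(df['Some_Column'].mean())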

For more info about the columns, we will use the info() function.

df.info()

We can see that all the columns are of integer type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 896 entries, 0 to 895
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Store ID              896 non-null    int64
 1   Store_Area            896 non-null    int64
 2   Items_Available       896 non-null    int64
 3   Daily_Customer_Count  896 non-null    int64
 4   Store_Sales           896 non-null    int64
dtypes: int64(5)
memory usage: 35.1 KB

Cleaning the individual columns.

We will use the unique() function to find all the unique values in the Store_Area column.

df.Store_Area.unique()

Below we can see all the unique values in the column.

array([1659, 1461, 1340, 1451, 1770, 1442, 1542, 1261, 1090, 1030, 1187,
       1751, 1746, 1615, 1469, 1644, 1578, 1703, 1438, 1940, 1421, 1458,
       1719, 1449, 1234, 1732, 1475, 1390, 1642, 1715, 1439, 1250, 1331,
       1784, 1375, 1871, 1174, 1839, 1270, 1435,  965, 1665, 1780, 1009,
       1227, 1769, 1660, 1472, 1408, 1514, 1565, 1074, 1864, 1570, 1417,
       1734, 1470, 1761, 1756, 1704, 2011, 1310, 1544, 1707, 1881, 1416,
       1631, 1318, 1692, 1152,  891, 1468, 1539, 1635, 1267, 1720, 1462,
       1431, 1441, 1572, 1287, 1931, 1252, 1238, 1479, 1590, 2169, 1838,
       1385, 1921, 1975, 1853, 1816, 1785, 1579, 1096, 1919, 1262, 1374,
       1309, 1207, 1929, 1573, 1415, 1162, 1485, 1897, 1607, 1909, 1274,
       1157, 1712, 1500, 1682, 1525, 1947, 1164, 1787, 1718, 1365, 1368,
       1342, 1076, 1396, 1713, 1370, 1667, 1638, 1581, 1795, 1179, 1978,
       1688, 1214, 1504, 1498, 1229, 1936, 1369, 1662, 1548, 1649, 1393,
       1450, 1613,  775, 1275, 1740, 1372, 1414, 2044, 1823,  955, 1465,
       1232, 1481, 1343, 1007, 1762, 1527, 1356, 1536, 1605, 1626, 1612,
       1923, 1702, 1398, 1437, 1524, 1302, 1666, 1391, 1778, 1652, 1841,
       1496, 1148, 1321,  992, 1540, 1807, 1526, 1406, 1575, 1488, 1658,
       1863, 1604, 1620, 1251, 1647, 1829, 1852, 1699, 1325, 1350, 1347,
       1397, 1245, 1366, 1378, 1767, 1184, 1257, 1329, 1557, 2007, 1185,
       1657, 1294, 1296, 1733, 1641, 1373, 1550, 1583, 1428, 1648, 1025,
       2001, 1145,  913, 1199, 1875, 1153, 1240, 1381, 1701, 1206, 1476,
       1189, 1837, 1319, 1617, 1517, 1764, 1855, 1327, 1533, 1856, 1000,
       1313, 1494, 1386, 1979, 1057,  902, 1314, 1513, 1305, 1180, 1142,
       1471, 1075, 1585, 1577, 1092, 1523, 1614, 1566, 2019, 1766, 1293,
       1236, 1332, 1208, 1190, 1448, 1771, 1510, 1808, 1640, 1060, 1633,
       1222, 1619, 1624, 1887, 1320, 1455,  966, 1922, 1678, 2004, 1954,
       1362, 1886, 1291, 1584, 1445, 1433, 1269, 1798, 1015, 1495, 1759,
       1219, 1571, 1404, 1124, 1484, 1111, 1078, 1876, 1412, 1121, 1691,
       1599, 1454, 1555, 1554, 1491, 1487, 1339, 1509, 1264, 1905, 1209,
       1546, 1689, 2063, 1848, 1480, 1576, 1948,  896, 1625, 1303, 1410,
       1432, 1891, 1322, 1440, 1280, 1161, 1175, 1395, 1443, 1247, 1788,
       1138, 1709, 1777, 1618, 1311, 1249, 1744, 1297, 1908, 1721, 1243,
       1307, 1628, 1556, 1768, 1685, 1474, 1794, 1086, 1501, 1353, 1165,
       1845, 1172, 1436, 1738, 2229, 1490, 2015, 1611, 1430, 1543, 1836,
       1463, 1213, 1244, 1745,  933, 1675, 1842, 1316, 1760, 1323, 1230,
       1167, 1429, 1609, 1109, 1191, 1335, 1382, 1588, 1918, 1334, 1832,
       1271, 1735, 1627, 1351, 1520, 1537, 1622, 1001, 1857, 1552, 1700,
       1482, 1192, 1601, 2214, 1141,  869, 2049, 1883, 1723,  932, 1137,
       1231, 1237, 1371, 1748, 1212, 1466, 2026, 1772, 1511, 1834, 1143,
       1586, 1860, 1424, 1710, 1553, 1814, 1847,  946, 1563, 1664, 1241,
       1690, 1289, 1608, 1344, 1529, 1677, 1426, 1503, 1971, 1383, 1508,
       1411, 1545, 1505, 1422, 1564, 1568, 1933, 1593, 1812, 1580, 1100,
       1534, 1299, 1989, 1809, 1193, 1348, 1132, 1427, 1211, 1717, 1068,
       1755, 1591, 1797, 1519, 1637, 1203, 1884, 1160, 1317, 1560, 1645,
       1225, 1655, 1872, 1434, 1791, 1693, 1502, 1360, 1246, 1990,  967,
       1582, 1827, 1118, 1799, 1493, 1033, 1854, 1937, 1743, 1098, 1358,
       1336, 1418,  854, 1361, 1085, 1453, 1389, 1486, 1464, 1197, 1878,
       1150, 1562, 1596, 1119, 1116, 1934, 1235, 1497, 1444, 1477, 1041,
       1226, 1489, 1549, 1073, 1473, 2013, 1561, 1574, 1388, 1349, 1862,
       1200, 1722, 1850, 1447, 1679,  994, 1354, 1597, 1873, 1218, 1822,
       1716, 1072, 1330, 1425, 1363, 1671, 1379, 1724,  986, 1651, 1916,
       1528, 1409, 1419, 1535,  919, 2067, 1977, 1653, 1939, 1846,  780,
       1775, 1800, 1859, 1259, 1205, 1792, 1263, 1819, 1357, 1156, 1387],
      dtype=int64)
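
While we're looking at the values, another quick check that's often done at this stage (not part of the original steps) is counting duplicate rows:

# Number of fully duplicated rows in the dataframe (0 means no duplicates)
df.duplicated().sum()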

Nothing looks out of place, so we can move ahead and create some visualizations. But before we move forward, if you'd like to know more about data cleaning, you can go through the articles I've attached below.

  1. Complete Guide to Data Cleaning with Python - (medium.com/analytics-vidhya/complete-guide-..)

  2. Data Cleaning with Python - (medium.com/bitgrit-data-science-publication..)

Ok then, let's get moving, shall we?

Step 3: Data Visualization

Let's begin with getting a glimpse of how many columns we have to work with.

df.columns

This will show us the names of all the columns in our dataset.

Index(['Store ID ', 'Store_Area', 'Items_Available', 'Daily_Customer_Count',
       'Store_Sales'],
      dtype='object')
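
Notice the trailing space in 'Store ID ' above. If you want, you can strip whitespace from the column names now; just remember that the drop() call further down would then need 'Store ID' without the trailing space. A small optional sketch:

# Strip leading/trailing whitespace from every column name,
# so 'Store ID ' becomes 'Store ID'
df.columns = df.columns.str.strip()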

Now let's make our 1st viz. We'll try to find the relation between the store area and the number of items available in the store.

# Scatter plot of store area against the number of items available
plt.figure(figsize=(6,6))
sns.scatterplot(x='Items_Available', y='Store_Area', data=df)
plt.show()

Here we can see that there's a clear correlation between the store area and the number of items available.
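
If you'd like to put a number on that relationship instead of just eyeballing the scatter plot, pandas can compute the correlation coefficient directly (a small optional sketch):

# Pearson correlation between store area and items available;
# a value close to 1 means a strong positive linear relationship
df['Store_Area'].corr(df['Items_Available'])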

1st viz.png

You can create various types of graphs using the seaborn library; some of those graph types are listed below, and a short sketch of a couple of them follows the list:

  • Heatmap
  • Histogram
  • Bar Plot
  • CatPlot (Categorical Plot)
  • Density Plot
  • Joint Distribution Plot
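
For example, a heatmap of the correlation matrix and a histogram of the daily customer count could be built roughly like this (a quick sketch using the same dataframe; the styling choices are just examples):

# Heatmap of correlations between all numeric columns
plt.figure(figsize=(6,6))
sns.heatmap(df.corr(), annot=True, cmap='Blues')
plt.show()

# Histogram of the daily customer count
sns.histplot(df['Daily_Customer_Count'], bins=30)
plt.show()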

To know more about types of graphs, you can go through this article:- (towardsdatascience.com/14-data-visualizatio..)

To plot multiple graphs at once, we use a pairplot.

# Use a copy so dropping the ID column doesn't also modify the original df
df1 = df.copy()
df1.drop(['Store ID '], axis=1, inplace=True)

# pairplot creates its own figure grid, so a separate figsize call isn't needed
sns.pairplot(df1)
plt.show()

From the graphs shown below, we can tell that there is no direct relation between any of the other parameters; only store area and items available are clearly related.

2nd viz.png

If you want to learn more about creating visualizations with seaborn, you can check out this article, which explains viz creation with seaborn in detail using the Pokemon dataset:- (elitedatascience.com/python-seaborn-tutorial)

This was a basic data analysis project I made to help me get familiar with the basics.

Thank You!