What I learned from the 10-hour data science course - Data Analysis with Python Course

I am watching the 10-hour Data Analysis with Python course from freeCodeCamp, https://www.youtube.com/watch?v=GPVsHOlRBBI, and I want to record what I learn from it.

Conda

Conda is an open-source package and environment manager that is widely used with Python. If you have two Python projects, each requiring a different version of Python and different packages, Conda is for you. Conda can create, save, load, and switch between environments on your local computer, so you can work on both projects without having to reinstall Python every time you switch.

Jupyter Notebook

A Jupyter notebook is a document where you can write notes and run code. It is useful for data science because code is organized in cells, so you can run individual cells one at a time. Variables keep their contents after a cell runs, so you can reuse them without rerunning the entire program. Without a Jupyter notebook, every change I made meant rerunning the whole program because nothing stayed in memory, which was very inefficient. From now on I will use a Jupyter notebook and be more efficient.

It also helps me with writing my blog, since its structure is very similar.

Common keyboard shortcuts in Jupyter Notebook

Keyboard shortcuts speed things up a lot, so before starting I looked up some useful ones from this article:

Shift + Enter: run the current cell, select below
Alt + Enter: run the current cell, insert below
Ctrl + S: save and checkpoint

There are two modes. The first is command mode, the default when you have just loaded the notebook and are not editing a cell. In command mode you can use:

A: insert cell above
B: insert cell below
D, D (press the key twice): delete selected cells
Z: undo cell deletion
Y: change the cell type to Code
M: change the cell type to Markdown
Enter: switch to edit mode

The second is edit mode, where you edit the contents of a cell:

Esc: switch back to command mode
Tab: code completion or indent
Shift + Tab: tooltip

Numpy

np.genfromtxt() can read a csv file and return a numpy array
np.savetxt() can store a numpy array to a csv file
commonly used functions include:
Mathematics: np.sum, np.exp, np.round, arithmetic operators
Array manipulation: np.reshape, np.stack, np.concatenate, np.split
Linear Algebra: np.matmul, np.dot, np.transpose, np.linalg.eigvals
Statistics: np.mean, np.median, np.std, np.max
numpy supports array broadcasting, which allows arithmetic operations between two arrays with different numbers of dimensions but compatible shapes
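To make these notes concrete, here is a minimal sketch that writes an array to a csv file, reads it back, and then uses broadcasting. The numbers, the file name readings.csv, and the column names are made up for illustration; they are not from the course.

```python
import os
import tempfile

import numpy as np

# Hypothetical readings: one row per day, columns are temperature and rainfall.
readings = np.array([[25.0, 10.0],
                     [30.0, 20.0],
                     [35.0, 30.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "readings.csv")
    # savetxt writes the array to a csv file; comments="" keeps the header line plain.
    np.savetxt(csv_path, readings, delimiter=",",
               header="temperature,rainfall", comments="")
    # genfromtxt reads the csv back into an array, skipping the header row.
    loaded = np.genfromtxt(csv_path, delimiter=",", skip_header=1)

# Broadcasting: the (2,) weights array is stretched across all 3 rows of the (3, 2) array.
weights = np.array([0.5, 0.1])
weighted = loaded * weights
totals = weighted.sum(axis=1)  # one total per row

# The eigenvalue helper lives in the linalg submodule.
eigenvalues = np.linalg.eigvals(np.array([[2.0, 0.0],
                                          [0.0, 3.0]]))

print(loaded.shape)
print(totals)
```

Broadcasting works here because the trailing dimension of both arrays is 2. A weights array of length 3 would raise an error instead.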

Pandas

pandas' main data type is DataFrame, which is like an Excel spreadsheet
we can use the .info() method to view the basic information about the dataframe
we can use the .describe() method to see statistical information about the numeric data within the dataframe
another important data type is Series, which is like a labelled array. You can use the .index attribute to get its index labels, which is very useful when you want to plot graphs of a series
we can pass in a list of columns to select part of a dataframe, like reduced_df = df[['column1', 'column3']], then the resulting dataframe will only have the two columns. pandas does not guarantee whether such a selection is a view or a copy of the original, and modifying it can trigger a SettingWithCopyWarning, so use .copy() when you want an independent dataframe that is safe to modify
to view data in a dataframe, we can use .head() to show the first few rows, .tail() to show the last few rows, and .sample() to show a random row
we can sort a dataframe by value with .sort_values(), passing in the column name, for example covid_df.sort_values('new_cases')
we can convert data to date using the pd.to_datetime() function
we can use the .groupby() method, passing in a column name, to group the data, then select some columns and use .sum() or .mean() to calculate the value for each group
we can also merge dataframes by calling .merge() on one dataframe, passing in the other dataframe and the column to join on with the on argument
to write back to csv, we use .to_csv() passing in the file name
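The pandas notes above can be sketched on a tiny example. The two dataframes, their column names, and the values are hypothetical and invented for illustration; the course uses its own dataset.

```python
import pandas as pd

# Hypothetical daily figures, made up for illustration only.
cases_df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-04-01", "2020-04-02"],
    "location_id": [1, 1, 2, 2],
    "new_cases": [10, 20, 30, 40],
})
locations_df = pd.DataFrame({
    "location_id": [1, 2],
    "location": ["North", "South"],
})

# Convert the text column to real datetimes, then derive a month column.
cases_df["date"] = pd.to_datetime(cases_df["date"])
cases_df["month"] = cases_df["date"].dt.month

# Select two columns and copy them, so edits cannot touch cases_df.
reduced_df = cases_df[["date", "new_cases"]].copy()
reduced_df.loc[0, "new_cases"] = 0

# Group by month and total the new cases in each group.
monthly_cases = cases_df.groupby("month")["new_cases"].sum()

# Attach the location names by matching on the shared column.
merged_df = cases_df.merge(locations_df, on="location_id")

# Sort so the largest daily count comes first.
sorted_df = merged_df.sort_values("new_cases", ascending=False)

# With no file name, to_csv returns the csv text instead of writing a file.
csv_text = sorted_df.to_csv(index=False)

print(monthly_cases)
print(sorted_df.head())
```

Passing a file name, as in sorted_df.to_csv('results.csv', index=False), writes to disk instead; index=False leaves out the row numbers.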

Matplotlib and Seaborn

in a Jupyter notebook, we can use %matplotlib inline after importing matplotlib to ensure that our plots are shown and embedded within the Jupyter notebook itself
the most basic chart is a line chart, which can be plotted with plt.plot()
we can add labels to the chart with plt.xlabel() and plt.ylabel(), and legend with plt.legend()
in a notebook, the plot is only rendered once the cell finishes running, which is why we can call plt.plot() first, then change things like labels afterwards, and the plot is still shown with the correct data and labels
we can even plot multiple lines on the same graph
seaborn is a statistical graphics library built on matplotlib, and is commonly imported as sns. According to this StackOverflow post, the alias comes from the initials of Samuel Norman "Sam" Seaborn, a fictional character portrayed by Rob Lowe on the television serial drama 'The West Wing'. My guess is the creator loved this TV show.
we can plot a scatterplot using sns.scatterplot() passing in the x and y, and optionally a hue
we can plot a histogram with plt.hist(), and we can customize the bins, which can even be unevenly sized if we want
we can stack histograms by passing in stacked=True
we can plot a bar chart with plt.bar()
to stop labels from overlapping, we can tilt them by plt.xticks(rotation=75)
we can plot a heat map using sns.heatmap(). A heat map is a good way to visualize 2D data
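Here is a minimal sketch of the matplotlib calls from these notes: two lines on one chart with labels, a legend and tilted ticks, then a histogram with uneven bins. The data is made up for illustration. The Agg backend is my assumption so the sketch also runs outside a notebook; inside Jupyter you would use %matplotlib inline instead. I left out the seaborn calls so that the sketch only depends on matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; in a notebook use %matplotlib inline instead
import matplotlib.pyplot as plt

# Hypothetical yields per year, invented for illustration.
years = [2010, 2011, 2012, 2013, 2014]
apples = [0.90, 0.91, 0.93, 0.97, 0.96]
oranges = [0.96, 0.94, 0.94, 0.93, 0.92]

# Line chart: plot first, then add labels and legend before the figure is rendered.
line_fig = plt.figure()
plt.plot(years, apples, label="apples")
plt.plot(years, oranges, label="oranges")  # a second line on the same axes
plt.xlabel("Year")
plt.ylabel("Yield")
plt.legend()
plt.xticks(years, rotation=75)  # tilt the tick labels so they do not overlap

# Histogram with uneven bins: [0, 1), [1, 2), [2, 5].
values = [0.5, 1.5, 1.7, 2.5, 3.0, 4.5]
hist_fig = plt.figure()
counts, bin_edges, _ = plt.hist(values, bins=[0, 1, 2, 5])

print(counts)
```

plt.hist() returns the counts per bin, which is a quick way to check that custom bin edges do what you expect.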

Basic principle

When starting data analysis, it is important to understand the data: what was collected, how it was collected, and how accurate it is.
Then we do some data preparation and cleaning, to get the data ready for analysis and remove unwanted data.
Then we do some exploratory analysis and visualization. This is when we don't have any specific question in mind, but just poke around and see what the data is like.
Finally, we try to answer questions with the data.
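The four steps above can be sketched with pandas on a toy dataset. The survey data, the column names, and the cleaning rule (dropping missing or non-positive ages) are my own assumptions for illustration, not something taken from the course.

```python
import pandas as pd

# Hypothetical survey data, invented for illustration.
raw_df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "age": [25, None, 31, -4, 40],
    "hours_coding": [10, 12, 8, 9, 20],
})

# 1. Understand the data: size, types, and summary statistics.
raw_df.info()
summary = raw_df.describe()

# 2. Prepare and clean: drop rows with missing or impossible ages.
clean_df = raw_df.dropna(subset=["age"])
clean_df = clean_df[clean_df["age"] > 0].copy()

# 3. Explore: look at how a column varies across groups.
hours_by_city = clean_df.groupby("city")["hours_coding"].mean()

# 4. Answer a question: which city codes the most hours on average?
top_city = hours_by_city.idxmax()

print(hours_by_city)
print(top_city)
```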

Conclusion

I would like to thank the platform jovian.ai for partnering with freecodecamp.org to give us this free course, along with a platform for learning by doing, where we can use the same Jupyter notebooks the instructor used. I also want to thank the instructor, Aakash N S, for the teaching and the material. He included practical examples throughout, which made the content easier to understand.

After this course, I feel confident continuing my data analysis project, which can be found on my Hashnode blog.
