Data science can be hard to navigate, especially when you are just beginning your journey. There are many new concepts and terms to learn, and most of them come up again and again across data science and machine learning. In this post, we’ll go through some of these foundational concepts one at a time to help you start your data science journey with confidence.

Tensor

In mathematics, a tensor generalizes scalars, vectors, and matrices, and can be represented as an array with ’n’ dimensions. The number of dimensions (axes) of a tensor is often called its ‘rank’ or ‘order’. This is a different notion from the rank of a matrix in linear algebra.

import numpy as np

# A scalar (rank-0 tensor)
scalar = np.array(5)

# A vector (rank-1 tensor)
vector = np.array([1, 2, 3, 4])

# A matrix (rank-2 tensor)
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Higher order tensor (rank-3 tensor)
tensor = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
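To check the rank of an array in practice, NumPy exposes the number of axes as `ndim` and the size along each axis as `shape`. The sketch below assumes NumPy is installed and redefines the four example arrays so that it runs on its own; the `examples` dictionary is just a convenience for this illustration.

```python
import numpy as np

# The same four example tensors, redefined so this snippet is self-contained.
examples = {
    "scalar": np.array(5),
    "vector": np.array([1, 2, 3, 4]),
    "matrix": np.array([[1, 2, 3], [4, 5, 6]]),
    "tensor": np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
}

for name, array in examples.items():
    # ndim is the number of axes (the rank); shape is the size along each axis.
    print(f"{name}: rank={array.ndim}, shape={array.shape}")
```

Note that the rank-3 tensor has shape `(2, 2, 3)`: two blocks, each with two rows of three values.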

Features

In Machine Learning, features are individual measurable properties or characteristics of a phenomenon being observed.

# For example, in a dataset of houses for sale, the features might include:
features = ["Area", "Bedrooms", "Bathrooms", "Location"]
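To make this concrete, here is a small sketch of how feature names relate to the data itself: each house becomes one row of values, with one column per feature. The two listings and all of their values are hypothetical and exist only for illustration.

```python
# Hypothetical listings; every value here is made up for illustration.
houses = [
    {"Area": 120, "Bedrooms": 3, "Bathrooms": 2, "Location": "Suburb"},
    {"Area": 75, "Bedrooms": 2, "Bathrooms": 1, "Location": "City center"},
]
features = ["Area", "Bedrooms", "Bathrooms", "Location"]

# One row per house, one column per feature, always in the same column order.
feature_rows = [[house[name] for name in features] for house in houses]

for row in feature_rows:
    print(row)
```

Keeping the column order fixed matters, because a model only sees positions in the row, not the feature names.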

Data Splits

Data in Machine Learning is typically split into three sets: a training set, a development (or validation) set, and a test set. The purpose of each set is as follows:

  • Training set: Used to train the machine learning model.
  • Development set: Also called Hold-out set or Validation set, used to tune hyper-parameters and select features.
  • Test set: Used to evaluate the final performance of the model.
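The three sets above can be produced by shuffling the data once and slicing it. The following is a minimal sketch using only the standard library. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions chosen for illustration, not a recommended standard.

```python
import random

def split_dataset(rows, train_pct=70, dev_pct=15, seed=0):
    """Shuffle rows, then cut them into training, development, and test lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train_end = len(shuffled) * train_pct // 100
    dev_end = train_end + len(shuffled) * dev_pct // 100
    # Whatever remains after the training and development slices is the test set.
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]

rows = list(range(100))  # stand-in for 100 examples
train, dev, test = split_dataset(rows)
print(len(train), len(dev), len(test))
```

Shuffling before slicing avoids a split that accidentally follows the order in which the data was collected. Every example lands in exactly one of the three sets.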

Regression vs Classification

Regression and Classification are both types of Supervised Learning in Machine Learning. The main difference lies in the type of output each produces. Regression predicts a continuous outcome variable (or dependent variable), while Classification predicts a categorical outcome variable. For example, predicting a house’s sale price is a regression problem, while predicting whether it will sell within a month is a classification problem.
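The difference in output type is easiest to see side by side. The sketch below fits a straight line by least squares (regression) and copies the label of the closest training point (a 1-nearest-neighbor classifier) on the same input. The helper names, the toy data, and the choice of these two particular methods are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least squares fit for one input variable: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def nearest_neighbor_label(x, xs, labels):
    """1-nearest-neighbor: return the label of the closest training point."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return labels[closest]

# Hypothetical toy data: house areas with two different kinds of target.
areas = [50, 80, 100, 120]
prices = [150, 240, 300, 360]                      # continuous target
size_class = ["small", "small", "large", "large"]  # categorical target

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 95 + intercept                         # a number
predicted_class = nearest_neighbor_label(95, areas, size_class)  # a label
print(predicted_price, predicted_class)
```

The input is the same in both cases; only the kind of target, and therefore the kind of prediction, changes.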

Categorical vs Continuous Variables

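Variables in data science can be broadly classified as categorical or continuous. A categorical variable takes one of a limited set of labels, which may be unordered (nominal) or ordered (ordinal). A continuous variable can take any numeric value within a range.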

# An example of a nominal categorical variable (no inherent order)
car_colors = ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Green', 'Red', 'Blue']

# An example of an ordinal categorical variable (ordered: not at all < somewhat < very)
customer_satisfaction = ['not at all', 'somewhat', 'somewhat', 'very', 'not at all', 'very']

# An example of a continuous variable
temperatures = [20.5, 25.3, 23.1, 26.7, 22.5, 24.2]
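The type of a variable determines which summaries make sense for it. As a rough sketch using only the standard library: nominal values can be counted, ordinal values can also be ranked, and continuous values can be averaged. The `satisfaction_rank` mapping is an assumption about how the three answers are ordered.

```python
from collections import Counter
from statistics import mean, median_low

car_colors = ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Green', 'Red', 'Blue']
customer_satisfaction = ['not at all', 'somewhat', 'somewhat', 'very', 'not at all', 'very']
temperatures = [20.5, 25.3, 23.1, 26.7, 22.5, 24.2]

# Nominal: counting is meaningful, ordering and averaging are not.
color_counts = Counter(car_colors)

# Ordinal: map each answer to its position in the assumed order, then take a median.
satisfaction_rank = {'not at all': 0, 'somewhat': 1, 'very': 2}
rank_to_answer = {rank: answer for answer, rank in satisfaction_rank.items()}
median_answer = rank_to_answer[median_low(satisfaction_rank[a] for a in customer_satisfaction)]

# Continuous: arithmetic such as the mean is meaningful.
mean_temperature = mean(temperatures)

print(color_counts, median_answer, round(mean_temperature, 2))
```

`median_low` is used so that the result is always one of the actual answers, even when the number of responses is even.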

Supervised, Unsupervised and Reinforcement Learning

These are the three main categories of machine learning:

  • Supervised learning: The model learns from labeled training data, and aims to predict labels for new, unseen data.
  • Unsupervised learning: The model learns from unlabeled data and discovers patterns or structures in the input data.
  • Reinforcement learning: The model (usually called an agent) learns which actions to take from experience. It chooses each action based on the state of its environment, aiming to maximize some notion of cumulative reward.
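Of the three categories, reinforcement learning is often the hardest to picture, so here is a minimal sketch of one of its simplest settings: a multi-armed bandit, where the environment has no changing state and the agent only has to learn which action pays off most often. The epsilon-greedy strategy, the function name `run_bandit`, the reward probabilities, and the other parameter values are all assumptions chosen for illustration.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit; returns (value estimates, total reward)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # running mean reward per action
    pulls = [0] * len(reward_probs)        # how often each action was tried
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the highest estimate so far.
            action = max(range(len(estimates)), key=estimates.__getitem__)
        # The environment pays 1 with the chosen action's hidden probability.
        reward = 1 if rng.random() < reward_probs[action] else 0
        pulls[action] += 1
        # Incremental update of the running mean for the chosen action.
        estimates[action] += (reward - estimates[action]) / pulls[action]
        total_reward += reward
    return estimates, total_reward

# Hypothetical environment: action 1 pays off far more often than action 0.
estimates, total_reward = run_bandit([0.2, 0.8])
best_action = max(range(len(estimates)), key=estimates.__getitem__)
print(estimates, total_reward, best_action)
```

The agent is never told which action is better. It learns this only from the rewards it receives, which is what separates this setting from supervised learning with labeled data.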

Conclusion

This post has introduced some of the key foundational concepts in data science and machine learning. Understanding these basic terms and principles is a crucial step in your journey as a data scientist or machine learning enthusiast. It provides a springboard from which to dive deeper and explore more complex and specialized concepts.

For those looking to further extend their understanding, here are some excellent resources:

  1. Python for Data Analysis by Wes McKinney
  2. The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
  3. Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville