How to Get Started with Machine Learning Using Free Open Datasets (for Beginners)

I. Machine Learning

Machine learning is a multidisciplinary field that draws on probability theory, statistics, approximation theory, and algorithmic complexity theory. It uses computers to simulate or implement human learning behavior: a system acquires new knowledge or skills from data and reorganizes what it already knows so that its performance keeps improving.

There are various types of data and ways to model problems in machine learning. Here, we'll introduce machine learning methods based on different learning styles.

1. Supervised Learning

In supervised learning, the input data is called "training data," and each example carries a known label or result. Supervised learning is often used for classification and regression problems. During training, the model's predictions are compared with the actual labels of the training data, and the model is adjusted until it reaches the desired accuracy.
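As a minimal sketch of this compare-and-adjust loop, the following fits a one-variable linear model with gradient descent. The toy data, the learning rate, the epoch count, and the function name `train_linear` are illustrative assumptions, not part of any particular library.

```python
# Toy supervised learning: fit y = w*x + b by comparing predictions with labels.
# The data, learning rate, and epoch count are illustrative assumptions.

def train_linear(xs, ys, lr=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every labeled training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the model in the direction that reduces the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Labeled training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear(xs, ys)
print(round(w, 2), round(b, 2))
```

On this noise-free data the learned parameters should end up very close to the slope 2 and intercept 1 used to generate the labels.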

2. Unsupervised Learning

In unsupervised learning, data isn't specifically labeled. The learning model aims to infer some intrinsic structure in the data. Common applications include association rule learning and clustering.
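Clustering can be illustrated with a small k-means sketch on unlabeled one-dimensional points. The points, the starting centroids, and the function name `kmeans_1d` are assumptions chosen for illustration.

```python
# Toy unsupervised learning: 1-D k-means on unlabeled points.
# The points and the initial centroids are illustrative assumptions.

def kmeans_1d(points, centroids, iterations=10):
    clusters = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge only from the distances between the points.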

3. Semi-supervised Learning

In semi-supervised learning, only part of the input data is labeled. The model can be used for prediction, but it must first learn the intrinsic structure of the data so that it can organize the data sensibly before predicting. Typical applications are classification and regression, and the algorithms are often extensions of commonly used supervised learning algorithms.
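One way to extend a supervised algorithm to partially labeled data is self-training: fit on the few labeled examples, pseudo-label the rest, and refit. The sketch below is a simplified version (nearest-centroid classifier, no confidence threshold); the data, class names, and function names are illustrative assumptions.

```python
# Toy semi-supervised learning: simplified self-training with a
# nearest-centroid classifier. All data and names are illustrative assumptions.

def centroids_of(labeled):
    # Mean feature value per class, computed from (value, label) pairs.
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    # Predict the class whose centroid is closest to x.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def self_train(labeled, unlabeled, rounds=3):
    # Start from the few labeled examples only.
    centroids = centroids_of(labeled)
    for _ in range(rounds):
        # Pseudo-label the unlabeled points with the current model, then refit
        # on the labeled and pseudo-labeled examples together.
        pseudo = [(x, predict(centroids, x)) for x in unlabeled]
        centroids = centroids_of(list(labeled) + pseudo)
    return centroids

labeled = [(1.0, "small"), (9.0, "large")]
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]
model = self_train(labeled, unlabeled)
print(predict(model, 3.0), predict(model, 7.0))  # -> small large
```

Practical self-training usually keeps only the pseudo-labels the model is confident about; that filter is omitted here to keep the sketch short.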

4. Reinforcement Learning

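In reinforcement learning, the input data acts as feedback to the model. This differs from supervised learning, where the labeled data serves only as a way to check whether the model's output is right or wrong. In reinforcement learning the feedback goes straight back to the model, which must adjust immediately.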
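A common toy setting for this feedback-and-adjust loop is a two-armed bandit with an epsilon-greedy agent. The reward probabilities, the value of epsilon, the step count, and the function name `run_bandit` below are all illustrative assumptions.

```python
import random

# Toy reinforcement learning: an epsilon-greedy agent on a two-armed bandit.
# Reward probabilities, epsilon, and the step count are illustrative assumptions.

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # estimated value of each action
    counts = [0] * len(reward_probs)       # how often each action was taken
    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))
        else:
            action = max(range(len(reward_probs)), key=lambda a: estimates[a])
        # The environment returns a reward; there is no labeled "correct" action.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Update the running average for the chosen action right away.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.8])
print(counts)
print([round(e, 2) for e in estimates])
```

The agent is never told which arm is better; it ends up preferring the second arm purely because of the rewards it receives.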

Different learning styles have different strengths and applications. Supervised and unsupervised learning models are often used in enterprise data applications, while reinforcement learning is more common in robotics and other control-related fields. Semi-supervised learning is often used in image recognition, where unlabeled images are plentiful but labeled data is relatively scarce.

One of the most common problems during training is not having enough data, or not having data that suits the task.

Below, we'll introduce some popular open image datasets for machine learning, so you won't have to worry about running short of resources!

1. Dataset Name: Fashion-MNIST

Dataset Overview: Fashion-MNIST is a dataset of Zalando's article (product) images, consisting of 60,000 training examples and 10,000 test examples. Each example is a 28x28 grayscale image associated with a label from one of 10 classes. Fashion-MNIST is intended as a drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms.

Publisher: Zalando

Publication Date: 2017

Data Format: Image

Dataset Size: 35MB
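Because Fashion-MNIST is meant as a drop-in replacement for MNIST, its files use the same IDX binary layout. The following is a minimal sketch of a parser for that layout, checked against a tiny synthetic file rather than the real download. The helper names are hypothetical, `load_gz` expects a local path that you supply, and the class names are the ones I recall from the Fashion-MNIST README, so treat them as an assumption to verify.

```python
import gzip
import struct

# Class names as recalled from the Fashion-MNIST README (index = label id).
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def parse_idx_images(raw):
    # IDX image header: magic 2051, image count, rows, cols (big-endian uint32).
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file")
    size = rows * cols
    pixels = raw[16:]
    # Each image is a flat run of rows*cols unsigned bytes (0-255).
    images = [list(pixels[i * size:(i + 1) * size]) for i in range(count)]
    return images, rows, cols

def parse_idx_labels(raw):
    # IDX label header: magic 2049, label count; then one byte per label.
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError("not an IDX label file")
    return list(raw[8:8 + count])

def load_gz(path):
    # Hypothetical helper: read a downloaded .gz file from a local path.
    with gzip.open(path, "rb") as f:
        return f.read()

# Self-check on a tiny synthetic file: two 2x2 images with labels 9 and 0.
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 10, 20, 1, 2, 3, 4])
fake_labels = struct.pack(">II", 2049, 2) + bytes([9, 0])
images, rows, cols = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(rows, cols, [CLASS_NAMES[label] for label in labels])
```

With the real files, you would pass the bytes returned by `load_gz` to the two parsers and expect 28 rows and 28 columns per image.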

2. Dataset Name: SVHN

Dataset Overview: SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal data preprocessing and formatting. It's similar in style to MNIST but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly more difficult, unsolved real-world problem (recognizing digits in natural scene images). SVHN is derived from Google Street View house numbers.

Publisher: Stanford University

Data Format: Image

Dataset Size: 2GB
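One preprocessing detail is worth knowing before training on SVHN. As far as I understand the dataset's documentation, the cropped-digit `.mat` files store the digit 0 under the label 10. The sketch below remaps it. The function names and the file path are hypothetical, and the loader assumes SciPy is installed and that the file exposes the arrays `X` and `y`.

```python
def remap_svhn_labels(raw_labels):
    # Assumption: in the cropped-digit files, label 10 stands for the digit 0.
    return [0 if label == 10 else int(label) for label in raw_labels]

def load_svhn_labels(mat_path):
    # Hypothetical local path such as "train_32x32.mat"; requires SciPy.
    from scipy.io import loadmat
    data = loadmat(mat_path)
    # Assumed layout: data["X"] is (32, 32, 3, N) and data["y"] is (N, 1).
    return remap_svhn_labels(data["y"].ravel())

print(remap_svhn_labels([1, 10, 9, 10]))  # -> [1, 0, 9, 0]
```

After the remapping, the labels run from 0 to 9, matching the MNIST-style convention that most classification code expects.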

3. Dataset Name: RarePlanes

Dataset Overview: RarePlanes is a unique open-source machine learning dataset that combines real and synthetically generated satellite images. RarePlanes focuses on the value of synthetic data for helping computer vision algorithms automatically detect airplanes and their attributes in satellite images. While other synthetic/real combination datasets exist, RarePlanes is the largest publicly available very-high-resolution dataset built to test the value of synthetic data from an overhead perspective.


Data Format: Image

Dataset Size: 316MB

4. Dataset Name: CompCars

Dataset Overview: The CompCars dataset contains data from two scenarios: images collected from the web and images captured by surveillance cameras. The web data covers 163 car makes and 1,716 car models, with a total of 136,726 images of whole cars and 27,618 images of car parts. Whole-car images are labeled with bounding boxes and viewpoints. Each car model is annotated with five attributes: top speed, displacement, number of doors, number of seats, and car type. The surveillance data includes 50,000 front-view images of cars.

Publisher: Multimedia Laboratory, The Chinese University of Hong Kong

Publication Date: 2015

Data Format: Image

Dataset Size: 3GB

5. Dataset Name: nuScenes

Dataset Overview:

The nuScenes dataset is a large-scale autonomous driving dataset with 3D object annotations. It features:

● A full sensor suite (1x LiDAR, 5x RADAR, 6x cameras, IMU, GPS)

● 1,000 scenes of 20 seconds each

● 1.4 million camera images, 390,000 LiDAR scans

● Two different cities: Boston and Singapore (left-hand vs. right-hand traffic)

● Detailed map information

● 1.4M 3D bounding boxes manually annotated for 23 object classes

● Attributes such as visibility, activity, and pose

● New addition: 1.1B LiDAR points manually annotated for 32 classes

● New feature: explore nuScenes on SiaSearch

● Free for non-commercial use

Publisher: Motional

Data Format: Image, Point Cloud

Dataset Size: 62GB