Data Types in Machine Learning: The Complete Guide - Your Guide to Synthetic Passport Data

Pick the wrong data type handling for a feature and your model will either ignore the signal entirely or learn the wrong pattern from it. Data type classification in ML is not a formality — it determines preprocessing strategy, model compatibility, and feature engineering decisions before a single line of training code runs.

This guide covers every data type used in machine learning, how they relate to each other, what preprocessing each requires, and how type classification affects model selection. As of 2026, unstructured data accounts for approximately 80% of enterprise data [1] — understanding how to handle every data type is no longer optional for any ML team.

Why Data Types Matter in Machine Learning

Machine learning algorithms do not read data the way humans do. A model cannot infer that ‘red’, ‘blue’, and ‘green’ are colour categories unless those values are encoded numerically. It cannot treat a date field as a temporal signal unless that field is parsed and engineered into features the model can use. The data type of every feature determines what preprocessing is required before the model can learn from it.
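As a minimal sketch of the date example above, assuming dates arrive as ISO-formatted strings: the helper name `date_features` and the particular features chosen are illustrative assumptions, not something this guide prescribes.

```python
from datetime import date


def date_features(iso_date: str) -> dict:
    """Turn an ISO date string into simple numeric features (hypothetical helper)."""
    d = date.fromisoformat(iso_date)
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),           # Monday = 0, Sunday = 6
        "is_weekend": int(d.weekday() >= 5),  # 1 for Saturday or Sunday
    }


features = date_features("2024-03-16")
print(features)
```

Which features are worth deriving depends on the task; day of week may matter for retail demand and be irrelevant elsewhere.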

Getting this wrong produces three classes of problems. The first is silent: the model trains without error but learns a nonsense relationship — for example, treating postal codes as continuous numbers where 90210 > 10001 implies a meaningful magnitude. The second is an error: the model refuses to accept a non-numeric input. The third is subtle: the model learns a signal that exists in the training data but not in production, because the feature was encoded inconsistently between training and inference.

Understanding data types is the prerequisite for every downstream decision: feature scaling, missing value handling, encoding strategy, and model architecture selection all depend on it.

The Two Primary Data Type Classifications

ML data divides into two high-level categories — numerical (quantitative) and categorical (qualitative) — each with important subtypes. These are the foundational classifications that drive preprocessing and encoding decisions.

Numerical data

Numerical data is data that can be measured or counted and expressed as a number. It is the most directly usable data type for most ML algorithms: it can be fed into a model with minimal transformation, scaled, and used in mathematical operations.

Numerical data splits into two subtypes. Continuous numerical data can take any value within a range — height, temperature, income, probability scores. Discrete numerical data can only take specific, countable values — number of rooms, transaction count, items in a cart. The distinction matters for certain preprocessing steps (binning, for example, is more natural for continuous data) but both types are processed similarly by most algorithms.

Numerical data also splits by whether it has a meaningful zero point. Interval data has equal spacing between values but no true zero — temperature in Celsius is interval data (0°C does not mean no temperature). Ratio data has a true zero — weight, distance, and income are ratio data. This distinction rarely affects algorithm choice but matters for certain statistical operations.

Categorical data

Categorical data represents groups or categories rather than measurable quantities. A model cannot directly use raw categorical values — they must be encoded into numbers first. The subtype of categorical data determines which encoding strategy is appropriate.

Nominal data has no meaningful order. Gender, blood type, country, product category — none of these have a natural rank. Using label encoding (assigning integers 0, 1, 2) on nominal data implies an ordering that does not exist and can mislead the model. One-hot encoding or embedding is the correct approach.
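The one-hot idea can be sketched in plain Python as follows. The function names are hypothetical, and mapping an unseen category to an all-zero row is one possible policy rather than the only one. In practice a library encoder such as scikit-learn's `OneHotEncoder` (with `handle_unknown="ignore"`) or `pandas.get_dummies` would usually be used instead.

```python
def fit_one_hot(values):
    """Learn a fixed, sorted category list from the training values."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; a category not seen in training becomes all zeros."""
    return [1 if value == c else 0 for c in categories]


train_colours = ["red", "blue", "green", "blue"]
categories = fit_one_hot(train_colours)              # learned once, on training data
encoded_train = [one_hot(v, categories) for v in train_colours]
encoded_unseen = one_hot("purple", categories)       # value absent from training

print(categories)
print(encoded_train)
print(encoded_unseen)
```

The category list is learned once and reused at inference, so the column layout stays identical between training and production.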

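Ordinal data has a meaningful order, but the gaps between values are not necessarily equal. Survey ratings (poor, fair, good, excellent), education level, and income brackets are ordinal. Label encoding is appropriate here because the order carries real information, provided the integers are assigned in the true rank order rather than alphabetically or by order of appearance. Arithmetic operations on the encoded values should still be approached with caution, since the step from ‘fair’ to ‘good’ need not equal the step from ‘good’ to ‘excellent’.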
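A minimal sketch of order-preserving encoding for the survey-rating example, using an explicit rank list. The variable names are illustrative. The second mapping shows what goes wrong when integers are assigned alphabetically, which is how scikit-learn's `LabelEncoder` orders its classes; scikit-learn's `OrdinalEncoder` accepts an explicit `categories` argument for this reason.

```python
# Explicit rank order, lowest to highest (from the survey example above).
RATING_ORDER = ["poor", "fair", "good", "excellent"]
rating_to_rank = {label: rank for rank, label in enumerate(RATING_ORDER)}

responses = ["good", "poor", "excellent", "fair"]
encoded_ratings = [rating_to_rank[r] for r in responses]

# Alphabetical assignment scrambles the rank: 'excellent' ends up below 'poor'.
alphabetical = {label: i for i, label in enumerate(sorted(RATING_ORDER))}

print(encoded_ratings)
print(alphabetical)
```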

Summary of primary data types in ML:

Data Type	Subtype	Examples	Typical Preprocessing
Numerical	Continuous	Height, temperature, price, probability	Scaling (StandardScaler, MinMaxScaler)
Numerical	Discrete	Item count, transaction count, age in years	Scaling (optional)
Numerical	Interval	Temperature (°C/°F), credit score ranges	Scaling
Numerical	Ratio	Weight, distance, revenue, speed	Scaling
Categorical	Nominal	Country, colour, product category, blood type	One-hot encoding or embeddings
Categorical	Ordinal	Survey rating, education level, income bracket	Label encoding in explicit rank order (preserves order)

Structured, Unstructured, and Semi-Structured Data

Separate from the numerical/categorical distinction, data is also classified by how it is organised. This classification determines the tooling, preprocessing pipeline, and model architecture the project needs.

Structured data

Structured data is organised into rows and columns with a defined schema. Every field has a consistent type and meaning across all records. Relational databases, spreadsheets, and CSV files are structured. Structured data accounts for approximately 20% of enterprise data [1] but powers a disproportionate share of profitable ML applications: fraud detection, churn prediction, product recommendations, and credit scoring all run primarily on structured data.

Structured data is the most model-ready format. Classical ML algorithms such as gradient boosting, logistic regression, and random forests work directly with structured data after encoding and scaling. Libraries like scikit-learn, LightGBM, CatBoost, and XGBoost handle structured data out of the box.

Unstructured data

Unstructured data has no predefined format or schema. Images, audio, free-text, video, and PDFs are unstructured. It accounts for approximately 80% of enterprise data [1] and contains a large share of the signal that ML models in computer vision, NLP, and speech recognition are trained to extract. Processing unstructured data requires a preprocessing step to convert it into a form models can work with — image feature extraction, text tokenisation, audio spectrograms — typically handled by deep learning architectures.

Semi-structured data

Semi-structured data has some organisational markers but does not conform to a strict schema. JSON, XML, NoSQL database documents, and email files are semi-structured. They carry hierarchical structure that can be parsed into features but require additional extraction steps compared to clean tabular data. Semi-structured data is common in web scraping, API responses, and social media data pipelines.
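One way to picture the extra extraction step is to flatten a nested JSON record into a single-level row. The sketch below assumes records contain only nested objects and scalar values; the `flatten` helper and the sample record are hypothetical. `pandas.json_normalize` offers similar behaviour for real pipelines.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one level with dotted keys (lists are left as-is)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


raw = '{"user": {"id": 7, "geo": {"country": "DE"}}, "amount": 19.5}'
row = flatten(json.loads(raw))
print(row)
```

After flattening, each key behaves like a column, and the usual numerical or categorical handling applies to it.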

Comparison of data organisation types for ML projects:

Property	Structured	Semi-Structured	Unstructured
Format	Tables, rows, columns	JSON, XML, NoSQL docs	Images, audio, video, free text
Schema	Fixed, predefined	Flexible, self-describing	None
Share of enterprise data	~20%	Variable	~80%
ML tooling	Scikit-learn, XGBoost, LightGBM	Pandas, custom parsers	PyTorch, TensorFlow, HuggingFace
Typical use case	Fraud detection, churn, pricing	API data, web scraping, logs	Computer vision, NLP, speech
Preprocessing complexity	Low–Medium	Medium	High

Time-Series Data

Time-series data is sequential numerical or categorical data where the order of observations carries information. Stock prices, sensor readings, website traffic, patient vital signs, and demand forecasting data are all time-series. The defining feature is that the temporal relationship between observations matters — shuffling the rows destroys meaningful signal.

Time-series requires a different train/test split strategy than standard tabular data. Random splitting places observations from later periods in the training set while earlier ones sit in the test set, so the model is evaluated on a past it has effectively already seen. This is a form of data leakage. Chronological splitting (training on earlier periods, testing on later ones) is the correct approach. Appropriate models for time-series include LSTM and transformer-based architectures for sequence tasks, and gradient boosting with lag features and rolling statistics for tabular time-series tasks.
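A minimal sketch of lag features combined with a chronological split, assuming a single evenly spaced series. The function names, the lag choices, the 80/20 cut, and the sample numbers are all illustrative assumptions.

```python
def add_lag_features(series, lags=(1, 2)):
    """Build (lag_values, target) rows using only observations before time t."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        lag_values = [series[t - lag] for lag in lags]
        rows.append((lag_values, series[t]))
    return rows


def chronological_split(rows, train_fraction=0.8):
    """Earlier rows go to training, later rows to testing; no shuffling."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


daily_demand = [10, 12, 13, 15, 14, 18, 20, 21, 19, 24, 25, 27]
rows = add_lag_features(daily_demand)
train_rows, test_rows = chronological_split(rows)

print(len(train_rows), len(test_rows))
print(test_rows[0])
```

Every test row comes after every training row in time. The lag values in the first test rows reach back into the training period, which is acceptable because those values are already known when the prediction is made.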

Text Data

Text data is unstructured but warrants its own category in ML because of the range of preprocessing and modelling approaches available. Raw text is converted into numerical representations before model training. Classic approaches produce bag-of-words or TF-IDF vectors — sparse numerical representations that carry word frequency information but lose word order. Modern approaches use tokenisation and embedding models (BERT, GPT, sentence transformers) that encode semantic meaning and contextual relationships.
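The loss of word order in bag-of-words can be seen in a small plain-Python sketch. The whitespace tokeniser and helper names are simplifying assumptions; scikit-learn's `CountVectorizer` and `TfidfVectorizer` are the usual library counterparts.

```python
from collections import Counter


def tokenize(text):
    """Very simple tokeniser: lowercase and split on whitespace."""
    return text.lower().split()


def fit_vocabulary(documents):
    """Map each distinct training token to a fixed column index."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(text, vocabulary):
    """Count vector over the fitted vocabulary; unknown tokens are dropped."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


docs = ["the dog bit the man", "the man bit the dog"]
vocabulary = fit_vocabulary(docs)
vec_a = bag_of_words(docs[0], vocabulary)
vec_b = bag_of_words(docs[1], vocabulary)

print(vocabulary)
print(vec_a, vec_b)
```

The two sentences mean opposite things yet produce identical vectors, which is the information that contextual embedding models are designed to retain.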

Text data annotation — named entity recognition, sentiment tagging, intent classification, relation mapping — is a distinct annotation discipline from image annotation. Text classification with labelled datasets is one of the most common supervised learning tasks, with benchmarks like GLUE and SuperGLUE providing standardised evaluation across model families.

Image and Video Data

Image data is a three-dimensional numerical array: height, width, and colour channel values per pixel. A single RGB image of 224×224 pixels contains 150,528 numerical values. Despite being numerical at the raw level, image data is classified as unstructured because the spatial relationships between pixels carry meaning that raw numerical operations cannot capture — convolutional neural networks (CNNs) and vision transformers are designed specifically to process this structure.
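The value count above follows directly from the array shape, and the same sketch shows two common pixel normalisations, assuming 8-bit channels (0 to 255). The helper names are illustrative.

```python
height, width, channels = 224, 224, 3
value_count = height * width * channels  # one number per pixel per channel


def to_unit_range(pixel):
    """Map an 8-bit value from [0, 255] to [0, 1]."""
    return pixel / 255.0


def to_signed_range(pixel):
    """Map an 8-bit value from [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0


# A tiny 1x2 RGB "image" as nested lists: rows -> pixels -> channel values.
tiny_image = [[[0, 128, 255], [255, 255, 0]]]
normalised = [[[to_unit_range(v) for v in px] for px in row] for row in tiny_image]

print(value_count)
print(normalised)
```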

Video data extends image data across time, adding a fourth dimension (frames). It combines the challenges of image data — scale, variability, annotation cost — with temporal consistency requirements. Both image and video data require annotation before they can train supervised computer vision models.

Preprocessing Requirements by Data Type

Each data type has a standard preprocessing path. Applying the wrong preprocessing to a data type is one of the most common sources of silent performance degradation.

Data Type	Key Preprocessing Steps	Common Pitfall
Continuous numerical	Scaling (StandardScaler or MinMaxScaler); outlier handling; missing value imputation	Not scaling before scale-sensitive methods (distance-based KNN and SVM, variance-based PCA)
Discrete numerical	Scaling (optional); check for imbalance if used as target	Treating as continuous without checking distribution
Nominal categorical	One-hot encoding or target encoding; handle unseen categories at inference	Label encoding — introduces false ordinal relationship
Ordinal categorical	Label encoding in explicit rank order; check if gap equality matters for the task	One-hot encoding (loses order information); integers assigned alphabetically instead of by rank
Text	Tokenisation; stopword removal (task-dependent); embeddings or TF-IDF	Using raw string values; ignoring vocabulary mismatch between train and inference
Image	Normalise pixel values to [0,1] or [-1,1]; resize to consistent dimensions; augmentation	No normalisation; inconsistent resizing
Time-series	Chronological split; lag features; rolling statistics; stationarity check	Random split (data leakage); failing to handle non-stationarity
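For the scaling row in the table above, a minimal standardisation sketch follows. It assumes the statistics are learned on the training split only and then reused unchanged at inference; the function names and sample values are hypothetical. It uses the population standard deviation, which matches the behaviour of scikit-learn's `StandardScaler`.

```python
from statistics import mean, pstdev


def fit_standardiser(train_values):
    """Learn mean and population standard deviation from training data only."""
    return mean(train_values), pstdev(train_values)


def standardise(values, mu, sigma):
    """Apply z-score scaling with previously fitted statistics.

    Assumes sigma > 0; a constant feature would need separate handling.
    """
    return [(v - mu) / sigma for v in values]


train_income = [20000, 40000, 60000, 80000]
mu, sigma = fit_standardiser(train_income)

scaled_train = standardise(train_income, mu, sigma)
scaled_new = standardise([50000], mu, sigma)  # inference reuses mu and sigma

print(mu, sigma)
print(scaled_train, scaled_new)
```

Refitting the statistics on production data would shift the scaled values and reintroduce the train/inference inconsistency described earlier in this guide.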

Labelled vs Unlabelled Data

The final data type distinction relevant to ML is whether data carries a target label. Labelled data has a ground truth outcome or category assigned to each example — it is the input for supervised learning. Unlabelled data has no assigned output — it is the input for unsupervised learning (clustering, dimensionality reduction, anomaly detection) or can be combined with a small labelled set in semi-supervised learning.

The practical challenge is that labelling is expensive. Data annotation — the process of assigning labels to raw data — is the primary production cost for supervised ML projects. The ratio of labelled to unlabelled data available, and the cost of producing more labels, directly shapes which learning paradigm is viable for a given project.

Frequently Asked Questions

What are the main data types in machine learning?

Machine learning data divides into two primary classifications: numerical (continuous and discrete) and categorical (nominal and ordinal). Separately, data is classified by structure: structured (tabular), semi-structured (JSON, XML), and unstructured (images, audio, text, video). Time-series is a further category distinguished by its temporal ordering requirement. Each type requires different preprocessing before a model can use it.

What is the difference between nominal and ordinal data?

Nominal data has categories with no meaningful order — colour, country, product type. Ordinal data has categories with a meaningful rank — survey ratings, education level, income brackets. The distinction determines encoding strategy: nominal data requires one-hot encoding or embeddings to avoid implying a false order; ordinal data can use label encoding because the order carries real information.

What is the difference between discrete and continuous data?

Continuous data can take any value within a range — temperature, income, probability. Discrete data can only take specific countable values — number of purchases, years of experience, items in a cart. Both are numerical and processed similarly by most algorithms, but the distinction matters for visualisation, binning decisions, and interpreting model outputs.

Does data type affect which ML algorithm to use?

Yes. Classical ML algorithms (gradient boosting, logistic regression, SVMs) work directly with structured numerical and encoded categorical data. Deep learning architectures are the usual choice for raw image, audio, and text data because they learn spatial and sequential structure directly from the input, whereas classical algorithms depend on hand-built features such as TF-IDF vectors. Time-series data calls for either sequence models or lag-based feature engineering, and in both cases a chronological train/test split to prevent data leakage.

What is the difference between structured and unstructured data?

Structured data fits into rows and columns with a predefined schema — relational databases and spreadsheets. Unstructured data has no schema — images, audio, video, and free text. Structured data makes up about 20% of enterprise data but is the most immediately model-ready. Unstructured data accounts for about 80% and requires feature extraction or deep learning architectures before a model can learn from it.

What preprocessing does categorical data need?

Nominal categorical data needs one-hot encoding (a binary column for each category) or embedding-based encoding for high-cardinality features. Label encoding (assigning integers) is only appropriate for ordinal data where the order carries meaning. Applying label encoding to nominal features implies a false ordinal relationship that misleads the model and degrades performance.