Data science reference guide:

1 Data science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data `https://en.wikipedia.org/wiki/Data_science`

1.1 Machine learning

1.1.1 Experimental design

1.1.2 Predictive modelling

1.1.3 Optimization

1.1.4 Causal inference

Machine learning concepts

1.1.5 Regression \(y(t, f)\) and classification (metric) (te, ve, algorithm m1)

1.1.6 Clustering, feature selection

  1. Curse of dimensionality
  2. Bias-variance tradeoff, neural networks, SVM, etc.
  3. Statistical languages such as R or Python
  4. Scripting languages such as Python, sh, PHP, Perl

1.2 Techniques - Algorithms

1.2.1 Linear regression

A linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
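To make the definition concrete, here is a minimal sketch of an ordinary least-squares fit using NumPy's `np.linalg.lstsq`. The helper name `fit_linear` and the synthetic data are assumptions made for illustration only; they are not part of any library.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares. Returns (intercept, coefficients).

    Hypothetical helper written for this guide, not a library function.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        # A single explanatory variable: turn it into one column.
        X = X.reshape(-1, 1)
    # Prepend a column of ones so the first weight acts as the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w[0], w[1:]

# Synthetic, noise-free data generated from y = 1 + 2x (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
intercept, coef = fit_linear(x, y)
```

On this noise-free data the fit should recover an intercept close to 1 and a slope close to 2. With real, noisy data the estimates would only approximate the underlying relationship.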

1.2.2 Logistic regression

1.2.3 Linear SVM and Kernel SVM

Linear Support Vector Machine (SVM) and Kernel SVM

1.2.4 Trees and ensemble trees

1.2.5 Neural networks and deep learning

1.2.6 K-means/k-modes, GMM (Gaussian Mixture Model) clustering

1.2.7 DBSCAN

1.2.8 Hierarchical clustering

1.2.9 PCA, SVD and LDA

  1. PCA - Unsupervised method to understand global properties

    Use SciPy, scikit-learn

  2. Least squares and polynomial fitting for datasets with low dimensions

    Use NumPy, SciPy

  3. Constrained linear regression, which keeps the weights from misbehaving

    Use scikit-learn

  4. k-means, unsupervised clustering algorithm, expectation maximization algorithm

    Use scikit-learn

  5. Logistic regression, nonlinearity (sigmoid function), classification

    Use scikit-learn

  6. SVM, support vector machines, linear models->Loss function

    Use scikit-learn
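As one illustration of the list above, item 1 (PCA) can be sketched directly on top of an SVD, which also shows how the two techniques in this heading relate. This sketch uses NumPy's `np.linalg.svd` rather than the SciPy or scikit-learn routines the list recommends. The function name `pca_svd` and the toy data are assumptions for illustration.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD of the mean-centered data matrix.

    Hypothetical helper: returns (projected data, principal axes,
    explained variance per component).
    """
    X = np.asarray(X, dtype=float)
    # Center each feature; PCA describes variation around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Singular values relate to sample variance by S**2 / (n - 1).
    explained_variance = (S[:n_components] ** 2) / (len(X) - 1)
    projected = Xc @ components.T
    return projected, components, explained_variance

# Toy data: four points lying exactly on the line y = 2x.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Z, components, var = pca_svd(X, n_components=1)
```

Because the toy points lie on a single line, the first principal axis should be parallel to the direction (1, 2), up to sign. For real work, a maintained implementation such as the PCA class in scikit-learn is likely the better choice.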

1.2.10 Feedforward neural networks: multilayered logistic regression classifiers, with many layers separated by non-linearities

Use scikit-learn (neural networks), Keras
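The "stacked logistic regressions" view can be sketched as a forward pass in plain NumPy: each layer is an affine map followed by a sigmoid. The weights below are random and untrained, and the names `sigmoid`, `forward`, and `layers` are assumptions for illustration. This is not how scikit-learn or Keras implement their networks.

```python
import numpy as np

def sigmoid(z):
    """Logistic non-linearity, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Forward pass through a list of (weights, bias) pairs.

    Each layer computes sigmoid(a @ W + b), i.e. one logistic
    regression unit per output column.
    """
    a = np.asarray(x, dtype=float)
    for W, b in layers:
        a = sigmoid(a @ W + b)
    return a

rng = np.random.default_rng(0)
# Hypothetical architecture: 3 inputs -> 4 hidden units -> 1 output.
layers = [
    (rng.normal(size=(3, 4)), np.zeros(4)),
    (rng.normal(size=(4, 1)), np.zeros(1)),
]
batch = rng.normal(size=(5, 3))   # 5 samples, 3 features each
probs = forward(batch, layers)    # shape (5, 1), values in (0, 1)
```

Training (choosing the weights by minimizing a loss) is deliberately left out. In practice that part is what scikit-learn or Keras provide.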

1.2.11 Convolutional neural networks, vision-based machine learning, image classification
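The core operation of a convolutional layer can be sketched as a small sliding-window sum in NumPy. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. The function name `conv2d_valid`, the toy image, and the hand-picked kernel are assumptions for illustration. A real network learns its kernels and uses optimized library routines.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1.

    Hypothetical helper: computes the cross-correlation of a 2-D
    image with a 2-D kernel.
    """
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the kernel, summed.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 4x4 image: dark left half, bright right half.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
# Horizontal difference kernel: responds where brightness rises left to right.
kernel = np.array([[-1.0, 1.0]])
response = conv2d_valid(image, kernel)
```

The response is non-zero only in the column where the image changes from dark to bright, which is the kind of local feature a convolutional layer detects.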

2 Tools

2.1 NumPy

Scientific tools for Python. `https://numpy.org/`

2.2 SciPy

Open-source software for mathematics, science, and engineering. `https://www.scipy.org`

2.3 Scikit-learn

A set of Python modules for machine learning and data mining `https://sklearn.org/`

2.3.1 ScikitLearn.jl

Implements the popular scikit-learn interface and algorithms in Julia. It supports models from both the scikit-learn library (via PyCall) and the Julia ecosystem `https://github.com/cstjean/ScikitLearn.jl`

2.4 Keras

Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. `https://keras.io/`

2.5 TensorFlow

Library for computation using data flow graphs for scalable machine learning `https://www.tensorflow.org`

2.6 JuliaML

One-stop-shop for learning models from data. It provides general abstractions and algorithms for modeling and optimization, implementations of common models, tools for working with datasets, and much more `https://juliaml.github.io/`


2.7 LIBSVM

A Library for Support Vector Machines `https://www.csie.ntu.edu.tw/~cjlin/libsvm/`

Interfaces and extensions to LIBSVM:

2.8 MLJ

A Machine Learning Framework for Julia `https://github.com/alan-turing-institute/MLJ.jl`

3 To organize

3.1 Data mining and big data

3.1.1 Big data:

  • Volume: large amounts of data
  • Velocity: analyzing the data in a satisfactory time
  • Variety: different types of data
  1. Mining

    Discover patterns using

    1. Artificial intelligence
    2. Machine learning
    3. Statistics and database systems

    To extract information and generate business value

    • Behavior patterns; trends or prediction
    • Steps of KDD, knowledge discovery in databases
    • Understanding of the problem (business knowledge) -> data extraction (identifying the databases, i.e., tables, attributes, sources) ->
Created by: Ronaldo V. Lobato on 2021-04-11. Last Updated: 2021-04-14 Wed 15:28