Data science reference guide:

Table of Contents

1. Data science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. `https://en.wikipedia.org/wiki/Data_science`

1.1. Machine learning

1.1.1. Basic concepts of machine learning and data analysis and statistical concepts like expectation values, variance, covariance, correlation functions and errors;

1.1.2. Estimation of errors using cross-validation, blocking, bootstrapping and jackknife methods;
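
Two of the resampling methods named above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the sample values are made up for the example. For the sample mean, the jackknife estimate coincides with the textbook standard error s/sqrt(n), which gives a convenient sanity check.

```python
import math
import random
import statistics

def jackknife_mean_error(samples):
    """Jackknife estimate of the standard error of the sample mean."""
    n = len(samples)
    total = sum(samples)
    # Leave-one-out means: drop sample i and average the rest.
    loo_means = [(total - x) / (n - 1) for x in samples]
    loo_avg = sum(loo_means) / n
    variance = (n - 1) / n * sum((m - loo_avg) ** 2 for m in loo_means)
    return math.sqrt(variance)

def bootstrap_mean_error(samples, n_resamples=2000, seed=0):
    """Bootstrap estimate: spread of the mean over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    means = [statistics.mean(rng.choices(samples, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(means)

data = [2.1, 2.5, 1.9, 2.4, 2.2, 2.8, 2.0, 2.3]  # made-up measurements
jk_err = jackknife_mean_error(data)
bs_err = bootstrap_mean_error(data)
```

The bootstrap value fluctuates with the seed and the number of resamples, so it should only be expected to land near the jackknife value, not on it.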

1.1.3. Optimization of functions

1.1.4. Linear Regression and Logistic Regression;

1.1.5. Experimental design

1.1.6. Predictive modelling

1.1.7. Optimization

1.1.8. Clustering, feature selection

  1. Curse of dimensionality
  2. Dimensionality reduction, from PCA to clustering

    PCA vs MDS vs t-SNE

    In the linked example, t-SNE reveals clusters of same-type cells, while PCA and MDS fail to expose interesting structure in the data. `https://orangedatamining.com/blog/2021/2021-06-17-pca-mds-tsne/`

  3. Bias-variance tradeoff

1.1.9. Neural networks and deep learning, SVMs

  1. Convolutional Neural Networks

    Designed for processing structured arrays of data. Used in computer vision and natural language processing. It is a feed-forward neural network, often with 20-30 layers, built around a special kind of layer called the convolutional layer.

  2. Recurrent Neural Networks and Autoencoders
  3. Statistical languages such as R or Python
  4. Scripting languages such as Python, sh, PHP, Perl

1.1.10. Boltzmann machines;

1.1.11. Causal inference

Machine learning concepts

1.1.12. Decisions trees and random forests

1.1.13. Regression \(y(t, f)\) and classification (metric) (te, ve, algorithm m1)


1.2. Techniques - Algorithms

Getting a grip on what a model is capable of:

  1. Pick the right accuracy metric(s)

    Understand how the model will be used.

  2. Visualize the error metric(s)
  3. Choose a relevant accuracy benchmark
  4. Develop custom metric(s) that speak directly to the model context
  5. Investigate relevant sub-populations within the dataset

1.2.1. Linear regression

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
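
For a single explanatory variable, the least-squares fit has a closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below is a minimal standard-library illustration; `fit_line` and the data are hypothetical names and values chosen so the answer is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # points lying exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)
```

With several explanatory variables the same idea is usually written in matrix form and delegated to a library such as NumPy or scikit-learn.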

1.2.2. Logistic regression

1.2.3. Linear SVM and Kernel SVM

Linear Support Vector Machine (SVM) and Kernel SVM

1.2.4. Trees and ensemble trees

1.2.5. Neural networks

1.2.6. K-means/k-modes, GMM (Gaussian Mixture Model) clustering

1.2.7. DBSCAN

  1. Advantages of DBSCAN
    • Is great at separating clusters of high density versus clusters of low density within a given dataset.
    • Is great with handling outliers within the dataset.
  2. Disadvantages of DBSCAN
    • Does not work well when dealing with clusters of varying densities. It is good at separating high-density clusters from low-density ones, but struggles with clusters of similar density.
    • Problems with high dimensional data.
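
To make the density idea and the outlier handling above concrete, here is a small pure-Python sketch of the DBSCAN procedure. It is an illustration under assumptions, not a reference implementation: the parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a core point, counting the point itself) and the toy points are chosen for the example, and the neighbour search is brute force.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:   # j is also a core point: keep growing
                queue.extend(nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]                      # two dense squares and one isolated point
labels = dbscan(points, eps=1.5, min_pts=3)
```

Because a single `eps` and `min_pts` pair applies to every region, one setting cannot suit both a tight cluster and a loose one, which is where the varying-density weakness listed above comes from.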

1.2.8. Hierarchical clustering

    HDBSCAN: hierarchical density-based spatial clustering of applications with noise. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.

1.2.9. PCA, SVD and LDA

  1. PCA - Unsupervised method to understand global properties

    Use Scipy, scikit-learn

  2. Least squares and polynomial fitting for datasets with low dimensions

    Use NumPy, SciPy

  3. Constrained linear regression, weights do not misbehave

    Use scikit-learn

  4. k-means, unsupervised clustering algorithm, expectation maximization algorithm

    Use scikit-learn

  5. Logistic regression, nonlinearity (sigmoid function), classification

    Use scikit-learn

  6. SVM, support vector machines, linear models -> Loss function

    Use scikit-learn
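
Item 4 in the list above (k-means) alternates two steps, which is why it is often described as a hard-assignment relative of expectation maximization. The sketch below shows those two steps in one dimension with the standard library only; `kmeans_1d`, the starting centers, and the data are assumptions made for the illustration.

```python
def kmeans_1d(values, centers, n_iter=20):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group
        # (an empty group leaves its center where it was).
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers = kmeans_1d(values, centers=[0.0, 6.0])
```

The result depends on the starting centers; library implementations typically run several initializations and keep the best one.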

1.2.10. Feedforward neural networks: multilayered logistic regression classifiers, with many layers separated by non-linearities

Use scikit-learn (neural networks module), Keras
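
The "stacked logistic regressions" picture can be shown with a forward pass written by hand. In this sketch each layer is a set of logistic units, and the weights are hypothetical values picked by hand (not trained) so that two layers compute XOR, something a single logistic regression cannot represent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One layer: a weighted sum per unit followed by the sigmoid non-linearity."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the input through each (weights, biases) layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation

# Hypothetical hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),  # hidden units: OR-like, NAND-like
    ([[20.0, 20.0]], [-30.0]),                        # output unit: AND of the hidden units
]
outputs = {xy: forward(list(xy), layers)[0]
           for xy in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

In practice the weights are learned from data, for example with the scikit-learn or Keras tools named above.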

1.2.11. Convolutional neural networks, vision-based machine learning, image classification
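
The core operation of a convolutional layer is sliding a small kernel over the image and summing the element-wise products at each position. The sketch below assumes no padding and stride 1, and, like most deep learning libraries, does not flip the kernel (strictly speaking a cross-correlation). The image and kernel are made-up examples.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]

# 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to a left-to-right increase in intensity.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
```

The feature map is non-zero only where the kernel straddles the edge, which is the sense in which a convolutional layer detects local patterns.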

1.3. Big data

1.3.1. Data Preprocessing

Normalization: There are many methods for data normalization, including min-max normalization, z-score normalization, and normalization by decimal scaling.

  • Min-max normalization (range normalization)

    Performs a linear transformation on the original data

    \[ v'=\frac{v - {\rm min}}{{\rm max - min}} \]

  • Z-score normalization (zero-mean normalization, standard score normalization), the values for an attribute \(A\) are normalized by the mean and standard deviation of \(A\)

    \[ v'=\frac{v - \mu}{\sigma} \]
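
Both formulas translate directly into code. This is a minimal standard-library sketch; the function names and data are illustrative, and the choice of the population standard deviation for sigma is an assumption, since the text does not say which variant is meant.

```python
import statistics

def min_max_normalize(values):
    """v' = (v - min) / (max - min), mapping the data onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_normalize(values):
    """v' = (v - mu) / sigma, giving zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

values = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = min_max_normalize(values)
standardized = z_score_normalize(values)
```

Both functions divide by zero when all values are equal, so a real pipeline would need to handle constant attributes separately.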

  1. One-hot encoding

    Process of converting categorical data variables (label values) into numeric 0/1 vectors so they can be provided to machine learning algorithms.
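
As a sketch of the idea, each distinct label gets its own column, and every row has a single 1 in the column of its label. The function name, the alphabetical column order, and the colour labels below are choices made for the illustration.

```python
def one_hot_encode(labels):
    """Map each categorical label to a 0/1 vector with a single 1."""
    categories = sorted(set(labels))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
```

Here `categories` gives the meaning of each column, so the encoding can be reversed or applied consistently to new data.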

1.4. Elements of AI

Predicting the stock market by fitting a curve to past data about stock prices -> Kind of AI -> Fitting a simple curve is not really AI. But there are so many different curves to choose from, even if there's a lot of data to constrain them, that one needs machine learning/AI to get useful results.

A GPS navigation system for finding the fastest route. The signal processing and geometry used to determine the coordinates isn't AI, but providing good suggestions for navigation (shortest/fastest routes) is AI, especially if variables such as traffic conditions are taken into account.

Photo editing features such as color balance, contrast, and so on, are neither adaptive nor autonomous, but the developers of the applications may use some AI to automatically tune the filters.

Machine learning:

Systems that improve their performance in a given task with more and more experience or data.

    machine learning* is part of AI, which is part of Computer Science

*Deep learning -> the "depth" refers to the complexity of a mathematical model

  1. Definition of AI `cool things that computers can't do`
  2. Machine imitating intelligent human behavior
  3. Autonomous and adaptive systems
  1. State space -> Set of possible solutions
  2. Transitions -> possible moves between one state and another
  3. Costs -> Transitions can differ in ways that make some of them preferable or cheaper.
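
The three ingredients above (state space, transitions, costs) are enough to write a search. The sketch below uses uniform-cost search, one standard choice among several, on a made-up graph; the state names and costs are hypothetical.

```python
import heapq

def cheapest_path(transitions, start, goal):
    """Uniform-cost search over a dict: state -> [(next_state, cost), ...]."""
    frontier = [(0, start, [start])]   # (cost so far, state, path taken)
    best = {}                          # cheapest known cost per expanded state
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        if state in best and best[state] <= cost:
            continue                   # already expanded via a cheaper route
        best[state] = cost
        for nxt, step in transitions.get(state, []):
            heapq.heappush(frontier, (cost + step, nxt, path + [nxt]))
    return None                        # goal not reachable

# Hypothetical state space: states A..D, with a cost on each transition.
transitions = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
}
result = cheapest_path(transitions, "A", "D")
```

The direct-looking moves A -> C and B -> D are more expensive than going through every state, so the search returns the longer but cheaper route.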

1.4.1. Data science -> Recent umbrella


  1. Machine learning and statistics
  2. Some aspects of CS
    1. Algorithms
    2. Data storage
    3. Web app dev

      Those need CS and AI. However, they also involve a lot of statistics, business, and law -> Part of CS.

1.4.2. Euler diagram -> Relates to Venn Diagrams

  1. CS -> Relatively broad field. Includes AI and other subfields such as distributed computing, human-computer interaction, and software engineering.

1.4.3. Natural Language Processing


  • Search auto-correct and autocomplete: the driving engine behind search autocomplete is the language model.
  • Language translator: machine translation is the procedure of automatically converting text in one language to another language while keeping the meaning intact.
  • Social media monitoring: NLP techniques are used by companies to analyse social media posts and know what customers think about their products.
  • Chatbots: help companies achieve the goal of a smooth customer experience.
  • Survey analysis
  • Targeted advertising
  • Hiring and recruitment
  • Voice assistants
  • Grammar checkers
  • Email filtering

2. Tools

2.1. Jupyter


Project Jupyter is a project and community whose goal is to "develop open-source software, open-standards, and services for interactive computing across dozens of programming languages".

2.1.1. Extensions:

  1. nbdime


    Jupyter Notebook Diff and Merge tools

    nbdime provides tools for diffing and merging of Jupyter Notebooks.

    • nbdiff compare notebooks in a terminal-friendly way
    • nbmerge three-way merge of notebooks with automatic conflict resolution
    • nbdiff-web shows you a rich rendered diff of notebooks
    • nbmerge-web gives you a web-based three-way merge tool for notebooks
    • nbshow present a single notebook in a terminal-friendly way
  2. jupyterlab-drawio


    A JupyterLab extension for embedding drawio / mxgraph.

  3. JupyterLab Top Bar


    Monorepo to experiment with the top bar space in JupyterLab.

  4. jupyterlab-spellchecker


    A JupyterLab extension highlighting misspelled words in markdown cells within notebooks and in the text files.

  5. aquirdturtle_collapsible_headings


    Make headings collapsible like the old Jupyter notebook extension and like Mathematica notebooks.

  6. jupyterlab-git


    A JupyterLab extension for version control using Git

  7. jupyterlab-go-to-definition

    Jump to definition of a variable or function in JupyterLab notebook and file editor.


  8. jupyterlab-jupytext


    Jupytext is a plugin for Jupyter that can save Jupyter notebooks as either:

    1. Markdown files (or MyST Markdown files, or R Markdown documents)
    2. Scripts in many languages.

    Common use cases for Jupytext are:

    1. Doing version control on Jupyter Notebooks
    2. Editing, merging or refactoring notebooks in your favorite text editor
    3. Applying Q&A checks on notebooks.

2.1.2. Tricks

  1. Shell commands

    Put an exclamation mark before a command to run it in the shell, e.g. `!ls`.

  2. View a List of Shortcuts
    1. Open up a Jupyter Notebook.
    2. Activate the command mode (press Esc).
    3. Press the H key.
    4. See the list of all the shortcuts.
  3. Magic commands

    Magic commands use the % prefix. View the list of all the available magic commands with %lsmagic

  4. Measure Cell Execution Time

    Put %%time at the top of a cell to get the elapsed time of running that cell of code.

  5. View the Documentation of a Method

    To view the documentation of a method, highlight the method and press Shift + Tab.
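
Outside a notebook, the measurement from trick 4 can be approximated in plain Python. This sketch reports wall-clock time only, whereas %%time also prints CPU times; the helper name `timed` is made up for the example.

```python
import time

def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

A single run is noisy; for small snippets the standard `timeit` module, which repeats the call many times, gives steadier numbers.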

2.3. Python

2.3.1. Numpy

Scientific tools for Python. `https://numpy.org/`

2.3.2. Scipy

Open-source software for mathematics, science, and engineering. `https://www.scipy.org`

2.3.3. Scikit-learn

A set of python modules for machine learning and data mining `https://sklearn.org/`

2.3.4. Keras

Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. `https://keras.io/`

2.3.5. Tensorflow

Library for computation using data flow graphs for scalable machine learning `https://www.tensorflow.org`

2.4. Julia

2.4.1. JuliaML

One-stop-shop for learning models from data. It provides general abstractions and algorithms for modeling and optimization, implementations of common models, tools for working with datasets, and much more `https://juliaml.github.io/`

2.4.2. MLJ

A Machine Learning Framework for Julia `https://github.com/alan-turing-institute/MLJ.jl`

2.4.3. DataFrames.jl

DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool, especially for those coming to Julia from R or Python.


2.4.4. Turing.jl

Turing.jl is a Julia library for general-purpose probabilistic programming. Turing allows the user to write models using standard Julia syntax, and provides a wide range of sampling-based inference methods for solving problems across probabilistic machine learning, Bayesian statistics, and data science.




2.4.5. ScikitLearn.jl

Implements the popular scikit-learn interface and algorithms in Julia. It supports both models from the scikit-learn library via PyCall and models from the Julia ecosystem.


2.4.6. FastAI

FastAI.jl is inspired by fastai, and is a repository of best practices for deep learning in Julia.


2.4.7. SmartTensors

SmartTensors is a general high-performance framework for Unsupervised, Supervised and Physics-Informed Machine Learning and Artificial Intelligence (ML/AI).

SmartTensors includes a series of alternative ML/AI methods / algorithms (NMFk, NTFk, NTTk, SVR, etc.) coupled with constraints (sparsity, nonnegativity, physics, etc.).


2.5. R

2.5.1. proxy

Provides an extensible framework for the efficient calculation of auto- and cross-proximities, along with implementations of the most popular ones.


2.5.2. ggplot2

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.


2.5.3. tidyr

The goal of tidyr is to help you create tidy data. Tidy data is data where:

  1. Every column is a variable.
  2. Every row is an observation.
  3. Every cell is a single value.

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in vignette("tidy-data")


2.5.4. tm

A framework for text mining applications within R.


2.5.5. dplyr

A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


2.5.6. tidytext

Using tidy data principles can make many text mining tasks easier


2.5.7. tidyverse

The tidyverse is an opinionated collection of R packages designed for data science


2.5.8. udpipe

Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit


2.5.9. corpus

Text corpus data analysis, with full support for international text (Unicode). Functions for reading data from newline-delimited 'JSON' files, for normalizing and tokenizing text, for searching for term occurrences, and for computing term occurrence frequencies, including n-grams.


3. To organize

3.1. Data mining and big data

3.1.1. Big data:

A set of methodologies used to capture, store, and process large volumes of information from several sources (structured and unstructured) in order to speed up decision making and seek a competitive advantage.

  • Volume: large amounts of data
  • Velocity: analysing the data in a satisfactory time
  • Variety: different types of data
  1. Types of data
    • Structured data
    • Semi-structured data
    • Unstructured data
    1. Structured data

      Information stored in databases with a fixed structure. A table is an example.

    2. Semi-structured data

      Irregular data with an embedded structure. Heterogeneous structure. Easy to share over the internet.

      Examples: XML (eXtensible Markup Language) and JSON (JavaScript Object Notation). JSON is lightweight for transferring information and uses fewer bytes than XML, which matters when transferring thousands of records.

    3. Unstructured data

      Data with no predefined structure. Texts are examples, as are photos, videos, and voice. One challenge is extracting information from unstructured data.

      Most of the data generated is unstructured.
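
The size remark about JSON and XML in the semi-structured data item can be checked on a small example. The record below is made up, and the comparison uses compact JSON against XML with one element per field and no XML declaration; other XML layouts (for example attributes instead of child elements) would give different numbers.

```python
import json
import xml.etree.ElementTree as ET

record = {"id": 1, "name": "Ana", "city": "Recife"}  # made-up record

# JSON: keys and values only.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# XML: every field needs an opening and a closing tag.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_bytes = ET.tostring(root)  # bytes, without an XML declaration

print(len(json_bytes), "bytes as JSON;", len(xml_bytes), "bytes as XML")
```

The per-record difference is small, but it is repeated for every record, which is why it becomes relevant when thousands of records are transferred.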

  2. Favourable environment for big data

    The environment is favourable for extracting information from data because of:

    • Low storage cost
    • Increased processing power
    • The need for fast and assertive decisions
    1. When designing a big data project, pay attention to:
      1. Volume
        • More than 2.5 exabytes (source: IBM) of data per day. Approximately 90% of the data was generated in the last 2 years.
        • Over the next 5 years, the volume is expected to double every year.
      2. Velocity

        Speed in decision making is vital for gaining competitiveness. Decisions need to be made in real time.

        • Fraud detection
        • Product offers
        • Identification of serious illness
      3. Variety

        Decisions are made on structured and unstructured data. More than 70% is unstructured data.

      4. Veracity
      5. Vulnerability

      6. Visualization

        Tableau and Power BI

      7. Value

        Projects must generate value

  • Data mining

    Discovering patterns using

    1. Artificial intelligence
    2. Machine learning
    3. Statistics and database systems

    to extract information and generate business value

    • Behaviour patterns; trends or prediction
    • Stages of KDD (knowledge discovery in databases)
    • Understanding the problem (business knowledge) -> data extraction (identifying the data sources, i.e., tables, attributes, sources such as the internet) -> modelling and transformation (the most important stage): fundamental actions (attribute selection, cleaning of inconsistent data, handling of anomalies, conversion, transformation) -> data mining: choice of the mining method, evaluation through metrics -> interpretation of the data.
    • Data mining and machine learning (looking for patterns that make sense for solving problems) -> learning with models. Types of learning: supervised learning -> data samples with their respective classes; unsupervised learning -> no labelled data is given, and the method tries to find characteristics in the input data -> applications: grouping similar information and finding anomalies in the data, which leads to valuable information (e.g. fraud in financial transactions).
  • Applications
    1. Management and sales (seasonality in the number of sales)
    2. Technology (recommendations: Netflix, Amazon)
    3. Administration and marketing (customers)
    4. Education
    5. Health
  • Libraries for working with data mining and big data
    1. Python: jupyter, numpy, matplotlib, pandas, scikit-learn, nltk, scrapy, pymongo.

      (Anaconda: platform for data science)

    2. R
    3. (Projects -> rapidminer, weka)
    4. Pandas: DataFrame -> data structure
    4. Pandas: Dataframe -> estrutura de dados
  • 3.1.2. Artificial neural networks

    1. Weight adjustment
    2. Delta
    3. Gradient descent
    4. Backpropagation
    5. Neural networks with scikit-learn
    6. Single-layer perceptron
      1. Applications of neural networks
      2. Types of machine learning
      3. Perceptron
      4. Training/weight adjustment in perceptrons
      5. Implementation of a neural network with one layer
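
The weight-adjustment and delta items above can be illustrated with a single-layer perceptron trained on the logical AND function. This is a minimal sketch under assumptions: a step activation, a learning rate of 1.0, zero initial weights, and hypothetical function names. With one layer there is nothing to backpropagate through, so the update uses the output error directly.

```python
def predict(weights, bias, inputs):
    """Step activation: output 1 when the weighted sum is non-negative."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation >= 0 else 0

def train_perceptron(samples, learning_rate=1.0, epochs=20):
    """Single-layer perceptron trained with the delta (error-correction) rule."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            delta = target - predict(weights, bias, inputs)  # error on this sample
            # Weight adjustment: shift each weight in proportion to its input and the error.
            weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
    return weights, bias

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
```

AND is linearly separable, so the updates stop changing once every sample is classified correctly; a function such as XOR would never settle with a single layer.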

    3.2. SQL

    SELECT command

    Fetches data

    • The SQL language is not the same for every DBMS (SQL Server uses Transact-SQL).
    • The primary key of a table is an attribute or set of attributes that uniquely identifies a row.
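
Both points can be tried with the SQLite engine bundled with Python. The table and rows below are made up for the example, and the SQL is SQLite's dialect; as noted above, other DBMSs differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Bruno")],
)

# SELECT fetches data; the WHERE clause uses the primary key to pick one row.
rows = conn.execute("SELECT id, name FROM customer WHERE id = ?", (2,)).fetchall()

# The primary key identifies a row uniquely, so a duplicate id is rejected.
try:
    conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Carla')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
conn.close()
```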


    Created by: Ronaldo V. Lobato on 2021-04-11. Last Updated: 2021-12-13 lun 22:02