Data science reference guide
Table of Contents
- 1. Data science
- 1.1. Machine learning
- 1.1.1. Basic concepts of machine learning and data analysis and statistical concepts like expectation values, variance, covariance, correlation functions and errors;
- 1.1.2. Estimation of errors using cross-validation, blocking, bootstrapping and jackknife methods;
- 1.1.3. Optimization of functions
- 1.1.4. Linear Regression and Logistic Regression;
- 1.1.5. Experimental design
- 1.1.6. Predictive modelling
- 1.1.7. Optimization
- 1.1.8. Clustering and feature selection
- 1.1.9. Neural networks and deep learning, SVMs
- 1.1.10. Boltzmann machines;
- 1.1.11. Causal inference
- 1.1.12. Decision trees and random forests
- 1.1.13. Regression and classification
- 1.1.14. Distance measures
- 1.2. Techniques - Algorithms
- 1.2.1. Linear regression
- 1.2.2. Logistic regression
- 1.2.3. Linear SVM and Kernel SVM
- 1.2.4. Trees and ensemble trees
- 1.2.5. Neural networks
- 1.2.6. K-means/k-modes, GMM (Gaussian Mixture Model) Clustering
- 1.2.7. DBSCAN
- 1.2.8. Hierarchical clustering
- 1.2.9. PCA, SVD and LDA
- 1.2.10. Feedforward neural networks: multilayered logistic regression classifiers, with many layers separated by non-linearities
- 1.2.11. Convolutional neural networks: vision-based machine learning, image classification
- 1.3. Big data
- 1.4. Elements of AI
- 2. Tools
- 3. To organize
1. Data science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. `https://en.wikipedia.org/wiki/Data_science`
1.1. Machine learning
1.1.1. Basic concepts of machine learning and data analysis and statistical concepts like expectation values, variance, covariance, correlation functions and errors;
1.1.2. Estimation of errors using cross-validation, blocking, bootstrapping and jackknife methods;
1.1.3. Optimization of functions
1.1.4. Linear Regression and Logistic Regression;
1.1.5. Experimental design
1.1.6. Predictive modelling
1.1.7. Optimization
1.1.8. Clustering and feature selection
- Curse of dimensionality
- Dimensionality reductions, from PCA to clustering
- MDS
Multidimensional scaling is a visual representation of distances or dissimilarities between sets of objects.
PCA vs MDS vs t-SNE
t-SNE discovers clusters of same-type cells, while PCA and MDS can fail to expose interesting data structures (see the sketch after this list). `https://orangedatamining.com/blog/2021/2021-06-17-pca-mds-tsne/`
- Bias-variance tradeoff
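A quick way to see this comparison in practice is a minimal sketch with scikit-learn; the digits dataset and the subsampling are assumptions made for the example, not from the source above:

```python
# Compare PCA, MDS and t-SNE embeddings on scikit-learn's digits dataset.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

X, y = load_digits(return_X_y=True)
X = X[:500]  # subsample: MDS and t-SNE scale poorly with the number of samples

emb_pca = PCA(n_components=2).fit_transform(X)    # linear projection
emb_mds = MDS(n_components=2).fit_transform(X)    # preserves pairwise distances
emb_tsne = TSNE(n_components=2).fit_transform(X)  # preserves local neighborhoods

# Plotting emb_tsne colored by y typically shows well-separated digit clusters,
# while the linear PCA projection often overlaps them.
```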
1.1.9. Neural networks and deep learning, SVMs
- Convolutional Neural Networks
Designed for processing structured arrays such as images. Used in computer vision and natural language processing. A CNN is a feed-forward neural network, often 20-30 layers deep, with a special kind of layer called the convolutional layer (see the sketch after this list).
- Physics-informed neural networks (PINN)
- PINNs have superior approximation and generalization capabilities, which has made them popular for solving high-dimensional problems and partial differential equations (PDEs); they have been applied in weather, healthcare, and manufacturing.
- Recurrent Neural Networks and Autoencoders
- Statistical languages such as R or Python
- Scripting languages such as Python, sh, PHP, Perl
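To make the CNN bullet above concrete, here is a hedged sketch of a small convolutional network in Keras; the input shape, layer sizes, and 10-class output are illustrative assumptions, not taken from the text:

```python
# A small convolutional network in Keras (assumes TensorFlow is installed).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # e.g. grayscale images
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolutional layer
    layers.MaxPooling2D(pool_size=2),                     # downsample feature maps
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # 10-class classifier
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```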
1.1.10. Boltzmann machines;
1.1.11. Causal inference
Machine learning concepts
1.1.12. Decision trees and random forests
1.1.13. Regression and classification
1.2. Techniques - Algorithms
To get a grip on what a model is capable of:
- Pick the right accuracy metric(s)
- Understand how the model will be used
- Visualize the error metric(s)
- Choose a relevant accuracy benchmark
- Develop custom metric(s) that speak directly to the model context
- Investigate relevant sub-populations within the dataset
1.2.1. Linear regression
A linear approach to modelling the relationship between a scalar response and one or more explanatory variables, as in the sketch below.
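A minimal sketch using scikit-learn's standard LinearRegression API; the synthetic one-variable dataset and its coefficients are made up for illustration:

```python
# Fit a line to noisy synthetic data and recover its slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one explanatory variable
y = 2.5 * X[:, 0] + 1.0 + rng.normal(0, 1, 100)    # scalar response with noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)               # approximately [2.5] and 1.0
```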
1.2.2. Logistic regression
1.2.3. Linear SVM and Kernel SVM
Linear Support Vector Machine (SVM) and Kernel SVM; the sketch below contrasts the two.
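A small sketch contrasting the two in scikit-learn; the moons dataset is an assumption, chosen because it is not linearly separable:

```python
# Linear vs kernel SVM on a dataset with a non-linear class boundary.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # linear decision boundary
kernel_svm = SVC(kernel="rbf").fit(X, y)      # non-linear (RBF kernel) boundary

# The kernel SVM typically scores higher here because the classes interleave.
print(linear_svm.score(X, y), kernel_svm.score(X, y))
```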
1.2.4. Trees and ensemble trees
1.2.5. Neural networks
1.2.6. K-means/k-modes, GMM (Gaussian Mixture Model) Clustering
1.2.7. DBSCAN
- Advantages of DBSCAN
- Is great at separating clusters of high density versus clusters of low density within a given dataset.
- Is great with handling outliers within the dataset.
- Disadvantages of DBSCAN
- Does not work well when dealing with clusters of varying densities: a single density threshold (eps) separates high-density regions from low-density regions well, but cannot accommodate clusters whose densities differ from each other.
- Has problems with high-dimensional data.
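A short sketch illustrating both points with scikit-learn's DBSCAN; the synthetic high- and low-density blobs and the eps/min_samples values are assumptions for the example:

```python
# DBSCAN on data mixing a high-density and a low-density region.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))   # high-density cluster
sparse = rng.normal(loc=5.0, scale=1.5, size=(30, 2))   # low-density cluster
X = np.vstack([dense, sparse])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Label -1 marks outliers; with a single eps, many points of the sparse
# cluster end up as noise, showing the varying-density weakness.
print(set(labels))
```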
1.2.8. Hierarchical clustering
1.2.9. PCA, SVD and LDA
- PCA - Unsupervised method to understand global properties
Use Scipy, scikit-learn
- Least squares and polynomial fitting for datasets with low dimensions
Use NumPy, SciPy
- Constrained linear regression, so that the weights do not misbehave
Use scikit-learn
- k-means, an unsupervised clustering algorithm (an expectation-maximization-style algorithm); see the sketch after this list
Use scikit-learn
- Logistic regression, nonlinearity (sigmoid function), classification
Use scikit-learn
- SVM, support vector machines, linear models -> Loss function
Use scikit-learn
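A minimal k-means sketch with scikit-learn, as suggested in the list above; the blobs dataset and k=3 are illustrative assumptions:

```python
# Cluster synthetic blob data with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)    # learned centroids
print(km.labels_[:10])        # cluster assignment per sample
```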
1.2.10. Feedforward neural networks: multilayered logistic regression classifiers, with many layers separated by non-linearities
Use scikit-learn (neural networks module) or Keras, as in the sketch below.
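A hedged sketch using scikit-learn's MLPClassifier, one of the tools suggested above; the digits dataset and the hidden-layer sizes are assumptions for the example:

```python
# A feedforward (multi-layer) classifier with scikit-learn's neural network module.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers separated by non-linearities (ReLU): stacked
# logistic-regression-like units, as described above.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=500, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```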
1.2.11. Convolutional neural networks: vision-based machine learning, image classification
1.3. Big data
1.3.1. Data Preprocessing
Normalization: there are many methods for data normalization, e.g. min-max normalization, z-score normalization, and normalization by decimal scaling.
- Min-max normalization (range normalization) performs a linear transformation on the original data, mapping values into \([0, 1]\):
\[ v'=\frac{v - \min}{\max - \min} \]
- Z-score normalization (zero-mean normalization, standard score normalization), the values for an attribute \(A\) are normalized by the mean and standard deviation of \(A\)
\[ v'=\frac{v - \mu}{\sigma} \]
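Both formulas translate directly into NumPy; the sample vector below is made up for illustration:

```python
# Direct NumPy translation of the two normalization formulas above.
import numpy as np

v = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

v_minmax = (v - v.min()) / (v.max() - v.min())   # maps values into [0, 1]
v_zscore = (v - v.mean()) / v.std()              # zero mean, unit variance

print(v_minmax)   # [0.   0.25 0.5  0.75 1.  ]
print(v_zscore)
```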
1.4. Elements of AI
Predicting the stock market by fitting a curve to past data about stock prices -> What kind of AI is this? Fitting a simple curve is not really AI. But there are so many different curves to choose from, even when there is a lot of data to constrain them, that one needs machine learning/AI to get useful results.
A GPS navigation system for finding the fastest route. The signal processing and geometry used to determine the coordinates aren't AI, but providing good suggestions for navigation (shortest/fastest routes) is AI, especially if variables such as traffic conditions are taken into account.
Photo editing features such as color balance, contrast, and so on are neither adaptive nor autonomous, but the developers of the applications may use some AI to automatically tune the filters.
Machine learning:
Systems that improve their performance on a given task with more and more experience or data.
Euler diagram of the nested fields: machine learning* sits inside AI, which sits inside Computer Science.
*Deep learning -> a subfield of machine learning, distinguished by the complexity of its mathematical models
- Definition of AI `cool things that computers can't do`
- Machine imitating intelligent human behavior
- Autonomous and adaptive systems
- State space -> Set of possible solutions
- Transitions -> possible moves between one state and another
- Costs -> Transitions can differ in ways that make some transitions more preferable or cheaper.
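These three notions can be made concrete as a weighted graph plus a least-cost search; everything in this sketch (the states, the costs, the Dijkstra-style search) is an illustrative assumption, not from the source:

```python
# States, transitions and costs as a weighted graph, searched for least cost.
import heapq

transitions = {                      # state -> [(next_state, cost), ...]
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}

def cheapest_cost(start, goal):
    """Return the minimum total transition cost from start to goal."""
    queue, seen = [(0.0, start)], set()
    while queue:
        cost, state = heapq.heappop(queue)
        if state == goal:
            return cost
        if state in seen:
            continue
        seen.add(state)
        for nxt, step in transitions[state]:
            heapq.heappush(queue, (cost + step, nxt))
    return float("inf")

print(cheapest_cost("A", "D"))   # 3.0 via A -> B -> C -> D
```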
1.4.1. Data science -> a recent umbrella term
Covers:
1.4.2. Euler diagram -> Relates to Venn Diagrams
1.4.3. Natural Language Processing
Applications:
- Search auto-correct and autocomplete: the driving engine behind search autocomplete is the language model.
- Language translation: machine translation is the procedure of automatically converting text from one language to another while keeping the meaning intact.
- Social media monitoring
NLP techniques are used by companies to analyse social media posts and know what customers think about their products.
- Chatbots: help companies achieve the goal of a smooth customer experience
- Survey analysis
- Targeted advertising
- Hiring and recruitment
- Voice assistants
- Grammar checkers
- Email filtering
2. Tools
2.1. Jupyter
Project Jupyter is a project and community whose goal is to "develop open-source software, open-standards, and services for interactive computing across dozens of programming languages".
2.1.1. Extensions:
- nbdime
`https://github.com/jupyter/nbdime`
Jupyter Notebook Diff and Merge tools
nbdime provides tools for diffing and merging of Jupyter Notebooks.
- nbdiff compare notebooks in a terminal-friendly way
- nbmerge three-way merge of notebooks with automatic conflict resolution
- nbdiff-web shows you a rich rendered diff of notebooks
- nbmerge-web gives you a web-based three-way merge tool for notebooks
- nbshow present a single notebook in a terminal-friendly way
- jupyterlab-drawio
`https://github.com/QuantStack/jupyterlab-drawio`
A JupyterLab extension for embedding drawio / mxgraph.
- JupyterLab Top Bar
`https://github.com/jupyterlab-contrib/jupyterlab-topbar`
Monorepo to experiment with the top bar space in JupyterLab.
- jupyterlab-spellchecker
`https://github.com/jupyterlab-contrib/spellchecker`
A JupyterLab extension highlighting misspelled words in markdown cells within notebooks and in the text files.
- aquirdturtlecollapsibleheadings
`https://github.com/aquirdTurtle/Collapsible_Headings`
Make headings collapsible like the old Jupyter notebook extension and like Mathematica notebooks.
- jupyterlab-git
`https://github.com/jupyterlab/jupyterlab-git`
A JupyterLab extension for version control using Git
- jupyterlab-go-to-definition
Jump to definition of a variable or function in JupyterLab notebook and file editor.
- jupyterlab-jupytext
`https://jupytext.readthedocs.io/en/latest/`
Jupytext is a plugin for Jupyter that can save Jupyter notebooks as either:
- Markdown files (or MyST Markdown files, or R Markdown documents)
- Scripts in many languages
Common use cases for Jupytext are:
- Doing version control on Jupyter Notebooks
- Editing, merging or refactoring notebooks in your favorite text editor
- Applying Q&A checks on notebooks
2.1.2. Tricks
- Shell commands
Prefix a command with an exclamation mark (!) to run it in the system shell.
- View a list of shortcuts
  - Open up a Jupyter Notebook.
  - Activate the command mode (press Esc).
  - Press the H key.
  - See the list of all the shortcuts.
- Magic commands
Magics use the % prefix; view the list of all the available magic commands with %lsmagic.
- Measure cell execution time
Use %%time to get the elapsed time of running a cell of code.
- View the documentation of a method
Highlight the method and press Shift + Tab.
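For reference, the tricks above look like this in notebook cells; they are shown here as comments because `!` and the %-magics only run inside IPython/Jupyter:

```python
# Cell 1: shell command via "!"
# !ls -lh

# Cell 2: list all magic commands
# %lsmagic

# Cell 3: time a whole cell ("%%time" must be the first line of the cell)
# %%time
# total = sum(i * i for i in range(10**6))
```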
2.2. LIBSVM
A Library for Support Vector Machines, `https://www.csie.ntu.edu.tw/~cjlin/libsvm/` Interfaces and extensions to LIBSVM:
- Python `https://www.csie.ntu.edu.tw/~cjlin/libsvm/#download`
- Julia `https://github.com/mpastell/LIBSVM.jl` (SVR in Julia `https://github.com/madsjulia/SVR.jl`)
- CUDA `http://mklab.iti.gr/project/GPU-LIBSVM`
- GO `https://github.com/ewalker544/libsvm-go`
2.3. Python
2.3.1. Numpy
Scientific tools for Python. `https://numpy.org/`
2.3.2. Scipy
Open-source software for mathematics, science, and engineering. `https://www.scipy.org`
2.3.3. Scikit-learn
A set of Python modules for machine learning and data mining. `https://sklearn.org/`
2.3.4. Keras
Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. `https://keras.io/`
2.3.5. Tensorflow
Library for computation using data flow graphs for scalable machine learning `https://www.tensorflow.org`
2.4. Julia
2.4.1. JuliaML
One-stop-shop for learning models from data. It provides general abstractions and algorithms for modeling and optimization, implementations of common models, tools for working with datasets, and much more `https://juliaml.github.io/`
2.4.2. MLJ
A Machine Learning Framework for Julia `https://github.com/alan-turing-institute/MLJ.jl`
2.4.3. DataFrames.jl
DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool, especially for those coming to Julia from R or Python.
2.4.4. Turing.jl
Turing.jl is a Julia library for general-purpose probabilistic programming. Turing allows the user to write models using standard Julia syntax, and provides a wide range of sampling-based inference methods for solving problems across probabilistic machine learning, Bayesian statistics, and data science.
2.4.5. ScikitLearn.jl
Implements the popular scikit-learn interface and algorithms in Julia; it supports both models from the scikit-learn library (via PyCall) and from the Julia ecosystem.
2.4.6. FastAI
FastAI.jl is inspired by fastai, and is a repository of best practices for deep learning in Julia.
2.4.7. SmartTensors
SmartTensors is a general high-performance framework for unsupervised, supervised, and physics-informed machine learning and artificial intelligence (ML/AI).
SmartTensors includes a series of alternative ML/AI methods / algorithms (NMFk, NTFk, NTTk, SVR, etc.) coupled with constraints (sparsity, nonnegativity, physics, etc.).
2.5. R
2.5.1. proxy
Provides an extensible framework for the efficient calculation of auto- and cross-proximities, along with implementations of the most popular ones.
2.5.2. ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
2.5.3. tidyr
The goal of tidyr is to help you create tidy data. Tidy data is data where:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in vignette("tidy-data")
2.5.4. tm
A framework for text mining applications within R.
2.5.5. dplyr
A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
2.5.6. tidytext
Using tidy data principles can make many text mining tasks easier
`https://cran.r-project.org/web/packages/tidytext/index.html`
2.5.7. tidyverse
The tidyverse is an opinionated collection of R packages designed for data science
`https://cran.r-project.org/web/packages/tidyverse/index.html`
2.5.8. udpipe
Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit
2.5.9. corpus
Text corpus data analysis, with full support for international text (Unicode). Functions for reading data from newline-delimited 'JSON' files, for normalizing and tokenizing text, for searching for term occurrences, and for computing term occurrence frequencies, including n-grams.
2.5.10. DIY ML
- AtomAI `https://github.com/pycroscopy/atomai`
AtomAI is a Pytorch-based package for deep and machine learning analysis of microscopy data
3. To organize
3.1. Data mining and big data
3.1.1. Big data:
A set of methodologies used to capture, store, and process intense volumes of information from many sources (structured and unstructured) to speed up decision-making and seek competitive advantage.
- Volume: large quantities of data
- Velocity: analyzing the data within a satisfactory time
- Variety: different types of data
- Types of data
- Structured data
- Semi-structured data
- Unstructured data
- Structured data
Information stored in databases with a structured layout; a table is an example.
- Semi-structured data
Irregular data with an embedded structure; heterogeneous structure; easy to share over the internet.
Examples: XML (eXtensible Markup Language) and JSON (JavaScript Object Notation). JSON is lightweight for transferring information, taking fewer bytes than XML, which is relevant when transferring thousands of records.
- Unstructured data
Data without a pre-defined structure. Texts are examples, as are photos, videos, and voice. One challenge is extracting information from unstructured data.
Most of the data being generated is unstructured.
- Favorable environment for big data
The environment favorable to extracting information from data is due to:
- Low storage cost
- Increased processing power
- The need for fast and assertive decisions
- When designing a big data project, pay attention to:
- Volume
- More than 2.5 exabytes (source: IBM) of data per day. Approximately 90% of existing data was generated in the last 2 years.
- In the next 5 years, the volume is expected to double every year.
- Velocity
The speed of decision-making is vital for gaining competitiveness. Decisions need to be made in real time, e.g. for:
- Fraud detection
- Product offers
- Detection of serious disease
- Variety
Decision-making over both structured and unstructured data. More than 70% of data is unstructured.
- Veracity
- Vulnerability
- Security
- Visualization
TABLEAU and Power BI
- Value
Projects must generate value
Discover patterns using:
- Artificial intelligence
- Machine learning
- Statistics and database systems
in order to extract information and generate business value:
- Patterns of behavior; trends or prediction
- Steps of KDD (knowledge discovery in databases)
- Understanding the problem (business knowledge) -> data extraction (identifying the data sources, i.e., tables, attributes, sources such as the internet) -> modeling and transformation (the most important step), with fundamental actions: attribute selection, cleaning of inconsistent data, treatment of anomalies, conversion, transformation -> data mining: choice of the mining method and evaluation through metrics -> interpretation of the data (see the sketch below).
- Mining and machine learning (searching for patterns that make sense for solving problems) -> learning with models. Types of learning: supervised learning -> data samples with their respective classes; unsupervised learning -> no labeled data is given, and the method tries to find structure in the input data -> applications: grouping similar information and finding anomalies in the data, which leads to valuable information (e.g., fraud in financial transactions).
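A hedged sketch mapping the KDD stages above onto a scikit-learn Pipeline; the dataset, imputer, scaler, and decision-tree model are illustrative assumptions:

```python
# KDD stages as code: cleaning/transformation, then mining, then evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("clean", SimpleImputer(strategy="mean")),          # handle missing/inconsistent data
    ("transform", StandardScaler()),                    # conversion/transformation step
    ("mine", DecisionTreeClassifier(random_state=0)),   # data mining method
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))                       # evaluation through a metric
```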
3.1.2. Artificial neural networks
3.2. SQL
SELECT command
Fetches data.
- The SQL language is not the same across all DBMSs (SQL Server uses Transact-SQL).
- The primary key of a table is an attribute, or set of attributes, that uniquely identifies a row.
sqlitebrowser
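A minimal sketch of the SELECT command and a primary key, using Python's built-in sqlite3 module (the same SQLite engine that sqlitebrowser inspects); the users table and its rows are made up for illustration:

```python
# SELECT and primary keys with SQLite via Python's standard library.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users (name) VALUES (?)", [("Ana",), ("Bruno",)])

# SELECT fetches data; the primary key uniquely identifies each row.
for row in con.execute("SELECT id, name FROM users WHERE id = 1"):
    print(row)    # (1, 'Ana')
con.close()
```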