Data science reference guide
Table of Contents
- 1. Data science
- 1.1. Machine learning
- 1.1.1. Basic concepts of machine learning and data analysis and statistical concepts like expectation values, variance, covariance, correlation functions and errors;
- 1.1.2. Estimation of errors using cross-validation, blocking, bootstrapping and jackknife methods;
- 1.1.3. Optimization of functions
- 1.1.4. Linear Regression and Logistic Regression;
- 1.1.5. Experimental design
- 1.1.6. Predictive modelling
- 1.1.7. Optimization
- 1.1.8. Clustering and feature selection
- 1.1.9. Neural networks and deep learning, SVMs
- 1.1.10. Boltzmann machines;
- 1.1.11. Causal inference
- 1.1.12. Decision trees and random forests
- 1.1.13. Regression and classification
- 1.1.14. Distance measures
- 1.2. Techniques - Algorithms
- 1.2.1. Linear regression
- 1.2.2. Logistic regression
- 1.2.3. Linear SVM and Kernel SVM
- 1.2.4. Trees and ensemble trees
- 1.2.5. Neural networks
- 1.2.6. K-means/k-modes, GMM (Gaussian Mixture Model) Clustering
- 1.2.7. DBSCAN
- 1.2.8. Hierarchical clustering
- 1.2.9. PCA, SVD and LDA
- 1.2.10. Feedforward neural networks: multilayered logistic regression classifiers, with many layers separated by non-linearities
- 1.2.11. Convolutional neural networks: vision-based machine learning, image classification
- 1.3. Big data
- 1.4. Elements of AI
- 2. Tools
- 3. To organize
1. Data science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. `https://en.wikipedia.org/wiki/Data_science`
1.1. Machine learning
1.1.1. Basic concepts of machine learning and data analysis and statistical concepts like expectation values, variance, covariance, correlation functions and errors;
1.1.2. Estimation of errors using cross-validation, blocking, bootstrapping and jackknife methods;
1.1.3. Optimization of functions
1.1.4. Linear Regression and Logistic Regression;
1.1.5. Experimental design
1.1.6. Predictive modelling
1.1.7. Optimization
1.1.8. Clustering and feature selection
- Curse of dimensionality
- Dimensionality reductions, from PCA to clustering
- MDS
Multidimensional scaling is a visual representation of distances or dissimilarities between sets of objects.
PCA vs MDS vs t-SNE
t-SNE discovers clusters of same-type cells, while PCA and MDS can fail to expose interesting data structures (see the sketch after this list). `https://orangedatamining.com/blog/2021/2021-06-17-pca-mds-tsne/`
- Bias-variance tradeoff
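A quick way to see this comparison in practice is a minimal sketch with scikit-learn; the digits dataset and the subsampling are assumptions made for the example, not from the source above:

```python
# Compare PCA, MDS and t-SNE embeddings on scikit-learn's digits dataset.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

X, y = load_digits(return_X_y=True)
X = X[:500]  # subsample: MDS and t-SNE scale poorly with the number of samples

emb_pca = PCA(n_components=2).fit_transform(X)    # linear projection
emb_mds = MDS(n_components=2).fit_transform(X)    # preserves pairwise distances
emb_tsne = TSNE(n_components=2).fit_transform(X)  # preserves local neighborhoods

# Plotting emb_tsne colored by y typically shows well-separated digit clusters,
# while the linear PCA projection often overlaps them.
```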
1.1.9. Neural networks and deep learning, SVMs
- Convolutional Neural Networks
Designed for processing structured arrays such as images. Used in computer vision and natural language processing. A CNN is a feed-forward neural network, often 20-30 layers deep, with a special kind of layer called the convolutional layer (see the sketch after this list).
- Physics-informed neural networks (PINN)
- PINNs have superior approximation and generalization capabilities, which has made them popular for solving high-dimensional problems and partial differential equations (PDEs); they have been applied in weather, healthcare, and manufacturing.
- Recurrent Neural Networks and Autoencoders
- Statistical languages such as R or Python
- Scripting languages such as Python, sh, PHP, Perl
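To make the CNN bullet above concrete, here is a hedged sketch of a small convolutional network in Keras; the input shape, layer sizes, and 10-class output are illustrative assumptions, not taken from the text:

```python
# A small convolutional network in Keras (assumes TensorFlow is installed).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # e.g. grayscale images
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolutional layer
    layers.MaxPooling2D(pool_size=2),                     # downsample feature maps
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # 10-class classifier
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```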
1.1.10. Boltzmann machines;
1.1.11. Causal inference
Machine learning concepts
1.1.12. Decision trees and random forests
1.1.13. Regression and classification
1.2. Techniques - Algorithms
To get a grip on what a model is capable of:
- Pick the right accuracy metric(s)
- Understand how the model will be used
- Visualize the error metric(s)
- Choose a relevant accuracy benchmark
- Develop custom metric(s) that speak directly to the model context
- Investigate relevant sub-populations within the dataset
1.2.1. Linear regression
A linear approach to modelling the relationship between a scalar response and one or more explanatory variables, as in the sketch below.
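A minimal sketch using scikit-learn's standard LinearRegression API; the synthetic one-variable dataset and its coefficients are made up for illustration:

```python
# Fit a line to noisy synthetic data and recover its slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one explanatory variable
y = 2.5 * X[:, 0] + 1.0 + rng.normal(0, 1, 100)    # scalar response with noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)               # approximately [2.5] and 1.0
```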
1.2.2. Logistic regression
1.2.3. Linear SVM and Kernel SVM
Linear Support Vector Machine (SVM) and Kernel SVM; the sketch below contrasts the two.
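A small sketch contrasting the two in scikit-learn; the moons dataset is an assumption, chosen because it is not linearly separable:

```python
# Linear vs kernel SVM on a dataset with a non-linear class boundary.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # linear decision boundary
kernel_svm = SVC(kernel="rbf").fit(X, y)      # non-linear (RBF kernel) boundary

# The kernel SVM typically scores higher here because the classes interleave.
print(linear_svm.score(X, y), kernel_svm.score(X, y))
```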
1.2.4. Trees and ensemble trees
1.2.5. Neural networks
1.2.6. K-means/k-modes, GMM (Gaussian Mixture Model) Clustering
1.2.7. DBSCAN
- Advantages of DBSCAN
- Is great at separating clusters of high density versus clusters of low density within a given dataset.
- Is great with handling outliers within the dataset.
- Disadvantages of DBSCAN
- Does not work well when dealing with clusters of varying densities: a single density threshold (eps) separates high-density regions from low-density regions well, but cannot accommodate clusters whose densities differ from each other.
- Has problems with high-dimensional data.
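A short sketch illustrating both points with scikit-learn's DBSCAN; the synthetic high- and low-density blobs and the eps/min_samples values are assumptions for the example:

```python
# DBSCAN on data mixing a high-density and a low-density region.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))   # high-density cluster
sparse = rng.normal(loc=5.0, scale=1.5, size=(30, 2))   # low-density cluster
X = np.vstack([dense, sparse])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Label -1 marks outliers; with a single eps, many points of the sparse
# cluster end up as noise, showing the varying-density weakness.
print(set(labels))
```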
1.2.8. Hierarchical clustering
1.2.9. PCA, SVD and LDA
- PCA - Unsupervised method to understand global properties
Use Scipy, scikit-learn
- Least squares and polynomial fitting for datasets with low dimensions
Use NumPy, SciPy
- Constrained linear regression, so that the weights do not misbehave
Use scikit-learn
- k-means, an unsupervised clustering algorithm (an expectation-maximization-style algorithm); see the sketch after this list
Use scikit-learn
- Logistic regression, nonlinearity (sigmoid function), classification
Use scikit-learn
- SVM, support vector machines, linear models -> Loss function
Use scikit-learn
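A minimal k-means sketch with scikit-learn, as suggested in the list above; the blobs dataset and k=3 are illustrative assumptions:

```python
# Cluster synthetic blob data with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)    # learned centroids
print(km.labels_[:10])        # cluster assignment per sample
```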
1.2.10. Feedforward neural networks: multilayered logistic regression classifiers, with many layers separated by non-linearities
Use scikit-learn (neural networks module) or Keras, as in the sketch below.
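A hedged sketch using scikit-learn's MLPClassifier, one of the tools suggested above; the digits dataset and the hidden-layer sizes are assumptions for the example:

```python
# A feedforward (multi-layer) classifier with scikit-learn's neural network module.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers separated by non-linearities (ReLU): stacked
# logistic-regression-like units, as described above.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=500, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```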
1.2.11. Convolutional neural networks: vision-based machine learning, image classification
1.3. Big data
1.3.1. Data Preprocessing
Normalization: there are many methods for data normalization, e.g. min-max normalization, z-score normalization, and normalization by decimal scaling.
- Min-max normalization (range normalization) performs a linear transformation on the original data, mapping values into \([0, 1]\):
\[ v'=\frac{v - \min}{\max - \min} \]
- Z-score normalization (zero-mean normalization, standard score normalization), the values for an attribute \(A\) are normalized by the mean and standard deviation of \(A\)
\[ v'=\frac{v - \mu}{\sigma} \]
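Both formulas translate directly into NumPy; the sample vector below is made up for illustration:

```python
# Direct NumPy translation of the two normalization formulas above.
import numpy as np

v = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

v_minmax = (v - v.min()) / (v.max() - v.min())   # maps values into [0, 1]
v_zscore = (v - v.mean()) / v.std()              # zero mean, unit variance

print(v_minmax)   # [0.   0.25 0.5  0.75 1.  ]
print(v_zscore)
```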
1.4. Elements of AI
Predicting the stock market by fitting a curve to past data about stock prices -> What kind of AI is this? Fitting a simple curve is not really AI. But there are so many different curves to choose from, even when there is a lot of data to constrain them, that one needs machine learning/AI to get useful results.
A GPS navigation system for finding the fastest route. The signal processing and geometry used to determine the coordinates aren't AI, but providing good suggestions for navigation (shortest/fastest routes) is AI, especially if variables such as traffic conditions are taken into account.
Photo editing features such as color balance, contrast, and so on are neither adaptive nor autonomous, but the developers of the applications may use some AI to automatically tune the filters.
Machine learning:
Systems that improve their performance on a given task with more and more experience or data.
Euler diagram of the nested fields: machine learning* sits inside AI, which sits inside Computer Science.
*Deep learning -> a subfield of machine learning, distinguished by the complexity of its mathematical models
- Definition of AI `cool things that computers can't do`
- Machine imitating intelligent human behavior
- Autonomous and adaptive systems
- State space -> Set of possible solutions
- Transitions -> possible moves between one state and another
- Costs -> Transitions can differ in ways that make some transitions more preferable or cheaper.
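These three notions can be made concrete as a weighted graph plus a least-cost search; everything in this sketch (the states, the costs, the Dijkstra-style search) is an illustrative assumption, not from the source:

```python
# States, transitions and costs as a weighted graph, searched for least cost.
import heapq

transitions = {                      # state -> [(next_state, cost), ...]
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}

def cheapest_cost(start, goal):
    """Return the minimum total transition cost from start to goal."""
    queue, seen = [(0.0, start)], set()
    while queue:
        cost, state = heapq.heappop(queue)
        if state == goal:
            return cost
        if state in seen:
            continue
        seen.add(state)
        for nxt, step in transitions[state]:
            heapq.heappush(queue, (cost + step, nxt))
    return float("inf")

print(cheapest_cost("A", "D"))   # 3.0 via A -> B -> C -> D
```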
1.4.1. Data science -> a recent umbrella term
Covers:
1.4.2. Euler diagram -> Relates to Venn Diagrams
1.4.3. Natural Language Processing
Applications:
- Search auto-correct and autocomplete: the driving engine behind search autocomplete is the language model.
- Language translation: machine translation is the procedure of automatically converting text from one language to another while keeping the meaning intact.
- Social media monitoring
NLP techniques are used by companies to analyse social media posts and know what customers think about their products.
- Chatbots: help companies achieve the goal of a smooth customer experience
- Survey analysis
- Targeted advertising
- Hiring and recruitment
- Voice assistants
- Grammar checkers
- Email filtering
2. Tools
2.1. Jupyter
Project Jupyter is a project and community whose goal is to "develop open-source software, open-standards, and services for interactive computing across dozens of programming languages".
2.1.1. Extensions:
- nbdime
`https://github.com/jupyter/nbdime`
Jupyter Notebook Diff and Merge tools
nbdime provides tools for diffing and merging of Jupyter Notebooks.
- nbdiff compare notebooks in a terminal-friendly way
- nbmerge three-way merge of notebooks with automatic conflict resolution
- nbdiff-web shows you a rich rendered diff of notebooks
- nbmerge-web gives you a web-based three-way merge tool for notebooks
- nbshow present a single notebook in a terminal-friendly way
- jupyterlab-drawio
`https://github.com/QuantStack/jupyterlab-drawio`
A JupyterLab extension for embedding drawio / mxgraph.
- JupyterLab Top Bar
`https://github.com/jupyterlab-contrib/jupyterlab-topbar`
Monorepo to experiment with the top bar space in JupyterLab.
- jupyterlab-spellchecker
`https://github.com/jupyterlab-contrib/spellchecker`
A JupyterLab extension highlighting misspelled words in markdown cells within notebooks and in the text files.
- aquirdturtlecollapsibleheadings
`https://github.com/aquirdTurtle/Collapsible_Headings`
Make headings collapsible like the old Jupyter notebook extension and like Mathematica notebooks.
- jupyterlab-git
`https://github.com/jupyterlab/jupyterlab-git`
A JupyterLab extension for version control using Git
- jupyterlab-go-to-definition
Jump to definition of a variable or function in JupyterLab notebook and file editor.
- jupyterlab-jupytext
`https://jupytext.readthedocs.io/en/latest/`
Jupytext is a plugin for Jupyter that can save Jupyter notebooks as either:
- Markdown files (or MyST Markdown files, or R Markdown documents)
- Scripts in many languages
Common use cases for Jupytext are:
- Doing version control on Jupyter Notebooks
- Editing, merging or refactoring notebooks in your favorite text editor
- Applying Q&A checks on notebooks
2.1.2. Tricks
- Shell commands
Prefix a command with an exclamation mark (!) to run it in the system shell.
- View a list of shortcuts
  - Open up a Jupyter Notebook.
  - Activate the command mode (press Esc).
  - Press the H key.
  - See the list of all the shortcuts.
- Magic commands
Magics use the % prefix; view the list of all the available magic commands with %lsmagic.
- Measure cell execution time
Use %%time to get the elapsed time of running a cell of code.
- View the documentation of a method
Highlight the method and press Shift + Tab.
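For reference, the tricks above look like this in notebook cells; they are shown here as comments because `!` and the %-magics only run inside IPython/Jupyter:

```python
# Cell 1: shell command via "!"
# !ls -lh

# Cell 2: list all magic commands
# %lsmagic

# Cell 3: time a whole cell ("%%time" must be the first line of the cell)
# %%time
# total = sum(i * i for i in range(10**6))
```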
2.2. LIBSVM
A Library for Support Vector Machines, `https://www.csie.ntu.edu.tw/~cjlin/libsvm/` Interfaces and extensions to LIBSVM:
- Python `https://www.csie.ntu.edu.tw/~cjlin/libsvm/#download`
- Julia `https://github.com/mpastell/LIBSVM.jl` (SVR in Julia `https://github.com/madsjulia/SVR.jl`)
- CUDA `http://mklab.iti.gr/project/GPU-LIBSVM`
- GO `https://github.com/ewalker544/libsvm-go`
2.3. Python
2.3.1. Numpy
Scientific tools for Python. `https://numpy.org/`
2.3.2. Scipy
Open-source software for mathematics, science, and engineering. `https://www.scipy.org`
2.3.3. Scikit-learn
A set of Python modules for machine learning and data mining. `https://sklearn.org/`
2.3.4. Keras
Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. `https://keras.io/`
2.3.5. Tensorflow
Library for computation using data flow graphs for scalable machine learning `https://www.tensorflow.org`
2.4. Julia
2.4.1. JuliaML
One-stop-shop for learning models from data. It provides general abstractions and algorithms for modeling and optimization, implementations of common models, tools for working with datasets, and much more `https://juliaml.github.io/`
2.4.2. MLJ
A Machine Learning Framework for Julia `https://github.com/alan-turing-institute/MLJ.jl`
2.4.3. DataFrames.jl
DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool, especially for those coming to Julia from R or Python.
2.4.4. Turing.jl
Turing.jl is a Julia library for general-purpose probabilistic programming. Turing allows the user to write models using standard Julia syntax, and provides a wide range of sampling-based inference methods for solving problems across probabilistic machine learning, Bayesian statistics, and data science.
2.4.5. ScikitLearn.jl
Implements the popular scikit-learn interface and algorithms in Julia; it supports both models from the scikit-learn library (via PyCall) and from the Julia ecosystem.
2.4.6. FastAI
FastAI.jl is inspired by fastai, and is a repository of best practices for deep learning in Julia.
2.4.7. SmartTensors
SmartTensors is a general high-performance framework for unsupervised, supervised, and physics-informed machine learning and artificial intelligence (ML/AI).
SmartTensors includes a series of alternative ML/AI methods / algorithms (NMFk, NTFk, NTTk, SVR, etc.) coupled with constraints (sparsity, nonnegativity, physics, etc.).
2.5. R
2.5.1. proxy
Provides an extensible framework for the efficient calculation of auto- and cross-proximities, along with implementations of the most popular ones.
2.5.2. ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
2.5.3. tidyr
The goal of tidyr is to help you create tidy data. Tidy data is data where:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in vignette("tidy-data")
2.5.4. tm
A framework for text mining applications within R.
2.5.5. dplyr
A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
2.5.6. tidytext
Using tidy data principles can make many text mining tasks easier
`https://cran.r-project.org/web/packages/tidytext/index.html`
2.5.7. tidyverse
The tidyverse is an opinionated collection of R packages designed for data science
`https://cran.r-project.org/web/packages/tidyverse/index.html`
2.5.8. udpipe
Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit
2.5.9. corpus
Text corpus data analysis, with full support for international text (Unicode). Functions for reading data from newline-delimited 'JSON' files, for normalizing and tokenizing text, for searching for term occurrences, and for computing term occurrence frequencies, including n-grams.
2.5.10. DIY ML
- AtomAI `https://github.com/pycroscopy/atomai`
AtomAI is a Pytorch-based package for deep and machine learning analysis of microscopy data
3. To organize
3.1. Data mining and big data
3.1.1. Big data:
A set of methodologies used to capture, store, and process intense volumes of information from many sources (structured and unstructured) to speed up decision-making and seek competitive advantage.
- Volume: large quantities of data
- Velocity: analyzing the data within a satisfactory time
- Variety: different types of data
- Types of data
- Structured data
- Semi-structured data
- Unstructured data
- Structured data
Information stored in databases with a structured layout; a table is an example.
- Semi-structured data
Irregular data with an embedded structure; heterogeneous structure; easy to share over the internet.
Examples: XML (eXtensible Markup Language) and JSON (JavaScript Object Notation). JSON is lightweight for transferring information, taking fewer bytes than XML, which is relevant when transferring thousands of records.
- Unstructured data
Data without a pre-defined structure. Texts are examples, as are photos, videos, and voice. One challenge is extracting information from unstructured data.
Most of the data being generated is unstructured.
- Favorable environment for big data
The environment favorable to extracting information from data is due to:
- Low storage cost
- Increased processing power
- The need for fast and assertive decisions
- When designing a big data project, pay attention to:
- Volume
- More than 2.5 exabytes (source: IBM) of data per day. Approximately 90% of existing data was generated in the last 2 years.
- In the next 5 years, the volume is expected to double every year.
- Velocity
The speed of decision-making is vital for gaining competitiveness. Decisions need to be made in real time, e.g. for:
- Fraud detection
- Product offers
- Detection of serious disease
- Variety
Decision-making over both structured and unstructured data. More than 70% of data is unstructured.
- Veracity
- Vulnerability
- Security
- Visualization
TABLEAU and Power BI
- Value
Projects must generate value
Discover patterns using:
- Artificial intelligence
- Machine learning
- Statistics and database systems
in order to extract information and generate business value:
- Patterns of behavior; trends or prediction
- Steps of KDD (knowledge discovery in databases)
- Understanding the problem (business knowledge) -> data extraction (identifying the data sources, i.e., tables, attributes, sources such as the internet) -> modeling and transformation (the most important step), with fundamental actions: attribute selection, cleaning of inconsistent data, treatment of anomalies, conversion, transformation -> data mining: choice of the mining method and evaluation through metrics -> interpretation of the data (see the sketch below).
- Mining and machine learning (searching for patterns that make sense for solving problems) -> learning with models. Types of learning: supervised learning -> data samples with their respective classes; unsupervised learning -> no labeled data is given, and the method tries to find structure in the input data -> applications: grouping similar information and finding anomalies in the data, which leads to valuable information (e.g., fraud in financial transactions).
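A hedged sketch mapping the KDD stages above onto a scikit-learn Pipeline; the dataset, imputer, scaler, and decision-tree model are illustrative assumptions:

```python
# KDD stages as code: cleaning/transformation, then mining, then evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("clean", SimpleImputer(strategy="mean")),          # handle missing/inconsistent data
    ("transform", StandardScaler()),                    # conversion/transformation step
    ("mine", DecisionTreeClassifier(random_state=0)),   # data mining method
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))                       # evaluation through a metric
```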
3.1.2. Artificial neural networks
3.2. SQL
SELECT command
Fetches data.
- The SQL language is not the same across all DBMSs (SQL Server uses Transact-SQL).
- The primary key of a table is an attribute, or set of attributes, that uniquely identifies a row.
sqlitebrowser
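A minimal sketch of the SELECT command and a primary key, using Python's built-in sqlite3 module (the same SQLite engine that sqlitebrowser inspects); the users table and its rows are made up for illustration:

```python
# SELECT and primary keys with SQLite via Python's standard library.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users (name) VALUES (?)", [("Ana",), ("Bruno",)])

# SELECT fetches data; the primary key uniquely identifies each row.
for row in con.execute("SELECT id, name FROM users WHERE id = 1"):
    print(row)    # (1, 'Ana')
con.close()
```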