Basic Statistics
1) PowerPoint with Supporting Material
2) Variance, Standard Deviation
3) Probability Mass Function (Discrete Random Variables), Probability Density Function (Continuous Random Variables), and Cumulative Distribution Function
4) Summary of Continuous Distributions
5) Computing Percentiles and Quantiles of the Gaussian PDF Using Python
6) Variance, Standard Deviation, Covariance, Correlation and Covariance Matrix
7) Other Types of Correlation Coefficients, such as Spearman's rho or Kendall's tau
8) Correlation Does Not Imply Causation
9) What Can We Do When We Observe an Association? (CONFOUNDING FACTOR and SIMPSON'S PARADOX)
10) Chapter 11 - Introduction to Hypothesis Testing
1) Supporting Material
2) Variance, Standard Deviation

Gerald Keller, “Statistics for Management and Economics”, Cengage Learning, 11th Edition (March 13, 2017).

Peter Bruce, Andrew Bruce and Peter Gedeck, “Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python”, O'Reilly Media, 2nd Edition (2020).
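
A minimal NumPy sketch (sample values are hypothetical) illustrating the ddof distinction between population and sample estimators of variance and standard deviation:

import numpy as np

# Illustrative sample (hypothetical values).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.var(x))          # population variance (divides by n): 4.0
print(np.var(x, ddof=1))  # sample variance (divides by n-1): ~4.571
print(np.std(x))          # population standard deviation: 2.0
print(np.std(x, ddof=1))  # sample standard deviation: ~2.138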
3) Probability Mass Function (Discrete Random Variables); Probability Density Function (Continuous Random Variables); and Cumulative Distribution Function

T. T. Soong, “Fundamentals of Probability and Statistics for Engineers”, Wiley, 1st Edition (March 26, 2004).
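
A short scipy.stats sketch (parameters chosen only for illustration) contrasting the three functions: a PMF for a discrete variable, and a PDF plus CDF for a continuous one:

from scipy.stats import binom, norm

# Discrete: Binomial(n=10, p=0.5).
print(binom.pmf(3, n=10, p=0.5))  # PMF: P(X = 3) ~ 0.117
print(binom.cdf(3, n=10, p=0.5))  # CDF: P(X <= 3) ~ 0.172

# Continuous: standard normal.
print(norm.pdf(0.0))  # PDF: density at 0, ~0.3989 (not a probability)
print(norm.cdf(0.0))  # CDF: P(X <= 0) = 0.5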
4) Summary of Continuous Distributions

T. T. Soong, “Fundamentals of Probability and Statistics for Engineers”, Wiley, 1st Edition (March 26, 2004).
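
A brief sketch (distributions and parameters chosen only for illustration) evaluating a few common continuous densities with scipy.stats:

from scipy.stats import uniform, expon, norm

x = 0.5
print(uniform.pdf(x))                   # Uniform on [0, 1]: density 1.0
print(expon.pdf(x, scale=1.0))          # Exponential, rate 1: e**-0.5 ~ 0.607
print(norm.pdf(x, loc=0.0, scale=1.0))  # Standard normal: ~0.352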
5) Computing Percentiles and Quantiles of the Gaussian PDF Using Python


# -*- coding: utf-8 -*-
"""
Created on Thu Jun 4 17:11:44 2020
@author: Eduardo Sodre
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Mean and standard deviation of the Gaussian.
mu = 0.01
sigma = 0.134

# ppf is the percent point function (inverse CDF): these are the
# 0.01% and 99.99% quantiles, used here as plotting limits.
var_min = norm.ppf(0.0001, loc=mu, scale=sigma)
var_max = norm.ppf(0.9999, loc=mu, scale=sigma)
print(var_min, var_max)

# cdf gives P(X <= -0.3333) for X ~ N(mu, sigma**2).
prob_menor_que = norm.cdf(-0.3333, loc=mu, scale=sigma)
print(prob_menor_que)

# Evaluate the Gaussian density manually on a grid of 50 points.
bins = np.linspace(var_min, var_max, num=50)
var11 = 1 / (sigma * np.sqrt(2 * np.pi))  # normalizing constant
vet11 = ((bins - mu)**2) / (sigma**2)     # squared standardized distance
vet21 = var11 * np.exp(-0.5 * vet11)      # PDF values on the grid

plt.plot(bins, vet21, linewidth=2, color='r')
plt.grid(True)
plt.show()
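
As a quick sanity check (not part of the original script), scipy's built-in density can be compared against the manual computation above; the two should agree to floating-point precision:

# Cross-check: norm.pdf should reproduce the manually computed PDF values.
vet_scipy = norm.pdf(bins, loc=mu, scale=sigma)
print(np.allclose(vet21, vet_scipy))  # expected: True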
6) Variance, Standard Deviation, Covariance, Correlation and Covariance Matrix
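
A minimal NumPy sketch (data are hypothetical) computing the sample covariance matrix and the Pearson correlation matrix for two variables:

import numpy as np

# Two hypothetical, positively related variables.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(np.cov(x, y))       # 2x2 sample covariance matrix (ddof=1)
print(np.corrcoef(x, y))  # 2x2 correlation matrix; off-diagonal is Pearson's r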

7) Other Types of Correlation Coefficients, such as Spearman's rho or Kendall's tau

Peter Bruce, Andrew Bruce and Peter Gedeck, “Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python”, O'Reilly Media, 2nd Edition (2020).
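
A short scipy.stats comparison (hypothetical data) of the three coefficients on a monotonic but non-linear relationship, where the rank-based measures reach 1 but Pearson's r does not:

from scipy.stats import pearsonr, spearmanr, kendalltau

x = [1, 2, 3, 4, 5, 6]
y = [v**3 for v in x]  # monotonic, but far from a straight line

print(pearsonr(x, y))    # Pearson's r < 1: measures linearity
print(spearmanr(x, y))   # Spearman's rho = 1.0: ranks agree perfectly
print(kendalltau(x, y))  # Kendall's tau = 1.0: every pair is concordant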
8) Correlation Does Not Imply Causation
We saw in the last chapter how Pearson’s correlation coefficient measures how close the points on a scatter-plot are to a straight line. When considering English hospitals conducting children’s heart surgery in the 1990s, and plotting the number of cases against their survival, the high correlation showed that bigger hospitals were associated with lower mortality. But we could not conclude that bigger hospitals caused the lower mortality.
This cautious attitude has a long pedigree. When Karl Pearson’s newly developed correlation coefficient was being discussed in the journal Nature in 1900, a commentator warned that ‘correlation does not imply causation’. In the succeeding century this phrase has been a mantra repeatedly uttered by statisticians when confronted by claims based on simply observing that two things tend to vary together. There is even a website that automatically generates idiotic associations, such as the delightful correlation of 0.96 between the annual per-capita consumption of mozzarella cheese in the US between 2000 and 2009, and the number of civil engineering doctorates awarded in each of those years.
There seems to be a deep human need to explain things that happen in terms of simple cause-effect relationships - I am sure we could all construct a good story about all those new engineers gorging on pizzas. There is even a word for this tendency to perceive a meaningful connection between what are actually unrelated or random events - apophenia - with the most extreme case being when simple misfortune or bad luck is blamed on others' ill-will or even witchcraft.
Unfortunately, or perhaps fortunately, the world is a bit more complicated than simple witchcraft. And the first complication comes in trying to work out what we mean by ‘cause’.
What Is ‘Causation’ Anyway?
Causation is a deeply contested subject, which is perhaps surprising as it seems rather simple in real life: we do something, and that leads to something else. I jammed my thumb in the car door, and now it hurts.
But how do we know that my thumb would not have hurt anyway? Perhaps we can think of what is known as a counter-factual. If I hadn’t jammed my thumb in the door, then my thumb would not hurt. But this will always be an assumption, requiring the rewriting of history, since we can never really know for certain what I might have felt (although in this case I might be fairly confident that my thumb would not suddenly start hurting of its own accord).
This gets even trickier when we allow for the unavoidable variability that underlies everything interesting in real life. For example, the medical community now agrees that smoking cigarettes causes lung cancer, but it took decades for doctors to come to this conclusion. Why did it take so long? Because most people who smoke do not get lung cancer. And some people who do not smoke do get lung cancer. All we can say is that you are more likely to get lung cancer if you smoke than if you do not smoke, which is one reason why it took so long for laws to be enacted to restrict smoking.
So our ‘statistical’ idea of causation is not strictly deterministic. When we say that X causes Y, we do not mean that every time X occurs, then Y will too. Or that Y will only occur if X occurs. We simply mean that if we intervene and force X to occur, then Y tends to happen more often. So we can never say that X caused Y in a specific case, only that X increases the proportion of times that Y happens. This has two vital consequences for what we have to do if we want to know what causes what. First, in order to infer causation with real confidence, we ideally need to intervene and perform experiments. Second, since this is a statistical or stochastic world, we need to intervene more than once in order to amass evidence.
9) What Can We Do When We Observe an Association? (CONFOUNDING FACTOR and SIMPSON'S PARADOX)
This is where some statistical imagination is called for, and it can be an enjoyable exercise to guess the reasons why an observed correlation might be spurious. Some are fairly easy: the close correlation between mozzarella consumption and civil engineers is presumably because both measures have been increasing over time. Similarly any correlation between ice-cream sales and drownings is due to both being influenced by the weather. When an apparent association between two outcomes might be explained by some observed common factor that influences both, this common cause is known as a confounder: both the year and weather are potential confounders since they can be recorded and considered in an analysis.
The simplest technique for dealing with confounders is to look at the apparent relationship within each level of the confounder. This is known as adjustment, or stratification. So for example we could explore the relationship between drownings and ice-cream sales on days with roughly the same temperature.
But adjustment can produce some paradoxical results, as shown by an analysis of acceptance rates by gender at Cambridge University. In 1996 the overall acceptance rate to study five academic subjects in Cambridge was slightly higher for men (24% of 2,470 applicants) than it was for women (23% of 1,184 applicants). The subjects were all in what we today call STEM (science, technology, engineering and medicine) subjects, which have historically been studied predominantly by men. Was this a case of gender discrimination?
Here are some data so you can test Simpson's Paradox for yourself.
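Since the data file itself is not reproduced here, the following minimal sketch uses hypothetical admission counts (invented for illustration, not the actual Cambridge figures) to show how stratification can reverse an aggregate comparison:

# Hypothetical admission counts: (applicants, accepted) per subject and gender.
data = {
    "Subject A": {"men": (300, 180), "women": (100, 62)},  # 60% vs 62%
    "Subject B": {"men": (100, 10),  "women": (300, 36)},  # 10% vs 12%
}

totals = {"men": [0, 0], "women": [0, 0]}
for subject, groups in data.items():
    for gender, (applied, accepted) in groups.items():
        totals[gender][0] += applied
        totals[gender][1] += accepted
        print(f"{subject}, {gender}: {accepted / applied:.1%}")

# Women have the higher acceptance rate within each subject, yet men have
# the higher overall rate, because far more women applied to the more
# competitive subject: Simpson's paradox.
for gender, (applied, accepted) in totals.items():
    print(f"Overall, {gender}: {accepted / applied:.1%}")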
10) Chapter 11 - Introduction to Hypothesis Testing
An A/B test (see “A/B Testing” on page 88) is typically constructed with a hypothesis in mind. For example, the hypothesis might be that price B produces higher profit. Why do we need a hypothesis? Why not just look at the outcome of the experiment and go with whichever treatment does better?
The answer lies in the tendency of the human mind to underestimate the scope of natural random behavior. One manifestation of this is the failure to anticipate extreme events, or so-called “black swans” (see “Long-Tailed Distributions” on page 73). Another manifestation is the tendency to misinterpret random events as having patterns of some significance. Statistical hypothesis testing was invented as a way to protect researchers from being fooled by random chance.
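To make this concrete, here is a minimal permutation-test sketch, one common resampling approach to hypothesis testing, using hypothetical per-session profits for prices A and B (all numbers invented for illustration): how often does random shuffling of the labels alone produce a difference as large as the one observed?

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-session profits under prices A and B.
a = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.3, 10.0, 10.4])
b = np.array([10.9, 11.3, 10.7, 11.5, 11.0, 10.8, 11.2, 11.4])

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

# Permutation test: shuffle the pooled data and recompute the difference.
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[len(a):].mean() - pooled[:len(a)].mean()
    if diff >= observed:
        count += 1

# One-sided p-value: share of shuffles at least as extreme as observed.
print(f"observed difference: {observed:.3f}")
print(f"p-value: {count / n_perm:.4f}")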

Gerald Keller, “Statistics for Management and Economics”, Cengage Learning, 11th Edition (March 13, 2017).
