Merged
13 changes: 13 additions & 0 deletions .github/workflows/SpellCheck.yml
@@ -0,0 +1,13 @@
name: Spell Check

on: [pull_request]

jobs:
  typos-check:
    name: Spell Check with Typos
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Actions Repository
        uses: actions/checkout@v4
      - name: Check spelling
        uses: crate-ci/typos@master
6 changes: 3 additions & 3 deletions EDA/bivariate-julia.qmd
@@ -150,7 +150,7 @@ plot(p1, p2, layout=(@layout [a b]))#, size=fig_size_2)

### Histograms across groups

The density plot is, perhaps, less familiar than a histogram, but makes a better graphic when comparing two or more distributions, as the individual graphics don't overlap. There are attempts to use the histogram, and they can be effective. In @fig-xword-histograms, taken from an article on [fivethirtyeight.com](https://fivethirtyeight.com/features/dan-feyer-american-crossword-puzzle-tournament/) on the time to solve cross word puzzles broken out by day of week, we see a stacked dotplot, which is visually very similar to a histogram, presented for each day of the week. The use of color allows one to distinguish the day, but the overalpping aspect of the graphic inhibits part of the distribution of most days, and only effectively shows the longer tails as the week progresses. In `StatsPlots`, the `grouped hist` function can produce similar graphics.
The density plot is, perhaps, less familiar than a histogram, but makes a better graphic when comparing two or more distributions, as the individual graphics don't overlap. There are attempts to use the histogram, and they can be effective. In @fig-xword-histograms, taken from an article on [fivethirtyeight.com](https://fivethirtyeight.com/features/dan-feyer-american-crossword-puzzle-tournament/) on the time to solve crossword puzzles broken out by day of week, we see a stacked dotplot, which is visually very similar to a histogram, presented for each day of the week. The use of color allows one to distinguish the day, but the overlapping aspect of the graphic inhibits part of the distribution of most days, and only effectively shows the longer tails as the week progresses. In `StatsPlots`, the `groupedhist` function can produce similar graphics.

::: {#fig-xword-histograms}

@@ -266,7 +266,7 @@ vline!([mean(l)], linestyle=:dash)
hline!([mean(w)], linestyle=:dash)
```

@fig-scatterplot-l-w shows the length and width data in a scatter plot. Jittering would be helpful to show all the data, as it has been discretized and many points are overplotte. The dashed lines are centered at the means of the respective variables. If the mean is the center of a single variable, then $(\bar{x}, \bar{y})$ may be thought of as the center of the paired data. Thinking of the dashed lines meeting at the origin, four quadrants are formed. The correlation can be viewed as a measure of how much the data sits in opposite quadrants. In the figure, there seems to be more data in quadrants I and III then II and IV, which sugests a *positive* correlation, as confirmed numerically.
@fig-scatterplot-l-w shows the length and width data in a scatter plot. Jittering would be helpful to show all the data, as it has been discretized and many points are overplotted. The dashed lines are centered at the means of the respective variables. If the mean is the center of a single variable, then $(\bar{x}, \bar{y})$ may be thought of as the center of the paired data. Thinking of the dashed lines meeting at the origin, four quadrants are formed. The correlation can be viewed as a measure of how much the data sits in opposite quadrants. In the figure, there seems to be more data in quadrants I and III than II and IV, which suggests a *positive* correlation, as confirmed numerically.

By writing the correlation in terms of $z$-scores, the product in that formula is *positive* if the point is in quadrant I or III and negative if in II or IV. So, for example, a big positive number suggests data is concentrated in quadrants I and III or that there is a strong association between the variables. The scaling by the standard deviations leaves the mathematical constraint that the correlation is between $-1$ and $1$.
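The $z$-score description of the correlation can be sketched directly (a small illustration with made-up data; the vectors `x` and `y` are hypothetical):

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.9, 3.5, 3.9, 5.1]

zx = (x .- mean(x)) ./ std(x)   # z-scores for x
zy = (y .- mean(y)) ./ std(y)   # z-scores for y

# each product zx[i] * zy[i] is positive for points in quadrants I and III
# and negative in II and IV; scaling their sum by n - 1 gives the correlation
r = sum(zx .* zy) / (length(x) - 1)
r ≈ cor(x, y)   # true
```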

@@ -816,7 +816,7 @@ for (k,d) ∈ pairs(gdf) # GroupKey, SubDataFrame
end
```

Now we identify different regression lines (slope and intercepts) for each cluster. This is done throuh a *multiplicative* model and is specified in the model formula of `StatsModels` with a `*`:
Now we identify different regression lines (slope and intercepts) for each cluster. This is done through a *multiplicative* model and is specified in the model formula of `StatsModels` with a `*`:

```{julia}
m3 = lm(@formula(PetalLength ~ PetalWidth * Species), iris)
18 changes: 9 additions & 9 deletions EDA/tabular-data-julia.qmd
@@ -33,7 +33,7 @@ There are different ways to construct a data frame.

Consider the task of the Wirecutter in trying to select the best [carry on travel bag](https://www.nytimes.com/wirecutter/reviews/best-carry-on-travel-bags/#how-we-picked-and-tested). After compiling a list of possible models by scouring travel blogs etc., they select some criteria (capacity, compartment design, aesthetics, comfort, ...) and compile data, similar to what one person collected in a
[spreadsheet](https://docs.google.com/spreadsheets/d/1fSt_sO1s7moXPHbxBCD3JIKPa8QIZxtKWYUjD6ElZ-c/edit#gid=744941088).
Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatability, loading style, and a last-checked date -- as this market improves constantly.
Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date -- as this market improves constantly.

```
product v p l loads checked
@@ -42,7 +42,7 @@ Minaal 3.0 35 349 Y front panel 2022-09
Genius 25 228 Y clamshell 2022-10
```

We see that product is a character, volume and price numeric, laptop compatability a Boolean value, load style one of a few levels, and the last checked date, a year-month date.
We see that product is a character, volume and price numeric, laptop compatibility a Boolean value, load style one of a few levels, and the last checked date, a year-month date.

We create vectors to hold each. We load the `CategoricalArrays` and `Dates` packages for a few of the variables:

@@ -51,7 +51,7 @@ using CategoricalArrays, Dates
product = ["Goruck GR2", "Minaal 3.0", "Genius"]
volume = [40, 35, 25]
price = [395, 349, 228]
laptop_compatability = categorical(["Y","Y","Y"])
laptop_compatibility = categorical(["Y","Y","Y"])
loading_style = categorical(["front panel", "front panel", "clamshell"])
date_checked = Date.(2022, [9,9,10])
```
@@ -60,7 +60,7 @@ With this, we use the `DataFrame` constructor to combine these into one data set

```{julia}
d = DataFrame(product = product, volume=volume, price=price,
var"laptop compatability"=laptop_compatability,
var"laptop compatibility"=laptop_compatibility,
var"loading style"=loading_style, var"date checked"=date_checked)
```

@@ -79,7 +79,7 @@ In the above construction, we repeated the names of the variables to the constr

```{julia}
d = DataFrame(; product, volume, price,
var"laptop compatability"=laptop_compatability,
var"laptop compatibility"=laptop_compatibility,
var"loading style"=loading_style, var"date checked"=date_checked)
```

@@ -94,7 +94,7 @@ d = DataFrame() # empty data frame
d.product = product
d.volume = volume
d.price = price
d."laptop compatability" = laptop_compatability
d."laptop compatibility" = laptop_compatibility
d."loading style" = loading_style
d."date checked" = date_checked
d
@@ -225,7 +225,7 @@ The `rename!` function allows the names to be changed in-place (without returnin

### Indexing and assignment

The values in a data frame can be referenced programatically by a row number and column number, both 1-based. For example, the 2nd row and 3rd column of `d` can be seen to be `349` by observation
The values in a data frame can be referenced programmatically by a row number and column number, both 1-based. For example, the 2nd row and 3rd column of `d` can be seen to be `349` by observation

```{julia}
d
@@ -441,7 +441,7 @@ cars1 = filter(:Manufacturer => ==("Volkswagen"), cars)
cars2 = filter(:MPGCity => >=(20), cars1)
```

The above required the introduction of an intermediate data frame to store the result of the first `filter` call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldly, so staged modification is common. To support that, the chaining or piping operation (`|>`) is often used:
The above required the introduction of an intermediate data frame to store the result of the first `filter` call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldy, so staged modification is common. To support that, the chaining or piping operation (`|>`) is often used:

```{julia}
filter(:Manufacturer => ==("Volkswagen"), cars) |>
@@ -590,7 +590,7 @@ When `AsTable` is used on the source columns, as in `AsTable([:p,:v])`, then the

#### Transform

Extending the columns in the data frame by `select` is common enough that the function `transform` is supplied which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transfrom is equivalent to `select(df, :, args...)`.
Extending the columns in the data frame by `select` is common enough that the function `transform` is supplied which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transform is equivalent to `select(df, :, args...)`.
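A minimal sketch of that equivalence, assuming a small data frame with the `:p` and `:v` columns used earlier (the target column name `:p_per_v` is made up for illustration):

```julia
using DataFrames

df = DataFrame(p = [395, 349, 228], v = [40, 35, 25])

# transform keeps every original column and appends the new one ...
t1 = transform(df, [:p, :v] => ByRow((p, v) -> p / v) => :p_per_v)

# ... equivalent to select with `:` to retain all existing columns
t2 = select(df, :, [:p, :v] => ByRow((p, v) -> p / v) => :p_per_v)

t1 == t2   # true
```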


::: {.callout-note}
2 changes: 1 addition & 1 deletion EDA/univariate-julia.qmd
@@ -324,7 +324,7 @@ When a vector is passed to a function, if there is no copy made (as opposed to a
:::


Multiple values can be assigned at once. For example, if the data was mis-arranged chronologically, we might have:
Multiple values can be assigned at once. For example, if the data was misarranged chronologically, we might have:

```{julia}
whale[ [1,2,3] ] = [235, 74, 122]
8 changes: 4 additions & 4 deletions Inference/distributions.qmd
@@ -13,7 +13,7 @@ using CairoMakie, AlgebraOfGraphics

This section quickly reviews the basic concepts of probability.

Mathematically a probability is an assignment of numbers to a collection of events (sets) of a probability space. These values may be understood from a model or through long term frequencies. For example, consider the tossing of a *fair* coin. By writing "fair" the assumption is implicitly made that each side (heads or tails) is equally likely to occur on a given toss. That is a mathematical assumption. This can be reaffirmed by tossing the coin *many* times and counting the frequency of a heads occuring. If the coin is fair, the expectation is that heads will occur in about half the tosses.
Mathematically a probability is an assignment of numbers to a collection of events (sets) of a probability space. These values may be understood from a model or through long term frequencies. For example, consider the tossing of a *fair* coin. By writing "fair" the assumption is implicitly made that each side (heads or tails) is equally likely to occur on a given toss. That is a mathematical assumption. This can be reaffirmed by tossing the coin *many* times and counting the frequency of a heads occurring. If the coin is fair, the expectation is that heads will occur in about half the tosses.

The mathematical model involves a formalism of sample spaces and events. There are some subtleties due to infinite sets, but we limit our use of events to subsets of finite or countably infinite sets or intervals of the real line. A probability *measure* is a function $P$ which assigns each event $E$ a number with:

@@ -75,7 +75,7 @@ A **discrete** random variable is one which has $P(X = k) > 0$ for at most a fin

A **continuous** random variable is described by a function $f(x)$ where $P(X \leq a)$ is given by the *area* under $f(x)$ between $-\infty$ and $a$. The function $f(x)$ is called the pdf (probability density function). An immediate consequence is the *total* area under $f(x)$ is $1$ and $f(x) \geq 0$.

When defined, the pdf is the basic description of the distribution of a random variable. It says what is *possible* and *how likely* possible things are. For the two cases above, this is done differently. In the discrete case, the possible values are all $k$ where $f(k) =P(X=k) > 0$, but not all values are equally likely unless $f(k)$ is a constant. For the continuous case there are **no** values with $P(X=k) > 0$, as probabilities are assigned to area, and the corresponding area to this event, for any $k$, is $0$. Rather, values can only appear in itervals with positive area ($f(x) > 0$ within this interval) and for equal-length intervals, those with more area above them are more likely to contain values.
When defined, the pdf is the basic description of the distribution of a random variable. It says what is *possible* and *how likely* possible things are. For the two cases above, this is done differently. In the discrete case, the possible values are all $k$ where $f(k) =P(X=k) > 0$, but not all values are equally likely unless $f(k)$ is a constant. For the continuous case there are **no** values with $P(X=k) > 0$, as probabilities are assigned to area, and the corresponding area to this event, for any $k$, is $0$. Rather, values can only appear in intervals with positive area ($f(x) > 0$ within this interval) and for equal-length intervals, those with more area above them are more likely to contain values.

A data set in statistics, $x_1, x_2, \dots, x_n$, is typically modeled by a collection of random variables, $X_1, X_2, \dots, X_n$. That is, the random variables describe the *possible* values that can be collected, the values ($x_1, x_2,\dots$) describe the actual values that were collected. Put differently, random variables describe what can happen *before* a measurement, the values are the result of the measurement.

@@ -147,7 +147,7 @@ Statistical inference makes statements using the language of probability about t

An intuitive example is the tossing of a fair coin; modeling heads by a $1$ and tails by a $0$, we can *parameterize* the distribution by $f(1) = P(X=1) = p$ and $f(0) = P(X=0) = 1 - P(X=1) = 1-p$. This distribution is summarized by $\mu=p$, $\sigma = \sqrt{p(1-p)}$. A *fair* coin would have $p=1/2$. A sequence of coin tosses, say H,T,T,H,H might be modeled by a sequence of iid random variables, each having this distribution. Then we might expect a few things, where $\hat{p}$ below is the proportion of heads in the $n$ tosses:

* A given data set is not random, but it may be viewed as the result of a random process and had that process been run again would likely result in a different outcome. These different outcomes may be described probabalistically in terms of a distribution.
* A given data set is not random, but it may be viewed as the result of a random process and had that process been run again would likely result in a different outcome. These different outcomes may be described probabilistically in terms of a distribution.
* If $n$ is large enough, the sample proportion $\hat{p}$ should be *close* to the population proportion $p$.
* Were the sampling repeated, the variation in the values of $\hat{p}$ should be smaller for larger sample sizes, $n$.

@@ -345,7 +345,7 @@ draw(p)
```


In `Distributions` the `Categorical` type can alse have been used to construct this distribution, it being a special case of `DiscreteNonParametric` with the `xs` being $1, \dots, k$.
In `Distributions` the `Categorical` type could also have been used to construct this distribution, it being a special case of `DiscreteNonParametric` with the `xs` being $1, \dots, k$.

The multinomial distribution is the distribution of counts for a sequence of $n$ iid random variables from a `Categorical` distribution. This generalizes the binomial distribution. Let $X_i$ be the number of type $i$ in $n$ samples. Then $X_1 + X_2 + \cdots + X_k = n$, so these are not independent. They have mean $E(X_i)=np_i$, variance $VAR(X_i) = np_i (1-p_i)$, like the binomial, but covariance $COV(X_i, X_j) = -np_i p_j, i \neq j$. (Negative, as large values for $X_i$ correlate with smaller values for $X_j$ when $i \neq j$.)
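These moments can be checked against `Distributions` (a quick sketch; the values of `n` and the probability vector are arbitrary):

```julia
using Distributions

n, p = 10, [0.2, 0.3, 0.5]
m = Multinomial(n, p)

mean(m)   # n .* p, here [2.0, 3.0, 5.0]
var(m)    # n .* p .* (1 .- p), as for the binomial
cov(m)    # off-diagonal entries are -n * p[i] * p[j], all negative
```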

6 changes: 3 additions & 3 deletions Inference/inference.qmd
@@ -643,7 +643,7 @@ confint(OneSampleTTest(ys), level = 0.95)

The two differ -- they use different sampling distributions and methods -- though simulations will show both approaches create CIs capturing the true mean at the rate of the confidence level.

The above example does not showcase the advantage of the maximimum likelihood methods, but hints at a systematic way to find confidence intervals, which for some cases is optimal, and is more systematic then finding some pivotal quantity (e.g. the $T$-statistic under a normal population assumption).
The above example does not showcase the advantage of the maximum likelihood methods, but hints at a systematic way to find confidence intervals, which for some cases is optimal, and is more systematic than finding some pivotal quantity (e.g. the $T$-statistic under a normal population assumption).



@@ -656,7 +656,7 @@ The basic setup is similar to a courtroom trial in the United States -- as seen
* a defendant is judged by a jury with an *assumption of innocence*
* presentation of evidence is given
* the jury weighs the evidence *assuming* the defendant is innocent.
* If it is a civil trial a *preponderence of evidence* is enough for the jury to say the defendent is guilty (not innocent); if a criminal trial the standard is if the evidence is "beyond a reasonable doubt" then the defendent is deemed not innocent. Otherwise the defendant is said to be "not guilty," though really it should be that they weren't "proven" to be guilty.
* If it is a civil trial a *preponderance of evidence* is enough for the jury to say the defendant is guilty (not innocent); if a criminal trial the standard is if the evidence is "beyond a reasonable doubt" then the defendant is deemed not innocent. Otherwise the defendant is said to be "not guilty," though really it should be that they weren't "proven" to be guilty.

In a hypothesis or significance test for parameters, the setup is similar:

@@ -1388,7 +1388,7 @@ $$
H_0: \mu = \mu_0, \quad H_A: \mu = \mu_1
$$

Suppose the population is $Normal(\mu, \sigma)$. We had a similar setup in the discussion on power, where for a $T$-test specifying three of a $\alpha$, $\beta$, $n$, or an effect size allows the solving of the fourth using known facts about the $T$-statistic. The Neyman-Pearson lemma speaks to the *uniformly most powerful* test under this scenario with a **single** unknown parameter (the mean above, but it could also have been the standard devation, etc.).
Suppose the population is $Normal(\mu, \sigma)$. We had a similar setup in the discussion on power, where for a $T$-test specifying three of a $\alpha$, $\beta$, $n$, or an effect size allows the solving of the fourth using known facts about the $T$-statistic. The Neyman-Pearson lemma speaks to the *uniformly most powerful* test under this scenario with a **single** unknown parameter (the mean above, but it could also have been the standard deviation, etc.).

This test can be realized as a likelihood ratio test, which also covers tests of more generality. Suppose the parameters being tested are called $\theta$ which sit in some subset $\Theta_0 \subset \Theta$. The non-directional alternative would be $\theta$ is in $\Theta \setminus \Theta_0$.

2 changes: 2 additions & 0 deletions _typos.toml
@@ -1,2 +1,4 @@
[default.extend-words]
Pn = "Pn"
annote = "annote"
Annote = "Annote"
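For reference, entries in `[default.extend-words]` that map a word to itself whitelist it, while mapping to a different word defines a project-specific correction. The entries below are hypothetical examples of the two forms, not part of this change:

```toml
[default.extend-words]
# identity mapping: accept "Pn" as a valid word
Pn = "Pn"
# non-identity mapping: always correct "hte" to "the"
hte = "the"
```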