76 changes: 68 additions & 8 deletions EDA/tabular-data-julia.qmd
@@ -673,17 +673,27 @@
flatten(dxy, :value)

The split-apply-combine strategy for wrangling data frames was popularized in the influential paper "*The Split-Apply-Combine Strategy for Data Analysis*" [@JSSv040i01] by H. Wickham, which introduces this description: "where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together."

#### Grouping data

The `groupby` function for data frames does the breaking up, the `DataFrames` mini language can operate on each piece, and the `combine` function can put things together again.

- The `groupby` function splits a data frame into pieces of type `GroupedDataFrame`. The basic signature is `groupby(df, cols; sort=nothing, skipmissing=false)`. The `cols` are one or more column selectors. The value of `sort` defaults to `nothing` leaving the order of the groups in the result undefined, but the grouping uses the fastest identified algorithm. The `skipmissing` argument can instruct the skipping of rows which have missing values in the selected columns.
+ Grouping takes one or more *categorical variables* and creates a grouped data frame for each distinct set of *combined* levels. That is, one grouped data frame for each observed combination of the levels.
+ The `groupby` function splits the data frame into pieces of type `GroupedDataFrame`. The basic signature is `groupby(df, cols; sort=nothing, skipmissing=false)`. The `cols` are one or more column selectors. The value of `sort` defaults to `nothing`, which leaves the order of the groups in the result undefined but lets the grouping use the fastest available algorithm. The `skipmissing` argument instructs `groupby` to skip rows with missing values in the selected columns.

- For example, grouping by loading style, returns two sub data frames.
+ For example, grouping the baggage data by loading style returns two sub data frames.

```{julia}
gdf = groupby(d, :load)
```
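The keyword arguments can be sketched with a small, hypothetical frame (the column names `g` and `x` here are made up for illustration; `DataFrames` is loaded in the chapter, but included for completeness):

```{julia}
using DataFrames

# A tiny illustrative frame with a missing value in the grouping column
d2 = DataFrame(g = ["b", "a", missing, "a"], x = 1:4)

# sort=true orders the groups by level; skipmissing=true drops the row whose
# group value is missing, leaving the two groups "a" and "b"
g2 = groupby(d2, :g; sort=true, skipmissing=true)
```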

Grouping by more than one variable is achieved by passing in a selector with multiple variables. The following also creates two sub data frames, but only because the unique combinations of both load and date are the same as those of load alone:

```{julia}
gdf = groupby(d, [:load, :d])
```



Indices may be used to get each sub data frame. Moreover, the `keys` method returns keys, like indices, to efficiently look up the different sub data frames in `gdf`. The levels for the keys are found in the property `.load`, as with `keys(gdf)[1].load`. Grouped data frames may be iterated over; the `pairs` iterator iterates over both keys and values.
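A short sketch of these access patterns, using a small made-up frame with a `:load` column (the data values here are hypothetical):

```{julia}
using DataFrames

toy = DataFrame(load = ["front", "rear", "front"], p = [300, 350, 400])
gtoy = groupby(toy, :load)

k = keys(gtoy)[1]    # a GroupKey; its level is in the property named after the column
k.load               # the level of this group
gtoy[k]              # key-based lookup of a sub data frame; gtoy[1] also works
[key.load => nrow(g) for (key, g) in pairs(gtoy)]  # iterate keys and values together
```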

The `select` and `subset` functions work over `GroupedDataFrame` objects; here we see the returned values are combined:
@@ -694,12 +704,18 @@
subset(gdf, :p => ByRow(<=(350)))

(The command `[subset(g, :p => ByRow(<=(350))) for g in gdf]` avoids the `combine` step.)

#### Group and combine

We can see that `combine` also works over `GroupedDataFrame` objects. This example shows how to find the mean city mileage of groups formed by the value of `:Origin` in the `cars` data set:

```{julia}
combine(groupby(cars, :Origin), :MPGCity => mean)
```

Typically, the function applied to a column is a reduction, which results in *aggregated data*.
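Several reductions can be requested at once; a sketch with made-up data (the column names are hypothetical):

```{julia}
using DataFrames, Statistics

ex = DataFrame(origin = ["USA", "USA", "non-USA"], mpg = [20, 30, 28])

# nrow counts rows per group; each pair aggregates one column
res = combine(groupby(ex, :origin; sort=true),
              nrow => :n, :mpg => mean => :avg, :mpg => maximum => :max)
```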

By default the keys used to group (`:Origin`) are kept by `combine`. The `keepkeys` argument controls this behavior when `combine` is applied to grouped data.
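For instance, passing `keepkeys=false` drops the grouping column from the result (a sketch with hypothetical data):

```{julia}
using DataFrames, Statistics

ex = DataFrame(origin = ["USA", "USA", "non-USA"], mpg = [20, 30, 28])
# Only the aggregated column remains in the output
res = combine(groupby(ex, :origin), :mpg => mean; keepkeys=false)
```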

More than one column can be used to group the data. This example first groups by the number of cylinders, then groups by origin. For each of the 9 resulting groups, the group mean for city mileage is taken:

```{julia}
@@ -857,7 +873,7 @@
end
#### The SplitApplyCombine package

There is support for the split-apply-combine paradigm outside of the `DataFrames` package in the package `SplitApplyCombine`. This package is much lighter weight than `DataFrames`.
- This support targets other representations of tabular data, such as a vector of nested tuples:
+ This support targets other representations of tabular data, such as a vector of named tuples:


```{julia}
@@ -871,13 +887,13 @@
tbl = [

Many of the main verbs are generic functions: `filter`, `map`, `reduce`; `group` does grouping. There are some others. A few examples follow.
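For example, `group` pairs each key produced by the first function with the collection of elements mapping to it (this sketch assumes only `SplitApplyCombine` is loaded):

```{julia}
using SplitApplyCombine

g = group(iseven, 1:6)  # an indexable collection of groups, keyed by the return value
g[true]                 # the even numbers
```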

- For indexing this structure, we can't index by `[row, col]`, rather we can use `[row][col]` notation, the first to extract the row, the second to access the entries in the column. For example, he we index a selected row by position, by name, and then with multiple names:
+ For indexing this structure, we can't index by `[row, col]`; rather, we use `[row][col]` notation, the first to extract the row, the second to access the entries in the column. For example, here we index a selected column by position, by name, and then with multiple names:

```{julia}
tbl[2][2], tbl[2].v, tbl[2][ [:v, :p] ]
```

- The `invert` function reverses the access pattern, allowing in this case, `[col][row]` access, with:
+ The `invert` function from `SplitApplyCombine` reverses the access pattern, allowing, in this case, `[col][row]` access:

```{julia}
itbl = invert(tbl)
@@ -1059,12 +1075,28 @@
cars_price = select(cars, 1:3, r"Price" => ByRow(_mean) => :price)
first(cars_price, 3)
```

- To aggregate over columns, the `groupby` construct can be used to specify the groupings to aggregate over. In the following this is the car type:
+ To aggregate over columns, the `groupby` construct can be used to specify the groupings to aggregate over. In the following, this is the car type. The `@chain` macro (from the `Chain` package) is used to make the steps more explicit:

```{julia}
- combine(groupby(cars_price, :Type), :price => mean => :avg_price)
+ @chain cars_price begin
+     groupby(:Type)
+     combine(:price => mean => :avg_price)
+ end
```

As `mean` is a reduction, the result has one row for each unique set of levels in the column selector specified to `groupby`. Again, that these columns appear in the output is due to the default value of `keepkeys`.

A similar example groups by two variables and then aggregates highway mileage over each:

```{julia}
@chain cars begin
    groupby([:Type, :DriveTrain])
    combine(:MPGHighway => mean => :avg_MPG)
    sort(:avg_MPG, rev=true)
end
```



### Tidy data; stack, unstack

@@ -1078,7 +1110,35 @@
These terms are defined by

> A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

- In addition to `flatten`, the `DataFrames` package provides the functions `stack` and `unstack` for tidying data. We illustrate with an abbreviated version of an example in the `tidyr` vignette.
+ In addition to `flatten`, the `DataFrames` package provides the functions `stack` and `unstack` for tidying data.

::: {.callout-note}

### Stack and unstack
The `stack` and `unstack` commands are used to move from "wide" format, where related values are stored in multiple columns with column names representing some level, to "long" format, where related values are stored in one column, and their categorical levels in another column.

For example, this data frame is in wide format, as the identifying attributes are in columns `:A` and `:B` and the values in columns `:C` and `:D`.

```{julia}
wide = DataFrame(A=["A","B"], B=["a","b"], C=["c1","c2"], D=["d1","d2"])
```

With the `stack` function, the values are placed into a single column (by default `:value`) and the variable names placed into a single column (by default `:variable`):

```{julia}
long = stack(wide, [:C, :D])
```

The values in the `:variable` column come from the names of the stacked columns.
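The default names `:variable` and `:value` can be overridden with the `variable_name` and `value_name` keyword arguments:

```{julia}
using DataFrames

wide = DataFrame(A=["A","B"], B=["a","b"], C=["c1","c2"], D=["d1","d2"])
# The variable column is named :key and the value column :val
long = stack(wide, [:C, :D]; variable_name=:key, value_name=:val)
```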

With `unstack` the data can be spread back out into "wide" format. We pass `unstack` the two columns indicating the variable and the values:

```{julia}
unstack(long, :variable, :value)
```
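As a check, stacking and then unstacking recovers the original wide table (the unstacked columns may allow `missing` in their element types, but the values agree):

```{julia}
using DataFrames

wide = DataFrame(A=["A","B"], B=["a","b"], C=["c1","c2"], D=["d1","d2"])
long = stack(wide, [:C, :D])
wide2 = unstack(long, :variable, :value)  # remaining columns :A and :B act as row keys
wide2 == wide
```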
:::

We illustrate this further with an abbreviated version of an example in the `tidyr` vignette.

Consider this data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010. The full data has 31 days per month; this abbreviated version works with just the first 8:

57 changes: 31 additions & 26 deletions Inference/distributions.qmd
@@ -157,7 +157,7 @@
These imprecise statements can be made more precise, as will be seen. Figure @fi
```{julia}
#| echo: false
#| label: fig-pop-parameters-sample
- #| fig-cap: The top-left figure shows a population, the lower left parameters associated with the population. The top-right shows one sample from the population in red, other possible samples in blue with sample means summarized in the bottom most row. A density estimate of these sample means appears in the lower right figure showing how it is centered on the population, but more concentrated. Inference uses the language of probability to characterize a population parameter based on a single random sample.
+ #| fig-cap: The top-left figure shows a population, the lower-left the parameters associated with the population. The top-right shows one sample from the population in red, other possible samples in blue, with sample means summarized in the bottommost row. A density estimate of these sample means appears in the lower-right figure, showing how it is centered on the population mean but more concentrated. Inference uses the language of probability to characterize a population parameter based on a single random sample.
μ, σ = 0, 1
P = Normal(μ, σ)
xs = range(-3, 3, length=251)
@@ -483,16 +483,14 @@
In @fig-qqplots-distributions the lower-right graphic is of the uniform distrib
#| fig-cap: Quantile-normal plots for different distributions -- $T_3$ is leptokurtic; $T_{100}$ is approximately normal; $U$ is platykurtic; and $E$ is skewed.

probs = range(0.01, 0.99, 40)
- Ds = (T₃ = TDist(3), T₁₀₀=TDist(100), U=Uniform(0,1), E=Exponential(1))
- d = DataFrame(D=collect(keys(Ds)), qs = [quantile.(D,probs) for D in Ds])
- d = flatten(d, :qs)
- #data(d) * visual(QQNorm, qqline=:fit) * mapping(:qs, main=:D, color=:D, layout=:D)
-
- d.Z = rand(Normal(0,1), size(d,1))
- data(d) * visual(QQPlot, qqline=:fit) * mapping(:Z, :qs, layout=:D) |> draw
+ Ds = (N = Normal(0,1), T₃ = TDist(3), T₁₀₀ = TDist(100),
+       U = Uniform(0,1), E = Exponential(1))
+ m = DataFrame([quantile.(D, probs) for D in Ds], collect(keys(Ds)))
+ d = stack(m, Not(:N)) # need to use QQPlot, so replicate the N
+ data(d) * mapping(:N, :value, layout=:variable => nonnumeric) *
+     visual(QQPlot, qqline=:fit) |> draw
```


Consider now two independent Chi-squared random variables $Y_1$ and $Y_2$ with $\nu_1$ and $\nu_2$ degrees of freedom. The ratio

$$
@@ -511,14 +509,18 @@
kurtosis(FDist(5, 10)), kurtosis(FDist(5,100)), kurtosis(FDist(100,100))
#| echo: false
#| label: fig-f-distributions
#| fig-cap: Quantile-normal plots of the $F$ distribution for different degrees of freedom. As the degrees of freedom values get bigger, the distribution is more normal.
- νs = [(1, 10), (5,10), (5, 100), (100, 100)]

probs = range(0.01, 0.99, 40)
- d = DataFrame(D = ["F($ν₁,$ν₂)" for (ν₁, ν₂) ∈ νs],
-     qs = [quantile.(FDist(ν...), probs) for ν ∈ νs])
- d = flatten(d, :qs)
- #data(d) * visual(QQNorm, qqline=:fit) * mapping(:qs, main=:D, color=:D, layout=:D) |> draw
- d.Z = rand(Normal(0,1), size(d,1))
- data(d) * visual(QQPlot, qqline=:fit) * mapping(:Z, :qs, layout=:D) |> draw
+ Ds = (N = Normal(0,1),
+       var"F(1,10)" = FDist(1,10),
+       var"F(5,10)" = FDist(5,10),
+       var"F(5,100)" = FDist(5,100),
+       var"F(100,100)" = FDist(100,100))
+ m = DataFrame([quantile.(D, probs) for D in Ds], collect(keys(Ds)))
+ d = stack(m, Not(:N)) # need to use QQPlot, so replicate the N
+ data(d) * mapping(:N, :value, layout=:variable => nonnumeric) *
+     visual(QQPlot, qqline=:fit) |> draw
```


@@ -557,18 +559,19 @@
In general, how large $n$ needs to be depends on the underlying shape of the *po
xbar(D, n, N=200) = [mean(rand(D, n)) for _ in 1:N]
sxbar(D, n, N=200) = (xbar(D,n,N) .- mean(D)) / (std(D)/sqrt(n))

- Ds = (normal = Normal(0,1),
+ Ds = (N = Normal(0,1),
+       normal = Normal(0,1),
      leptokurtic = TDist(3),
      skewed = FDist(2,5),
      platykurtic = Uniform(0,1))
- df = DataFrame(D=collect(keys(Ds)), zs = [sxbar(D, 10) for D in values(Ds)])
-
- d = flatten(df, :zs)
- #p = data(d) * visual(QQNorm, qqline=:fit) *
- #    mapping(:zs, layout = :D, main=:D, color=:D)
- # draw(p)
- d.Z = rand(Normal(0,1), size(d,1))
- data(d) * visual(QQPlot, qqline=:fit) * mapping(:Z, :zs, layout=:D) |> draw
+ probs = range(0.01, 0.99, 200)
+ m = DataFrame([quantile.((sxbar(D,10),), probs) for D in Ds], collect(keys(Ds)))
+ d = stack(m, Not(:N)) # need to use QQPlot
+ p = data(d) * mapping(:N, :value, layout=:variable => nonnumeric) *
+     visual(QQPlot, qqline=:fit)
+
+ draw(p)
```


@@ -670,8 +673,10 @@
Pops = (var"long-tailed"=TDist(3), skewed=Exponential(1))
df = DataFrame([(D="$SDk $Popk", x=quantile(Popv, ps), y = quantile(rand(SDv, N), ps))
for (SDk,SDv) in pairs(SDs) for (Popk, Popv) in pairs(Pops)])

- p = data(flatten(df, [:x,:y])) * (visual(Scatter) + linear(;interval=nothing)) *
-     mapping(:x, :y,layout=:D, color=:D)
+ d = data(flatten(df, [:x,:y]))
+ p = d * (visual(Scatter) + linear(; interval=nothing)) *
+     mapping(:x, :y; layout=:D, color=:D)

draw(p, facet=(; linkxaxes=:none, linkyaxes=:none))

```
8 changes: 4 additions & 4 deletions Inference/inference.qmd
@@ -37,7 +37,7 @@
xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
- p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+ p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```

@@ -49,7 +49,7 @@
xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
- p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+ p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```

@@ -61,7 +61,7 @@
xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
- p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+ p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```

@@ -73,7 +73,7 @@
xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
- p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+ p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```

7 changes: 7 additions & 0 deletions README-quarto.md
@@ -0,0 +1,7 @@
To get this published:

```
quarto render                        # check for Errors...
quarto publish gh-pages --no-render
```
7 changes: 5 additions & 2 deletions _quarto.yml
@@ -3,6 +3,8 @@
engines: ['julia']

project:
type: book
output-dir: _book


book:
title: "Using Julia for Introductory Statistics"
@@ -26,9 +28,10 @@
book:
- EDA/makie.qmd
- Inference/distributions.qmd
- Inference/inference.qmd
# - Inference/permutation_tests.qmd
- LinearModels/linear-regression.qmd
- references.qmd
# - LinearModels/glm.qmd
- references.qmd

bibliography: references.bib
csl: the-american-statistician.csl
@@ -61,6 +64,6 @@
format:

execute:
warning: false
freeze: false #auto
error: false
# freeze: auto
cache: false
7 changes: 0 additions & 7 deletions check_version.jl

This file was deleted.

1 change: 1 addition & 0 deletions index.qmd
@@ -12,6 +12,7 @@
science, within which much of this material is covered. For example,
[@storopolihuijzeralonso2021juliadatascience] is very well done,
[@nazarathy2021statisticsjulia] covers topics here (cf. the [JuliaCon
Workshop](https://www.youtube.com/watch?v=IlPoU5Yr2QI)).
[@RomeoAndJulia] covers many of the topics here from a biologist's viewpoint.
The quarto book [Embrace Uncertainty](https://juliamixedmodels.github.io/EmbraceUncertainty/) [@Embrace-Uncertainty-Fitting-Mixed-Effects-Models-with-Julia] covers the more advanced topic of mixed-effects models in `Julia`. Nothing here couldn't be found in those resources; these notes are just an introduction.


8 changes: 8 additions & 0 deletions references.bib
@@ -117,6 +117,14 @@
@Book{JuliaForDataAnalysis
year = 2022,
url = {https://www.manning.com/books/julia-for-data-analysis}}

@book{RomeoAndJulia,
title = {Romeo and Julia where Romeo is Basic Statistics},
author = {B Lukaszuk},
url = {https://b-lukaszuk.github.io/RJ_BS_eng/},
year = {2023},
}


@article{JSSv107i04,
title={DataFrames.jl: Flexible and Fast Tabular Data in Julia},
volume={107},