- 10/22/2020
- 11/16/2020
- 1/11/2021
- 1/20/2021
- 2/17/2021
- 3/10/2021
- 3/12/2021
- 4/16/2021
- 5/13/2021
- 6/8/2021
- 9/15/2021
- 6/9/23
- 10/4/23
Objectives: be proactive/outspoken, considerate, your own advocate, improve organization and English skills, and deliver.
Action: Listen to radio (88.5) 30 minutes/day, get Grammarly, and find opportunities to practice organization
Strength: Analytical skills, programming skills; Areas to improve: critical thinking skills
Finite Mixture distribution
WWW of Tibetans Study
What: A study of fertility in Tibetans
Why: Understanding natural selection as G X E X S
How: Get the best data and probe the important features that affect the outcomes, via a collaborative interdisciplinary team
ToDo:
How to treat missing information:
1. Tree/Random Forest – just run it with the default and the alternative options and show me the results
Priors:
P1 - The default (natural) prior probability mimics the natural frequencies of the response variable in the dataset.
P2 - The equal prior probability places no bias toward either stage for each record; classification is based on the symptoms and other predictors rather than a priori assumptions.
Missing variable surrogate criterion:
M1 - ‘Regular (or raw) accuracy’ selects surrogate variables by maximizing the total number of correct classifications for a potential surrogate variable. Thus, both missing and non-missing values contribute to the choice of surrogate variable.
M2 - ‘Percent accuracy’ selects surrogate variables by maximizing the percent correct classification, calculated over the non-missing values of the surrogate at the current node. Thus, only non-missing values contribute to the choice of surrogate variable.
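To make the two priors and the two surrogate criteria concrete, here is a minimal Python sketch. The function names, the class labels, and the toy data are hypothetical, and counting a missing surrogate value as "not a correct classification" under M1 is an assumption about how the tree software scores it.

```python
from collections import Counter

def natural_prior(labels):
    """P1: class priors equal to the observed class frequencies."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: counts[cls] / n for cls in counts}

def equal_prior(labels):
    """P2: the same prior for every class, regardless of frequency."""
    classes = set(labels)
    return {cls: 1 / len(classes) for cls in classes}

def surrogate_scores(primary_left, surrogate_left):
    """Score one candidate surrogate split at one node.

    primary_left[i]   : True if the primary split sends record i left.
    surrogate_left[i] : the surrogate's direction, or None if its value is missing.
    Returns (raw_correct, percent_correct), i.e. the M1 and M2 criteria.
    """
    observed = [(p, s) for p, s in zip(primary_left, surrogate_left) if s is not None]
    raw_correct = sum(1 for p, s in observed if p == s)
    percent_correct = raw_correct / len(observed) if observed else 0.0
    return raw_correct, percent_correct

# Hypothetical toy data: 10 records at a node.
labels = ["early"] * 8 + ["late"] * 2
primary = [True] * 5 + [False] * 5
# Surrogate A: never missing, agrees with the primary split on 7 of 10 records.
surrogate_a = [True, True, True, True, False, False, False, False, True, True]
# Surrogate B: missing on 6 records, agrees on all 4 observed records.
surrogate_b = [True, None, None, None, True, False, None, None, None, False]

print(natural_prior(labels), equal_prior(labels))
print(surrogate_scores(primary, surrogate_a))
print(surrogate_scores(primary, surrogate_b))
```

In this toy case M1 prefers surrogate A (more correct classifications in total), while M2 prefers surrogate B (perfect agreement where it is observed), which is why it is worth running both options.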
If y is missing, remove the row
For X
- Participation: F vs P, do simple summary() the two groups separately
- Age: fill in
General guideline:
- Fill in missing info as much as possible
- Run summary() to get a sense of % missing
- Check the patterns among those that are missing: missing(), ggobi
- Remove variables with a large % of missing
- Run analysis with complete data and with incomplete data: Y ~ complete x; Y ~ partial columns
- Remove variables
- Remove handtemp, pulse, sat, hb, perfusion for 2019, diff, diffavg.
Reason: we should include only the actual variables for handtemp, pulse, sat, hb, and perfusion.
- Add avg value for id=317 & 676.
They are missing the avg value, but they have the 2019 value.
- Remove breathless, MenstrualStatus2012, altitude2
- Question: MenstrualStatus2019 vs. MenstrualStatus2019-2; U_marstatus2019 vs. U_marstatus2019two
- Find derived variables (color them):
Height, weight, BMI
FEV1, FEV6, FEV1FEV6ratio
LVEwave, LVAwave, LVEA
? TRVEL, TRGRAD, TRmeanGRAD, PASP
TRGRAD = PASP - 3
TRGRAD: 98 NA’s
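One way to confirm that a variable is derived from others is to recompute it and count how often the stored value matches. The sketch below is illustrative only: the helper name `fraction_matching` and the toy rows are hypothetical, and the only relationship taken from the notes is TRGRAD = PASP - 3.

```python
def fraction_matching(rows, target, formula, inputs, tol=1e-6):
    """Share of usable rows where `target` equals formula(*inputs) within `tol`.

    Rows with a missing (None) target or input are skipped.
    Returns None when no row is usable.
    """
    usable = [
        r for r in rows
        if r.get(target) is not None and all(r.get(k) is not None for k in inputs)
    ]
    if not usable:
        return None
    hits = sum(
        1 for r in usable
        if abs(r[target] - formula(*(r[k] for k in inputs))) <= tol
    )
    return hits / len(usable)

# Hypothetical toy rows; None stands for NA.
rows = [
    {"PASP": 28.0, "TRGRAD": 25.0},
    {"PASP": 31.5, "TRGRAD": 28.5},
    {"PASP": None, "TRGRAD": None},
]

share = fraction_matching(rows, "TRGRAD", lambda pasp: pasp - 3, ["PASP"])
print(share)
```

A share of 1.0 suggests the pair carries the same information, so only one of the two should enter a model.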
- Clean data and check missing patterns: if MAR, impute or remove; if not MAR, do #2 and #3. If the number missing for X is large: remove X if MAR; do #3 if not MAR.
- Run tree models (NA’s)
- Run parametric model; note that high TRGRAD is dangerous to the fetus
Y ~ X [with TRGRAD] [- missing rows], on the subpop w/ TRGRAD
Y ~ X [without TRGRAD] [- column of TRGRAD], on the whole population
Y ~ X [without TRGRAD] [- column of TRGRAD, - rows w. TRGRAD], on the subpop w/o TRGRAD [subpop size is 70]
- Compare differences between everpregnants no vs. yes
Slight differences for some variables, no vs. yes
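The missing-data steps above (drop rows with missing y, summarize % missing, then build the three analysis sets around TRGRAD) can be sketched as follows. This is a minimal stdlib Python illustration; the helper names and the toy rows are hypothetical, and None stands for NA.

```python
def percent_missing(rows, variables):
    """Percent of rows with a missing (None) value, per variable."""
    n = len(rows)
    return {v: 100.0 * sum(1 for r in rows if r[v] is None) / n for v in variables}

def complete_cases(rows, variables):
    """Keep only rows where every listed variable is observed."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def drop_columns(rows, names):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in r.items() if k not in names} for r in rows]

# Hypothetical toy data.
rows = [
    {"Y": 3, "age": 41, "TRGRAD": 22.0},
    {"Y": 5, "age": 52, "TRGRAD": None},
    {"Y": None, "age": 47, "TRGRAD": 25.0},
    {"Y": 2, "age": 39, "TRGRAD": None},
]

# If y is missing, remove the row.
rows = [r for r in rows if r["Y"] is not None]

pct = percent_missing(rows, ["age", "TRGRAD"])

# 1. With TRGRAD, on the subpopulation where TRGRAD is observed.
with_trgrad = complete_cases(rows, ["Y", "age", "TRGRAD"])
# 2. Without the TRGRAD column, on the whole population.
whole_no_trgrad = drop_columns(rows, {"TRGRAD"})
# 3. Without the TRGRAD column, on the subpopulation lacking TRGRAD.
without_trgrad = drop_columns([r for r in rows if r["TRGRAD"] is None], {"TRGRAD"})

print(pct, len(with_trgrad), len(whole_no_trgrad), len(without_trgrad))
```

Fitting the same model on sets 2 and 3 and comparing it with set 1 shows how much the conclusions depend on which women have TRGRAD recorded.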
To do:
- Plot VIM of RF (Y ~ X) to figure out which variables are important by RF [=> X1]
a. Tree(Y ~ Xi) for all i’s: MSE, sort, remove ones that have very large MSE, which may be presented as small numbers (i.e., revised in the VIM plot) [=> X2]
- X = cbind(X1, and ones Dr. Beall was interested in), or X = cbind(X2, Dr. B’s babies)
- Poisson (Y~ X). [on complete cases]
- Use step() to select variables in Poisson regression
- Repeat when the age of last pregnancy is removed
- Arrange the output by Poisson regression and RF side-by-side, to compare the chosen variables
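As a reminder of what the Poisson regression and the AIC comparison behind step() are doing, here is a minimal stdlib Python sketch with an intercept and a single predictor. It is not the analysis itself: the function names and the toy counts are hypothetical, the fit uses plain Newton-Raphson without step-halving, and only one "keep or drop the predictor" comparison is shown rather than a full stepwise search.

```python
import math

def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log E[y] = a + b*x by Newton-Raphson (Fisher scoring for the log link)."""
    a, b = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # 2x2 information matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def poisson_aic(y, mu, n_params):
    """AIC = 2k - 2 * log-likelihood for a Poisson model with fitted means mu."""
    loglik = sum(yi * math.log(mi) - mi - math.lgamma(yi + 1) for yi, mi in zip(y, mu))
    return 2 * n_params - 2 * loglik

# Hypothetical toy counts on complete cases.
x = [0, 1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 6, 9]

a, b = fit_poisson(x, y)
mu_full = [math.exp(a + b * xi) for xi in x]
mu_null = [sum(y) / len(y)] * len(y)   # intercept-only model

aic_full = poisson_aic(y, mu_full, n_params=2)
aic_null = poisson_aic(y, mu_null, n_params=1)
print(a, b, aic_full, aic_null)
```

The model with the lower AIC is the one a backward step would keep; repeating this over all candidate predictors is what a stepwise search automates.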
- Check the analyses already done
- Complete the analyses/comparisons with X1 and X2 as noted in 3/10/21’s notes
- Dr. B to let us know comments before we meet again, if needed
lm: Y ~ Xb, E(y_i) = x_i1 * b1 + x_i2 * b2 + …
glm: Y ~ Xb, g(E(y_i)) = x_i1 * b1 + x_i2 * b2 + …, with g = log if Poisson
Linear model (p+1): Y = f(x) + e = a + b_1 x_1 + … + b_p x_p + e = X\beta + e; the regression function f(x) is linear in the x’s, and e ~ N(0, sigma^2), i.e., Y|x ~ N(mu(x), sigma^2)
Denote: mu(x)=E(Y|x), eta(x) = a + b_1 x_1+ ..+ b_p x_p. In this case, mu(x)=eta(x).
Generalized linear model (p+1): g(mu(x)) = eta(x), Y|x ~ exponential family of distributions
Exponential family of distributions includes: Normal, Poisson, Binomial, Gamma, …
General linear model (linear mixed model): Y = X\beta + Z\alpha + e, \alpha ~ normal, e ~ normal
General generalized linear model (generalized linear mixed model): g(mu(x)) = X\beta + Z\alpha
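For reference, the four model classes above can be written side by side in one notation. This is a summary sketch: the covariance symbol G for the random effects and writing the mixed-model mean as conditional on \alpha are assumptions added here, not part of the notes.

```latex
\begin{align*}
\text{Linear model:} \quad
  & Y = X\beta + e, \quad e \sim N(0, \sigma^2 I), \quad \mu(x) = \eta(x) \\
\text{Generalized linear model:} \quad
  & g(\mu(x)) = \eta(x), \quad Y \mid x \sim \text{exponential family} \\
\text{Linear mixed model:} \quad
  & Y = X\beta + Z\alpha + e, \quad \alpha \sim N(0, G), \quad e \sim N(0, \sigma^2 I) \\
\text{Generalized linear mixed model:} \quad
  & g\big(E[Y \mid x, \alpha]\big) = X\beta + Z\alpha, \quad \alpha \sim N(0, G)
\end{align*}
```

For a Poisson response with the canonical link, g = \log, so \mu(x) = e^{\eta(x)}.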
How to compare competing models:
- Look at fitted/predicted plots in one picture: y ~ f_1(x), y ~ e^(f_2(x)), or even residual plots at the equalized scale (pattern and magnitude) and the transformed scales (pattern)
- Examine a numerical measure of GOF: ||y - f_1(x)||^2, ||y - e^(f_2(x))||^2 (or use GCV for GOF)
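The numerical comparison only makes sense when both models are scored on the same (original) scale of y. A minimal sketch, assuming f_1 and f_2 are already fitted; the coefficients and data below are made up for illustration.

```python
import math

def sse(y, fitted):
    """Squared Euclidean norm ||y - fitted||^2."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def compare_on_original_scale(x, y, f1, f2):
    """Score y ~ f1(x) and y ~ exp(f2(x)) on the original scale of y."""
    fit_identity = [f1(xi) for xi in x]
    fit_log = [math.exp(f2(xi)) for xi in x]   # back-transform before scoring
    return {"identity": sse(y, fit_identity), "log": sse(y, fit_log)}

# Hypothetical data and hypothetical fitted functions.
x = [0, 1, 2, 3]
y = [1.0, 2.7, 7.4, 20.1]
result = compare_on_original_scale(
    x, y,
    f1=lambda v: -1.6 + 6.3 * v,   # made-up linear fit
    f2=lambda v: 1.0 * v,          # made-up log-scale fit
)
print(result)
```

Scoring the second model on the log scale instead would not be comparable with the first, which is the point of equalizing the scale.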
To do:
• Correction: Remove polyandrous
• Sub models:
- Remove age, lengthmarray, b1momage
- Include A, B, C group only, respectively.
• Explain the variable importance measures
- ICSA – hard and soft skills
- Tibetan Study: semiparametric model (p+n independently, stepwisely, iteratively)
- PEM: Formulation, review some references from ICSA
- Whole person training:
o Key elements of research (literature search, learning, collaboration, solving problems, communication, …)
o Data Science Training
o Personal & Professional Development
• Semiparametric model vs GAM
Original:
Semi: Y = h(x1) + g(x2) + e, where h is parametric while g is nonparametric, and x1 and x2 are mutually exclusive subsets of the predictors x
GAM: Y = \sum_{i=1}^p g_i(x_i) + e
- Draft an outline of dissertation
- Create a GitHub space for communication
- Continue with SGD, making a possible connection with sEM pipeline
- Finish preparing the 3 chapters of DL
- Bring in at least one Q with substance
- SGD reference
- start to write
- identifiability
- optimization
- PEM: complete the simulation, do an application (keep suitable data in mind)
- ExpM:
- Property A (Coverage and Length of Conformal Prediction Interval)
- Property B (Convergence of MM algorithm, characterizing the objective functions, do we need penalty terms, …)
- Performance
C. Convergence of joint estimates? (simulation + proof)
D. Correct identification of features?
E. Modeling evaluated by prediction power/stability: repeated CV; CPI; comparison with simple linear model and DL model?
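For Property A, coverage and length can be checked empirically in the simulation. The sketch below assumes the split conformal variant with absolute residuals as the score, which the notes do not specify; the function names are hypothetical and the predictor is taken as already fitted.

```python
import math

def conformal_halfwidth(cal_abs_residuals, alpha):
    """Split conformal half-width from calibration |y - prediction| values.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration residual;
    returns infinity when the calibration set is too small for that rank.
    """
    n = len(cal_abs_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_abs_residuals)[k - 1]

def coverage_and_length(predictions, y_true, halfwidth):
    """Empirical coverage and interval length for prediction +/- halfwidth."""
    covered = sum(1 for p, y in zip(predictions, y_true) if abs(y - p) <= halfwidth)
    return covered / len(y_true), 2 * halfwidth

# Hypothetical calibration residuals and test points.
cal_abs_residuals = [float(i) for i in range(1, 11)]
q = conformal_halfwidth(cal_abs_residuals, alpha=0.25)
cov, length = coverage_and_length([0.0, 0.0, 0.0, 0.0], [1.0, -2.0, 3.0, -10.0], halfwidth=3.0)
print(q, cov, length)
```

Repeating this over many simulated data sets and averaging gives the coverage to compare against the nominal 1 - alpha, alongside the average interval length.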