## Guidelines for Statistical Methods for *JNEB*

The following are guidelines for authors and reviewers concerning statistical methods appropriate for Research Articles and Research Briefs.

## *P*-Value Guidelines

When presenting *P*-values in text, tables, or figures, *P*-values greater than 0.01 should be reported to 2 decimal places (e.g., *P* = 0.03, *P* = 0.02, *P* = 0.07) and those between 0.01 and 0.001 to 3 decimal places (e.g., *P* = 0.002, *P* = 0.007).

*P*-values less than 0.001 should be reported as *P* < 0.001.

While a significance level can be set at a value (e.g., *P* < 0.05), the significance of data should not be stated as *P* < 0.05, but rather as the exact *P*-value. All *P*-values (whether significant or not) should be listed in narrative, tables, and figures. For example, authors may have significance set at *P* < 0.05 in their methodology; when expressing the data for vegetable intake between two samples, write "group A mean intake was 2.0 ± 0.3 vs group B mean intake of 0.5 ± 0.7, *P* = 0.02". The *P*-values for all predictor variables in regression should be listed in tables.

The rationale for this decision is derived from input from our statistical reviewers, who believe that the *P*-value is a continuous measure that expresses the compatibility between the observed data and the hypothesized statistical model. Reporting or interpreting *P* < 0.05 as statistical significance for individual results represents a loss of information.

Abstracts should include significant values as described above but may reflect non-significant data as non-significant without a *P*-value.
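The rounding rules above can be captured in a small helper function; the following Python sketch is illustrative only and is not journal-supplied code:

```python
def format_p(p: float) -> str:
    """Format a P-value per the guideline:
    > 0.01 -> 2 decimal places; 0.001 to 0.01 -> 3 decimal places;
    < 0.001 -> reported as 'P < 0.001'."""
    if p < 0.001:
        return "P < 0.001"
    if p <= 0.01:
        return f"P = {p:.3f}"
    return f"P = {p:.2f}"

print(format_p(0.034))   # P = 0.03
print(format_p(0.0042))  # P = 0.004
print(format_p(0.0003))  # P < 0.001
```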

## cRCT guidelines

The cRCT study design occurs when groups (eg, schools, classrooms, clinics) are randomized but data collection occurs at the individual level. This presents some challenges for authors and reviewers. *JNEB* asks that the following be included in all cRCT study designs:

- The number within a group, the number of groups, and the strength of group-level dependency (intraclass correlation coefficient [ICC]) should be considered.
- A power analysis using the outcome variable of interest, in consideration of the cluster, will help define how many subjects are needed to see a "true" effect and the effect size.
- Include whether your model used fixed effects, random effects, or mixed effects with consideration of your cluster and a reference or explanation as to why this model was chosen.

Note: Authors need to have more than 2 clusters per group to be able to see significant differences when the cluster is accounted for. Papers with only 2 treatment groups and 2 control groups (or fewer) will not be accepted unless they are pilot trials with outcomes related to feasibility or effect size as the main outcomes.
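One way to see why clustering must be considered in the power analysis is the design effect, which inflates the sample size needed relative to individual randomization. A minimal sketch, assuming equal cluster sizes (the cluster sizes and ICC below are made up for illustration):

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Design effect for a cRCT with equal cluster sizes:
    DEFF = 1 + (m - 1) * ICC, where m is the number per cluster."""
    return 1 + (cluster_size - 1) * icc

def inflated_n(n_individual: int, cluster_size: int, icc: float) -> float:
    """Sample size needed once clustering is accounted for."""
    return n_individual * design_effect(cluster_size, icc)

# Example: 100 participants would suffice under individual randomization,
# but with classrooms of 25 and ICC = 0.05 the requirement more than doubles.
print(inflated_n(100, 25, 0.05))  # 220.0
```

Even a modest ICC can substantially increase the required sample, which is why the number of clusters, cluster size, and ICC all belong in the power analysis.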

**Participants**

Quantitative data

- How did the authors decide upon the number of participants? Power analysis is the strongest rationale for determining the number of participants. If there was no power analysis, another rationale for participant recruitment should be provided.

Qualitative data

- How did the authors decide upon the number of participants? Recruitment until responses were saturated is the preferred method of determining the number of participants. If this was not the rationale, a rationale should be provided.

**Surveys**

Was reliability of the survey tested?

- Internal reliability
  - Usually Cronbach α is reported for multiple items [questions] relating to a similar idea or construct.
  - In general, we expect Cronbach α to follow the recommendations of George and Mallery (2003), who suggest the following rules of thumb for evaluating alpha coefficients: "> 0.8 good, > 0.7 and < 0.8 acceptable, > 0.6 and < 0.7 questionable, > 0.5 and < 0.6 poor, and < 0.5 unacceptable." Values of 0.90 or greater suggest redundancy of items. Although acceptable in terms of publication, authors may want to acknowledge coefficient values less than acceptable (ie, < 0.7) as a limitation of the tool.
  - For Cronbach α less than 0.70, authors should try deleting items to improve Cronbach α; alternatives include not combining items into a composite score and not using any of the items in the results or analysis.
  - If Cronbach α is close to 0.70 and there are fewer than 100 participants, authors should acknowledge in the limitations that this measure may not be valid because of the limited number of participants.
  - If Cronbach α is in the poor or unacceptable range (< 0.6), these items should not be used in the results or discussion.
  - Of note, a large sample (n ≥ 200) will produce a more reliable Cronbach α, although n = 100 may be sufficient.
  - Discretion is left to the editor if novel, pilot, or unique data are involved. In addition, authors may employ other measures of internal consistency with citations and explanations of the statistic.

- Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297-334.
- George D, Mallery P. SPSS for Windows step by step guide: A simple guide and reference. 11.0 update (4th ed.). Boston, MA: Allyn and Bacon; 2003.
- Yurdugul H. Minimum sample size for Cronbach’s coefficient alpha: a Monte-Carlo study. HU J Educ. 2008;35:397-405.
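Cronbach α can be computed directly from the item-level and total-score variances (Cronbach, 1951). The following stdlib-only Python sketch is illustrative, with made-up item scores:

```python
from statistics import pvariance

def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
    `items` holds k items, each a list of scores across the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

# Three perfectly consistent (identical) items -> alpha = 1.0
identical = [[1, 2, 3, 4, 5]] * 3
print(round(cronbach_alpha(identical), 3))  # 1.0
```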

- Inter-rater reliability
  - Usually Kappa statistics are used to determine the reliability among several raters, educators, or surveyors.
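For the two-rater case, unweighted Cohen kappa can be sketched as follows; the ratings are hypothetical, and multi-rater designs would instead use a statistic such as Fleiss kappa:

```python
from collections import Counter

def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen kappa for two raters: (po - pe) / (1 - pe),
    where po is observed agreement and pe is chance-expected agreement."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

a = ["yes", "yes", "yes", "no", "no", "no"]
b = ["yes", "yes", "no", "no", "no", "yes"]
print(round(cohen_kappa(a, b), 3))  # 0.333
```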

- Reliability over time or stability
  - Usually a test re-test is used to determine if the items, questions, or constructs are stable over time (no difference between two time points) when there is no intervention. The statistic used is often a t-test, ANOVA, or chi-square, depending on the question and response type.
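For paired test-retest data, a paired t statistic on the change scores is one common option; a stdlib-only sketch with hypothetical scores (a t near 0, i.e., non-significant, supports stability):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(time1: list[float], time2: list[float]) -> float:
    """Paired t statistic on change scores: t = mean(d) / (sd(d) / sqrt(n))."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

test_scores = [5.0, 6.0, 7.0, 8.0]
retest_scores = [6.0, 6.0, 8.0, 9.0]
print(round(paired_t(test_scores, retest_scores), 2))  # 3.0
```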

- If citations are used to verify the reliability of a survey or survey items, they should have been tested in a similar population.

## Cross-Sectional Data Guidelines for Authors and Reviewers

Data analyses for cross-sectional data should begin with tests of distribution:

p-p plots, q-q plots, skewness and kurtosis, Shapiro-Wilk, Kolmogorov-Smirnov, or Lilliefors.

- If normally distributed: comparing means [t-tests] or variances [ANOVA].
- If not normally distributed, then use chi square [indicating which chi square], Mann-Whitney U, Kolmogorov-Smirnov Z.
- If log-transforming to address non-normality of variable distributions, explain how and why with references.
- If categorical with 2 categories, use binomial analyses.
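As one illustration of the skewness-and-kurtosis screen above, the moments can be computed directly; the cutoffs in the comment (|skewness| < 2, |excess kurtosis| < 7) are a common rule of thumb from the literature, not a *JNEB* requirement:

```python
from statistics import mean

def skew_kurtosis(data: list[float]) -> tuple[float, float]:
    """Skewness and excess kurtosis from central moments (population form).
    A common rule of thumb treats |skewness| < 2 and |excess kurtosis| < 7
    as consistent with approximate normality."""
    n, m = len(data), mean(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

skew, kurt = skew_kurtosis([1, 2, 3, 4, 5])
print(round(skew, 3), round(kurt, 3))  # 0.0 -1.3
```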

### For chi-square analyses

- If chi-square analysis was used, please indicate if chi-square goodness of fit, test of homogeneity, or test of independence was used.
- Chi-square *goodness of fit* is used to test if the distribution of 1 categorical variable is the same as or different from the expected distribution. May also be called Pearson's chi-square goodness of fit. It is the most widely used chi-square test and is appropriate for unpaired data from large samples.
  - **Example:** Are the selections of pineapple chunks, cookies, and ice cream as a dessert choice by fifth graders the same?
  - Results are presented as the chi-square statistic, df, and *P*. If *P* ≤ 0.05, we reject the null hypothesis that the desserts are selected equally (ie, *there are significant differences* in dessert selection).
- Chi-square *test of homogeneity* is used to determine if 2 or more distributions of the same categorical variable come from the same population distribution.
  - **Example:** Are the distributions of responses about frequency of eating vegetables the same for adults living on the East Coast, West Coast, and Midwest?
  - Results are presented as the chi-square statistic, df, and *P*. If *P* < 0.05, reject the null hypothesis that the distributions are from the same population (ie, the frequency of eating vegetables is *significantly different* across regions).
- Chi-square *test of independence* is used to determine if there is an association among 2 or more variables. This test determines only whether there is a significant association, not its strength; variables should be nominal, categorical. Cramer's V may be used to test the strength of the relationship among variables, especially if the comparison is larger than a 2 x 2 table.
  - **Example:** Is socioeconomic group associated with weight status?
  - Results are presented as the chi-square statistic, df, and *P*. If *P* ≤ 0.05, reject the null hypothesis that the variables are independent (ie, socioeconomic group *is significantly associated* with weight status).

McHugh ML. Chi square tests of independence. Biochemia Medica 2013;23(2):143–9.

**Note that a significant chi-square goodness of fit or test of homogeneity leads to concluding that there are significant differences, whereas a significant chi-square test of independence leads to concluding that the variables ARE associated.**
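The goodness-of-fit computation behind the dessert example can be sketched as follows; the counts are hypothetical, and 5.991 is the chi-square critical value for df = 2 at the 0.05 level:

```python
def chi_square_gof(observed: list[float], expected: list[float]) -> float:
    """Pearson chi-square goodness-of-fit statistic: sum((O - E)^2 / E)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical dessert selections by 99 fifth graders (df = k - 1 = 2)
observed = [30, 45, 24]  # pineapple chunks, cookies, ice cream
expected = [33, 33, 33]  # equal selection under the null hypothesis
stat = chi_square_gof(observed, expected)
print(round(stat, 2), stat > 5.991)  # 7.09 True -> significant at 0.05
```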

The Mann-Whitney U test assesses whether two samples have been drawn from the same population.

### When presenting correlations

Influential observations should be explained, with a supporting reference, for either removing influential observations or conducting the statistical tests with and without them.

Both the strength of the correlation (R²) and statistical significance should be presented.

For large national databases, such as NHANES, appropriate weighting within the analyses should be included and explained.
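Both reporting items (strength and significance) can be illustrated with a stdlib-only sketch; the data below are made up, and `r_t_statistic` is the standard t test of r against 0:

```python
from math import sqrt
from statistics import mean

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient r."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def r_t_statistic(r: float, n: int) -> float:
    """t statistic for testing r against 0: t = r * sqrt(n - 2) / sqrt(1 - r^2),
    with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)
print(round(r, 2), round(r ** 2, 2))  # 0.8 0.64
```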

### When using regression

- Dependent and independent variables should be specified.
- Assumptions of linearity, homogeneity of variance, and independence should be addressed.
- *JNEB* does not require variance inflation factors (VIF) with regression models. VIF measures how much the variance of the estimated regression coefficients is inflated compared with when the predictor variables are not linearly related; it is used to gauge how much multicollinearity exists in a regression analysis. However, in cases where exploratory analyses include large numbers of variables or when authors add variables outside of the original model, VIF may be helpful.
- R value should be reported to show strength of relationship.
- R² should be reported to show how much of the variance is accounted for by the model.
- Effect size can be represented by a parameter estimate, Cohen's d, or an odds ratio.
- Weights of the variables (standardized betas) should be presented in a table with their variability (eg, SE, 95% CI).

Examples follow:
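As an illustration of the reporting items above (parameter estimates and R²), a stdlib-only simple linear regression on made-up data; a real analysis would use a statistical package that also supplies SEs, CIs, and *P*-values:

```python
from statistics import mean

def simple_ols(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Least-squares fit y = b0 + b1*x; returns (intercept, slope, R^2)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_res / ss_tot

# Hypothetical predictor (x) and outcome (y) values
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 5.0, 6.9, 9.0]
b0, b1, r2 = simple_ols(x, y)
print(round(b0, 2), round(b1, 2), round(r2, 3))
```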

### Means, SD, medians, interquartile ranges

Means and SD should be presented if data are normally distributed. Means should be reported to one decimal place more than the measurement scale (eg, calorie intake to tenths when measured in whole calories). If the significance of the data cannot be visualized while adhering to this guideline, exceptions will be made accordingly. SD should be presented; SEM or SE may be used if presenting groups within groups. In particular, SE should be used for nationally representative data, such as NHANES.

Medians and IQR should be presented if data are not normally distributed.

### Missing data

Please be specific and justify the approach to managing missing data.

**Data presentation**

Were the following addressed in Methods and Results?

- Determination & treatment of outliers.
- Treatment of missing data.
- Means and SD if data have a normal distribution; IQR and median if not normally distributed.
- SEM used only if multiple samples gathered.

**Decisions on data analysis**

Did the authors provide a rationale for deciding to use parametric vs nonparametric analyses? Authors should:

- Tell how they decided by testing the distribution of the data for normality; how did they test to determine if data were normally distributed [p-p plots, q-q plots, skewness and kurtosis, Shapiro-Wilk, Kolmogorov-Smirnov, or Lilliefors].
- If normally distributed: Comparing means [t-tests] or variances [ANOVA].
- If not normal then Chi square, Mann-Whitney U, Kolmogorov-Smirnov Z.
- If categorical with 2 categories, binomial.

For Chi square analyses:

- If Chi square analysis was used, please indicate if Chi square goodness of fit, test of homogeneity, or test of independence was used.
- Chi square *goodness of fit* is used to test if the distribution of 1 categorical variable is the same as or different from the expected distribution. May also be called Pearson's chi square goodness of fit.
  - Example: Are the selections of pineapple chunks, cookies, and ice cream as a dessert choice by fifth graders the same?
  - Results presented as Chi square statistic, df, and *P*. If *P* ≤ 0.05, we reject the null hypothesis that the desserts are selected equally (ie, *there are significant differences* in dessert selection).
- Chi square *test of homogeneity* is used to determine if 2 or more distributions come from the same population distribution.
  - Example: Are the distributions of responses about frequency of eating vegetables the same for adults living on the East Coast, West Coast, and Midwest?
  - Results presented as Chi square statistic, df, and *P*. If *P* ≤ 0.05, reject the null hypothesis that the distributions are from the same population (ie, the frequency of eating vegetables is *significantly different* across regions).
- Chi square *test of independence* is used to determine if there is an association among 2 or more variables. This test only determines if there is a significant association, not the strength; variables should be nominal, categorical. Cramer's V may be used to test the strength of relationship among variables, especially if the comparison is more than a 2 x 2 table.
  - Example: Is the degree of happiness associated with the degree of wealth?
  - Results presented as Chi square statistic, df, and *P*. If *P* < 0.05, reject the null hypothesis that happiness and wealth are independent (ie, happiness *is significantly* associated with wealth).

**Note that a significant Chi square goodness of fit or test of homogeneity leads to concluding that there are significant differences, whereas a significant Chi square test of independence leads to concluding that the variables ARE associated.**

**When presenting correlations**

- Outliers should be examined and their treatment justified; check for normal distribution and, if not normally distributed, consider log transformation.

**When using regression**

- Dependent and independent variables should be identified.
- R value should be reported to show strength of relationship.
- R² should be reported to show how much of the variance is accounted for by the model.
- Weight of the variables should be presented in a table.
- SE and CI may also be presented.


## Guidelines for Reliability and Validity Testing of Questionnaires

- Face validity
  - Some testing of questionnaires with the target population should be completed to evaluate understanding; often accomplished with cognitive interviewing.
  - For other areas of reliability/validity testing, there should be a sample size rationale with appropriate reference(s) and/or n-to-item ratio, or a statement that addresses limitations that precluded such an *a priori* rationale.
- Internal reliability
  - Usually Cronbach's α is reported for multiple items [questions] relating to a similar idea or construct.
  - In general, we expect Cronbach's alpha to follow the recommendations of George and Mallery (2003), who suggest the following rules of thumb for evaluating alpha coefficients: greater than 0.8 as 'good', within (0.7, 0.8) as 'acceptable', within (0.6, 0.7) as 'questionable', within (0.5, 0.6) as 'poor', and less than 0.5 as 'unacceptable'. Values of 0.9 or greater may suggest redundancies of items and should be examined for possible removal. Although acceptable in terms of publication, authors may want to acknowledge coefficient values less than acceptable (i.e., < 0.7) as a limitation of the tool.
  - For Cronbach's alpha less than 0.7, authors should try deleting items to improve the value of Cronbach's alpha; alternate possible ameliorations include not combining items into a composite score and not using any of the items in the results or analysis.
  - If Cronbach's alpha is close to 0.7 and there are fewer than 100 participants, authors should acknowledge in the limitations section that this measure may not be valid and that testing with a larger sample is needed to corroborate the reliability of the tool.
  - If Cronbach's alpha is in the poor or unacceptable range (< 0.6), these items should not be used in the results or discussion.
  - Generally, a large sample (e.g., n ≥ 200) will produce better values for Cronbach's alpha, although n = 100 may be sufficient to produce 'good' or 'acceptable' values.
  - Discretion is left to the editor if novel, pilot, or unique data are involved. In addition, authors may employ other measures of internal reliability with citations and explanations of the statistic chosen.

- Other tests of internal reliability include factor analysis and temporal reliability.
- Factor analysis
- Sample size is a concern for factor analyses and should be justified.
- Exploratory factor analysis (EFA) studies the relationship between constructs and variables when there is no prior knowledge of general themes or of what latent constructs might exist. As such, the analyses are inductive. Some may call this construct validity.
- In general, we expect EFA analysis to include:
- The model fit methodology and an assessment of the fit (e.g., comparative fit index (CFI); the Tucker–Lewis Index (TLI), also known as non-normed fit index; the root mean square error of approximation (RMSEA); or the standardized root mean square of the residuals (SRMR)). Acceptable fits are indicated by a CFI and a TLI of ≥ 0.95; an RMSEA of ≤ 0.06; or an SRMR of ≤ 0.08.
- An inter-factor correlation matrix
- The Scree plot cut-off
- A cutoff value for maintaining a factor loading
- How cross loadings were handled
- Eigenvalue criteria
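To illustrate the eigenvalue criterion above (the Kaiser rule: retain factors with eigenvalue > 1), a toy two-item sketch; for a 2 x 2 correlation matrix the eigenvalues are simply 1 + r and 1 - r. Real EFA requires a dedicated package (eg, the `psych` package in R), and this toy example is not a substitute:

```python
def two_item_eigenvalues(r: float) -> tuple[float, float]:
    """Eigenvalues of the 2 x 2 correlation matrix [[1, r], [r, 1]]."""
    return 1 + r, 1 - r

def factors_retained(eigenvalues: tuple[float, ...]) -> int:
    """Kaiser criterion: count factors with eigenvalue > 1."""
    return sum(ev > 1 for ev in eigenvalues)

# Two strongly correlated items (r = 0.8) load on a single factor
evs = two_item_eigenvalues(0.8)
print(tuple(round(ev, 1) for ev in evs), factors_retained(evs))  # (1.8, 0.2) 1
```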

- Confirmatory factor analysis (CFA) is used when latent constructs have been identified and their relationships are being explored. As such, the analyses are more deductive.
- In general, we expect:
- The number of factors
- Factor loadings
- Model fit estimates and assessment / diagnostics
- Estimated covariance matrix
- Parameter estimates and methods used
- An inter-factor correlation matrix
- Analysis of residual indices
- Discussion or data to determine if factors have the same meaning across groups
- Convergent and divergent validity testing may also be reported

- Temporal reliability or stability (test/re-test reliability)
- Estimation of temporal stability is necessary for tools intended to measure the same concept or construct on more than one occasion, such as in the case of pre-to-post outcome evaluation of interventions. However, test-retest should be used in situations in which variables are not likely to change within the time interval (eg, dietary intake variables can vary and thus are not conducive to test-retest).
- We expect reports of temporal reliability to include:
- The time interval between testing occasions, with a supporting reference
- The correlation used; or tests of differences (change scores) such as a t-test or the appropriate non-parametric equivalent

- Cronbach LJ. Coefficient alpha and the internal structure of tests. *Psychometrika*. 1951;16:297-334.
- George D, Mallery P. *SPSS for Windows Step by Step Guide: A Simple Guide and Reference*. 11.0 update (4th edition). Boston, MA: Allyn and Bacon; 2003.
- Yurdugul H. Minimum sample size for Cronbach's coefficient alpha: a Monte-Carlo study. *Hacettepe University Journal of Education*. 2008;35:397-405.
- Hu LT, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. *Structural Equation Modeling*. 1999;6:1-55.
- Gregorich SE. Do self-report instruments allow meaningful comparisons across diverse population groups? Testing measurement invariance using the confirmatory factor analysis framework. *Med Care*. 2006;44(11 suppl 3):S78-94.
- Townsend M. Evaluating food stamp nutrition education: process for development and validation of evaluation measures. *J Nutr Educ Behav*. 2006;38:18-24.

*Updated September 26, 2018*