
Statistical Methods – The Conventional Approach vs. The Simulation-based Approach


Abstract. In this blog, we provide a comparison between simulation-based and conventional statistical methods, examining their respective principles, applications, strengths, and weaknesses. We provide a simple real-life example to illustrate similarities and differences between the two approaches. 

As a Senior Biostatistician, I’ve spent years navigating the intricate world of data, seeking the most reliable methods to draw meaningful conclusions. Two schools of statistical thinking, the conventional and the simulation-based, approach this task in distinct ways, and the differences begin with how each constructs the sampling distribution, the ‘big picture’, from the limited data collected.

What are Statistical Methods? 

Statistical methods are indispensable tools in data analysis, facilitating decision-making, hypothesis testing, and predictive modeling across diverse domains. Among these methods, simulation-based and conventional statistical approaches stand out for their distinct techniques and applications. 

The main challenge for every statistical creed is constructing what we refer to as the sampling distribution, or the ‘big picture’. We need to get the most out of the limited collected data. We would therefore like to derive results that not only represent the data itself but also provide insight into all the possible data that could have been collected: where does this collected data stand among the vast unobserved possibilities? The construction of this big picture is where the differences between the conventional and simulation-based statistical approaches originate.

Conventional Statistical Methods 

Conventional statistical methods are grounded in mathematical theory and probability distributions. They typically make assumptions about underlying population distributions and utilize sample data to draw conclusions or make inferences. The critical inferences, such as the p-value and the CI (Confidence Interval), are derived using test statistics, and the test statistics that represent the sampling distribution are constructed from the assumed underlying distribution.

A classic example of a conventional statistical method is the sample mean x̅ for a normally distributed population. x̅ estimates the true mean of the population, and it is known to be normally distributed itself, with a variance equal to the population variance divided by the sample size n.
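This property is easy to verify numerically. The sketch below (all values are illustrative, chosen only for the demonstration) draws many samples from a normal population and checks that the variance of the sample means is close to the theoretical σ²/n:

```r
# Illustrative sketch: the sample mean of normal data is itself normally
# distributed, with variance sigma^2 / n.
set.seed(42)
mu <- 10; sigma <- 2; n <- 25

# Draw 10,000 samples of size n and record each sample mean
means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# The simulated variance of the sample means should be close to
# the theoretical value sigma^2 / n = 4 / 25 = 0.16
var(means)
sigma^2 / n
```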

Aside from classical hypothesis testing, regression analysis and analysis of variance (ANOVA) are other prime examples of conventional statistical methods prevalent in research and decision-making. 

Simulation-based Statistical Methods 

Simulation-based methods involve the creation of artificial models to mimic real-world phenomena. These methods leverage repeated random sampling to approximate complex systems or processes. 

There are several simulation-based statistical approaches. The most widely used methods are the Markov Chain Monte Carlo (MCMC) and the resampling approaches, in particular the bootstrapping method. 

The MCMC method draws samples from a target probability distribution by constructing a Markov chain that converges to it. These simulation-based approaches are particularly useful for Bayesian inference, where the goal is to estimate posterior distributions of model parameters given observed data. MCMC techniques are widely employed in statistics, machine learning, and computational biology. 
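To make this concrete, here is a minimal random-walk Metropolis sampler, a sketch rather than production MCMC. The data, the known standard deviation, and the flat prior are all illustrative assumptions; under them the posterior of the mean is centered on the sample mean, which the chain should recover:

```r
# Minimal Metropolis sampler (illustrative sketch): sample from the
# posterior of a normal mean, assuming a known sd of 1 and a flat prior.
set.seed(1)
y <- rnorm(50, mean = 3, sd = 1)   # simulated "observed" data

# Log-posterior (up to a constant) under the flat prior
log_post <- function(mu) sum(dnorm(y, mean = mu, sd = 1, log = TRUE))

n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 0                      # arbitrary starting value
for (i in 2:n_iter) {
  # Random-walk proposal around the current state
  proposal <- rnorm(1, mean = chain[i - 1], sd = 0.5)
  # Accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(proposal) - log_post(chain[i - 1])) {
    chain[i] <- proposal
  } else {
    chain[i] <- chain[i - 1]
  }
}

posterior <- chain[-(1:1000)]      # discard burn-in
mean(posterior)                    # should be close to mean(y)
```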

The idea behind the bootstrapping process is that every subject in the observed sample represents a population of unobserved data. Therefore, if we were to collect a different set of data, we could well see similar observations in it. In fact, each observed data point could occur more than once, as there could always be several individuals out there with the same characteristics. 

With simulation-based approaches, the sampling distribution is built from the many artificially generated data sets, and inferences are then based on this sampling distribution. For instance, the p-value is evaluated from where the statistic of the real observed data falls within the sampling distribution. As another example, a 95% CI (here standing for the Credible Interval rather than the Confidence Interval) is constructed by marking off a range of the sampling distribution that contains 95% of the possible outcomes (usually the 2.5%–97.5% range is chosen). 
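A few lines of R capture the whole idea. The hypothetical sample of rating scores below is made up for illustration; the bootstrap resamples it with replacement, collects the resampled means as the sampling distribution, and reads a one-sided 95% lower bound off its 5th percentile:

```r
# Illustrative bootstrap sketch: build the sampling distribution of the
# mean by resampling the observed data with replacement.
set.seed(7)
observed <- c(5, 5, 5, 4, 4, 3, 2, 5)   # hypothetical rating scores

# 2000 bootstrap resamples, keeping the mean of each
boot_means <- replicate(2000, mean(sample(observed, replace = TRUE)))

# One-sided 95% lower bound = 5th percentile of the bootstrap means
quantile(boot_means, probs = 0.05)
```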

A Comparison Between the Two Approaches 

Simulation-based approaches offer several advantages over conventional statistical methods in certain contexts. They are particularly valuable for modeling complex systems and exploring hypothetical scenarios. For example, in multi-hypothesis tests, where multiple competing hypotheses need to be evaluated simultaneously, simulation-based methods can efficiently generate data under each hypothesis to assess their respective likelihoods. Additionally, simulation-based methods excel in power calculation, where the sample size required to detect a predefined effect size with a given level of confidence is determined through repeated simulations under various scenarios. 
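A simulation-based power calculation, as described above, can be sketched in a few lines. The effect size, standard deviation, and group size here are illustrative assumptions; the power estimate is simply the proportion of simulated trials in which the test rejects the null:

```r
# Illustrative simulation-based power calculation for a two-sample t-test.
set.seed(123)
n <- 30         # assumed subjects per group
effect <- 0.8   # assumed true difference in means (in SD units)
alpha <- 0.05

# Simulate 2000 trials and record each trial's p-value
p_values <- replicate(2000, {
  x <- rnorm(n, mean = 0,      sd = 1)
  y <- rnorm(n, mean = effect, sd = 1)
  t.test(x, y)$p.value
})

# Estimated power = proportion of trials that reject the null
power_sim <- mean(p_values < alpha)
power_sim
```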

On the other hand, conventional statistical methods are preferred when analytical solutions are feasible or when making inferences about well-defined populations. These methods rely on parametric assumptions and statistical tests with known properties, making them suitable for hypothesis testing and parameter estimation. For instance, in power calculation, conventional statistical methods often rely on theoretical formulas derived from probability distributions to determine the sample size needed for a desired level of statistical power. 
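The conventional counterpart uses a closed-form formula rather than repeated simulation. Base R's `power.t.test()` (in the stats package) does exactly this; the effect size and group size below are illustrative assumptions, not values from the text:

```r
# Conventional (closed-form) power calculation for a two-sample t-test,
# using power.t.test() from base R's stats package.

# Power achieved with 30 subjects per group at an assumed effect of 0.8 SD
power.t.test(n = 30, delta = 0.8, sd = 1, sig.level = 0.05)$power

# Or, solving the same formula for the sample size needed for 90% power
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.9)$n
```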

The Pros and Cons 

Simulation-based approaches demonstrate robustness against violations of distributional assumptions and can handle nonlinear relationships effectively. However, they can be computationally intensive, requiring significant computational resources for large-scale simulations or optimization problems. Moreover, the validity of simulation results hinges on the accuracy of the underlying model assumptions, introducing uncertainty into the analysis. 

Conventional statistical methods, while simpler to implement and interpret, are limited by their reliance on parametric assumptions and the availability of closed-form solutions. They may not be suitable for complex, real-world problems with unknown or non-standard distributions. Furthermore, conventional statistical methods may fail to capture the inherent variability and randomness present in many natural phenomena, potentially leading to biased or misleading results. 

Both simulation-based and conventional statistical methods offer valuable tools for data analysis and inference, each with its unique strengths and weaknesses. The choice between these methods depends on the nature of the problem, the availability of data, and the underlying assumptions. By understanding the nuances of each approach, researchers can make informed decisions and enhance the rigor and reliability of their statistical analyses. 

Within Clinical Trials 

Because regulatory agencies’ guidelines for the statistical analysis of clinical trials are typically strict, and mostly grounded in conventional methods, simulation-based approaches are less often used for analyzing data that must be submitted to the agencies: the validity and transparency of such methods would need to be justified and demonstrated. They can, however, be freely used for purposes such as power and sample size calculation, or for providing results for information only. 

An Example of Using the Conventional and the Simulation-based Methods 

In this section we present a simple example to demonstrate how close the results of the two entirely different methods discussed in the previous sections can be. 

The topic, methods and codes presented in this section have been developed by the author. 

We all rely on the Amazon rating scores to decide how good an item might be. Each individual can assign a score of 1 (indicating total dissatisfaction) to 5 (indicating perfect satisfaction) to the item. 

Therefore, a higher average score indicates that the item has satisfied the previous customers better. However, another important factor should be taken into consideration here: the number of raters! Consider two items, both with an average score of 4.5, one based on 10 reviews and the other on 1000 reviews. Everyone probably knows, intuitively, that the second rating score is more reliable. This fact can be reflected using a statistical inference called the one-sided Confidence Interval (conventional statistics) or the one-sided Credible Interval (simulation-based statistics), both abbreviated as [one-sided] CI. 

In simple language, a 95% one-sided CI asserts that the actual rating score (the one that would be obtained if every single person in the represented population who could potentially buy the product rated it) is greater than the estimated bound with a chance of 95%. For the above-mentioned examples with an average score of 4.5, the one-sided 95% CI could have a lower bound of 3.4 for the first item and 4.2 for the second, indicating that the real rating score of the first item could potentially be much lower. 

We used a simulation-based approach and a conventional approach to infer this lower bound for several examples. The table below displays the results. 

| Total number of raters | % scoring 5 | % scoring 4 | % scoring 3 | % scoring 2 | % scoring 1 | Average rating score | 95% lower bound (conventional) | 95% lower bound (simulation-based) |
|---|---|---|---|---|---|---|---|---|
| 8 | 78 | 11 | 7 | 4 | 0 | 4.62 | 4.13 | 4.25 |
| 68 | 78 | 11 | 7 | 4 | 0 | 4.62 | 4.45 | 4.44 |
| 288 | 61 | 20 | 6 | 6 | 7 | 4.22 | 4.09 | 4.09 |
| 27300 | 56 | 19 | 11 | 5 | 9 | 4.08 | 4.07 | 4.07 |

As the table above shows, the results of the two methods are very close or identical! Still, for the first row, with its small sample size, the simulation-based approach could be more reliable, since the conventional approach relies on normality of the data, which may not hold. One issue with the simulation-based approach, though, is that in rows 1 and 2 there are no observations with a rating score of 1, so that score is never represented in the simulated samples. 

The R code below can be easily run in R, or in the online version of R (at https://rdrr.io/snippets/), to analyze any arbitrary data: 

The R code for the conventional approach: 

CI_rate_conventional <- function(chance, N, P5, P4, P3, P2, P1) {
  # Rebuild the individual scores from the total count and the percentages
  N5 <- round((P5 / 100) * N)
  N4 <- round((P4 / 100) * N)
  N3 <- round((P3 / 100) * N)
  N2 <- round((P2 / 100) * N)
  N1 <- N - (N5 + N4 + N3 + N2)
  scores <- c(rep(5, N5), rep(4, N4), rep(3, N3), rep(2, N2), rep(1, N1))
  data_mean <- round(mean(scores), 2)
  # One-sided t-based confidence interval at the requested level
  ci_lower <- t.test(scores, conf.level = chance / 100, alternative = "greater")$conf.int[1]
  lower <- round(unname(ci_lower), 2)
  print(paste0("The mean rating score is: ", data_mean, ". There is ", chance, "% chance that the true rating score would be larger than: ", lower, "."))
}

The R code for the simulation-based approach: 

CI_rate_Simulation <- function(chance, N, P5, P4, P3, P2, P1) {
  Final_samp <- c()
  # Rebuild the individual scores from the total count and the percentages
  N5 <- round((P5 / 100) * N)
  N4 <- round((P4 / 100) * N)
  N3 <- round((P3 / 100) * N)
  N2 <- round((P2 / 100) * N)
  N1 <- N - (N5 + N4 + N3 + N2)
  scores <- c(rep(5, N5), rep(4, N4), rep(3, N3), rep(2, N2), rep(1, N1))
  data_mean <- round(mean(scores), 2)
  # Bootstrap: resample the scores with replacement 1000 times and
  # collect the mean of each resample
  for (i in 1:1000) {
    samp <- sample(scores, N, replace = TRUE)
    Final_samp <- c(Final_samp, mean(samp))
  }
  # The one-sided lower bound is the (100 - chance)th percentile of the
  # bootstrap sampling distribution
  rate1 <- quantile(Final_samp, probs = (100 - chance) / 100)
  lower <- round(unname(rate1), 2)
  print(paste0("The mean rating score is: ", data_mean, ". There is ", chance, "% chance that the true rating score would be larger than: ", lower, "."))
}

In both functions, the first argument represents the CI level (e.g., 95 represents a 95% CI), the second the total number of raters, and the remaining five arguments the percentages of the rating scores from 5 down to 1, respectively. 

For example, for the first row in the above table, one needs to run the relevant line below right after running either of the two functions above: 

CI_rate_conventional(95, 8, 78, 11, 7, 4, 0) 

CI_rate_Simulation(95, 8, 78, 11, 7, 4, 0) 

Why Choose BioPharma Services for Your Next Drug Development Project? 

At BioPharma Services, we use a variety of statistical models that fit various scenarios. The models are chosen carefully with deep theoretical considerations by our biostatistics team. In cases where there are multiple methods that can be appropriately used, we select the one that is more comprehensive (considering the case) and more likely to cover all specific aspects of the study. By exploring these principles, applications, strengths, and weaknesses through real-life examples, we can appreciate the diversity and utility of statistical methods in various fields.  

We can explain complex models in simple language, using real-life examples to make the model more understandable to non-experts. If you want to learn more, fill out our Discovery Call form to learn how we can support your next early phase clinical project.

Written By:

Jafar Soltani Farsani

Senior Biostatistician

BioPharma Services, Inc., a HEALWELL AI and clinical trial services company, is a full-service Contract Clinical Research Organization (CRO) based in Toronto, Canada, specializing in Phase 1/2a clinical trials, Human Abuse Liability (HAL) and Bioequivalence clinical trials for international pharmaceutical companies worldwide. BioPharma Services conducts clinical research operations from its Canadian facility, with access to healthy volunteers and special populations.
