Sample size discussion for Thursday Jan. 8
A step toward answering Andres' question is to treat the "number of subjects available" and "cost" constraints as a single constraint. This would involve generalizing the concept of cost so that the cost of an additional observation may vary as the sample size increases.
Increasing costs are common in practice: for example, the most recent charts may already be digitized (very accessible, hence cheap), slightly older charts may be organized in a filing cabinet (still relatively cheap), but the pre-2000 charts may sit disorganized in a cardboard box after an office move or flood (much more expensive).
To enforce the number-of-subjects constraint, we could set the cost of any observation beyond that point to infinity. In this case the simple method given by the book would no longer apply, but if the costs were known it would not be too difficult to work out the solution. I think we would then be forced to use observations from the single remaining available set regardless of cost.
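The two ideas above (a per-observation cost that rises with sample size, and an infinite cost past the number of available subjects) can be sketched in a few lines. Everything below is illustrative: the tier boundaries (200 digitized, 200 in the cabinet, 100 in the box) and the per-chart costs are made-up numbers, not anything from the actual chart data.

```python
import math

def marginal_cost(i):
    """Hypothetical cost of retrieving the i-th chart: digitized charts
    are cheap, filing-cabinet charts cost more, boxed pre-2000 charts
    are expensive, and no chart exists past the 500th (infinite cost)."""
    if i <= 200:       # digitized
        return 1.0
    elif i <= 400:     # filing cabinet
        return 2.0
    elif i <= 500:     # cardboard box
        return 10.0
    else:              # no more subjects available
        return math.inf

def total_cost(n):
    """Cumulative cost of collecting the first n observations."""
    return sum(marginal_cost(i) for i in range(1, n + 1))

def max_affordable_n(budget):
    """Largest sample size whose total cost stays within the budget.
    The infinite marginal cost past the last available subject makes
    the availability constraint fall out of the cost model for free."""
    n, spent = 0, 0.0
    while True:
        c = marginal_cost(n + 1)
        if spent + c > budget:
            return n
        n, spent = n + 1, spent + c
```

For instance, with these made-up numbers a budget of 300 buys the 200 digitized charts plus 50 cabinet charts, and no budget, however large, buys more than 500 observations.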
Another related question is how to solve the cost problem when additional samples come in batches at a fixed cost: for example, sorting one box of charts yields 10 observations rather than just one.
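Under the batch view, cost becomes a step function of the sample size: a minimal sketch, assuming (hypothetically) that each box yields 10 observations at a fixed cost of 50 per box.

```python
import math

def batch_cost(n, batch_size=10, cost_per_batch=50.0):
    """Cost of obtaining at least n observations when sorting one box
    yields batch_size observations at a fixed cost_per_batch.
    Both parameters are illustrative placeholders."""
    return math.ceil(n / batch_size) * cost_per_batch
```

One consequence for the design: since any sample size between 11 and 20 costs the same two boxes, the cost-optimal sample size will always land on a batch boundary.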
Combining some of the ideas already mentioned could yield an interesting approach to choosing sample sizes. Consider treating the currently available clinical results on Activated carbon as fixed, with close to zero cost (Neil suggests these will likely be digitized and thus have zero incremental lookup cost). This implicitly assumes Activated carbon is a new treatment that has not been used in the past. In contrast, the results of alternative treatments must be ascertained through manual searching (per Neil's suggestion). Hence applying section 2.10 on unequal sample sizes, as Chiara suggested, will yield the number of past records to be manually searched. One intricacy not yet discussed is that the cost of retrieving one record of an alternative treatment will not be uniform: pre-digitized records are not organized by symptom or treatment, so the number of records that must be examined to find one positive record is random, the applicable cases perhaps being Poisson-distributed among all the other causes bringing kids to the ER.
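That randomness still yields a simple expected search cost: if applicable records make up a fraction p of the unsorted charts and are randomly mixed among the other ER causes (consistent with the Poisson-style picture above), then the number of charts examined to find k applicable ones follows a negative binomial distribution with mean k/p. A sketch, with p and the per-record cost as assumed placeholder values:

```python
def expected_records_to_search(k, p):
    """Expected number of charts examined to find k applicable records,
    when applicable records are a randomly mixed fraction p of all
    charts (negative binomial waiting time, mean k / p)."""
    return k / p

def expected_search_cost(k, p, cost_per_record):
    """Expected manual-search cost for k applicable records."""
    return expected_records_to_search(k, p) * cost_per_record
```

So if section 2.10 says we need 10 alternative-treatment records and (hypothetically) 1 chart in 4 is applicable, we should budget for examining about 40 charts, not 10.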
Alternatively, we can view this problem from a different perspective: what fixed cost are we willing to incur in the manual search process? Given this cost, we could estimate the number of applicable records that would be found, treat this value as fixed, and use section 2.10 to determine the number of cases in which Activated carbon is used. If this sample size is greater than the number of records currently available, simply wait and conduct the study when more records become available. [Here, the waiting time can be treated as a generalized cost (likely modelled with an exponential random variable) to be compared against the costs of the search process. In this way the investigator could minimize total cost by balancing the relative sample sizes.]
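The wait-versus-search comparison in the bracketed remark can be made concrete. If new Activated-carbon records accrue as a Poisson process, inter-arrival times are exponential, so the wait for m extra records has mean m/rate; multiplying by a per-month waiting cost gives an expected waiting cost to set against the manual-search cost. All rates and costs below are invented for illustration.

```python
def expected_waiting_cost(m_extra, rate_per_month, cost_per_month):
    """Expected cost of waiting for m_extra new records when records
    arrive at rate_per_month (exponential inter-arrival times, so the
    total wait has mean m_extra / rate_per_month)."""
    return (m_extra / rate_per_month) * cost_per_month

def cheaper_strategy(m_extra, rate_per_month, cost_per_month,
                     records_to_search, cost_per_record):
    """Compare expected waiting cost against expected search cost."""
    wait = expected_waiting_cost(m_extra, rate_per_month, cost_per_month)
    search = records_to_search * cost_per_record
    return "wait" if wait < search else "search"
```

For example, waiting for 12 more records at 4 per month with a waiting cost of 100/month (expected cost 300) beats a search expected to examine 1000 charts at 0.5 each (expected cost 500).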
I agree with Sean's option. Also, if we only care about being able to compute the estimates, an alternative way to handle the case where the required sample size exceeds the number of available records may be to apply the bootstrap; the cost in that case would be very small. However, it can be argued that resampling does not increase the amount of information in the original data, so it may not satisfy the requirements of our design.
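A minimal sketch of the bootstrap idea, which also makes the caveat visible: every resample is drawn from the same n original records, so the procedure is nearly free computationally but creates no new information. The statistic (standard error of the sample mean) and the resample count are arbitrary choices for illustration.

```python
import random
import statistics

def bootstrap_se(data, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard error of the sample mean.
    Resampling with replacement from the n available records is cheap,
    but it only reuses those same n records -- no new information."""
    rng = random.Random(seed)
    n = len(data)
    means = [statistics.mean(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(means)
```

On 20 illustrative observations this returns an estimate close to the usual s/sqrt(n), which is precisely the point: the bootstrap quantifies the uncertainty already in the data rather than reducing it the way additional records would.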