
missing values, assumptions, num subjects
alenarto
Posted on 07/17/09 17:42:45
Number of posts: 41
alenarto posts:

Hi All ~

I have a few odds 'n ends that keep popping back up... I was wondering what the best answer is and/or if it's documented in any prior publications:

1. The interpretation of bootstrap ratios as approximations of z-scores depends on the bootstrap distribution (correct?) - has anyone systematically investigated what this distribution looks like in imaging datasets?

2. The typical recommendation for #bootstrap and #permutation iterations is 100 and 500 respectively - again, is there justification that this is a good number (e.g., geneticists may resort to tens of thousands of iterations to reach a stable result)?

3. The software does not currently deal with missing values, correct? (It could in principle deal with NaNs via meannan, no?)

4.  The number of subjects is typically low in imaging - I assume this will affect the reliability estimates of saliences, correct? If this number is too low I presume it will limit the number of available permutations as well?

Many thanks - I realize these are scattered all over the place.
Agatha

Replies:

Untitled Post

jshen
Posted on 07/17/09 17:54:44
Number of posts: 291
jshen replies:


1. I don't know.

2. Yes, 100 and 500 are the suggested numbers of bootstrap and permutation iterations, and the estimates are usually stable by then. If they have not stabilized by then, they will not stabilize with tens of thousands of iterations either.

3. MATLAB will produce NaN values in certain circumstances, but you may get an error when displaying such results.

4. Yes, more subjects will help, but you cannot run the analysis with fewer than 3 subjects. Single-subject analysis in fMRI basically treats the onset blocks as subjects, so there are many implied subjects.



Untitled Post
rmcintosh
Posted on 07/18/09 08:54:21
Number of posts: 394
rmcintosh replies:


1. The interpretation of bootstrap ratios as approximations of z-scores depends on the bootstrap distribution (correct?) - has anyone systematically investigated what this distribution looks like in imaging datasets?

Not systematically.  I have examined a lot of data sets, and most show a slight skew but otherwise appear Gaussian. I have not seen any that are wildly distributed (e.g., bimodal, uniform, etc.).  Two things to keep in mind: it's better to consider the bootstrap procedure as a means to impart an estimate of reliability, as one would with confidence intervals, and the relation of the ratio to z-scores is only an approximation.  The emphasis of the bootstrap procedure is the standard error part.  The ratio is a convenient way to assess the standard error, particularly when you have 100K+ voxels and plotting little confidence intervals around them would be tough to visualize (and also computationally painful, since you would need to store the resampling distribution for each element).
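To make the ratio concrete, here is a rough MATLAB sketch with synthetic data (the variable names are hypothetical; this is an illustration of the idea, not the actual PLSgui code):

    % Bootstrap ratio = original salience / bootstrap standard error.
    % Synthetic data stands in for real PLS output.
    nBoot = 100;  nVox = 1000;
    salience = randn(1, nVox);                 % original saliences
    boot_est = repmat(salience, nBoot, 1) ...
               + 0.2*randn(nBoot, nVox);       % resampled estimates
    boot_se  = std(boot_est, 0, 1);            % standard error per voxel
    bsr      = salience ./ boot_se;            % bootstrap ratio
    reliable = abs(bsr) > 3;                   % rough z-like threshold

The |BSR| > 3 cutoff is only sensible to the extent the bootstrap distribution is roughly Gaussian, which is exactly the caveat above.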

2. The typical recommendation for #bootstrap and #permutation iterations is 100 and 500 respectively - again, is there justification that this is a good number (e.g., geneticists may resort to tens of thousands of iterations to reach a stable result)?

For the average data set, these numbers are reasonable to get stable estimates.  For strong effects, you usually get close to asymptote in the estimation with about 20 iterations.  When things are borderline (e.g., p = 0.06), you may find that more permutations are needed to be sure where the most likely p-value is.  Paradoxically, if you have a larger sample size, you usually need to do more iterations, since the total possible number of recombinations of the data is larger.

You also need to realize that genetics data sets are usually much smaller than what we deal with, so the recommended number of iterations is a reasonable compromise between getting stable inferential estimates and having the program finish the iterations in one's lifetime.
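One way to see why borderline cases need more iterations: a permutation p-value is a proportion, so its Monte Carlo standard error shrinks only with the square root of the number of permutations. A small illustrative MATLAB sketch (not PLSgui code):

    % Monte Carlo uncertainty of a permutation p-value.
    p = 0.06;                             % borderline observed p-value
    for nPerm = [500 5000 50000]
        se = sqrt(p*(1-p)/nPerm);         % SE of the estimated proportion
        fprintf('nPerm = %5d: p = %.3f +/- %.4f\n', nPerm, p, se);
    end

With 500 permutations, p = 0.06 +/- 0.011 straddles 0.05; with tens of thousands it no longer does.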

3. The software does not currently deal with missing values, correct? (It could in principle deal with NaNs via meannan, no?)

It does not.  It's difficult to have a single solution, since missing data can come from a number of sources (behavior, brain, etc.).  At present, we prefer that users deal with them separately.  If you have suggestions, though, please share them.

4.  The number of subjects is typically low in imaging - I assume this will affect the reliability estimates of saliences, correct? If this number is too low I presume it will limit the number of available permutations as well?

Correct (see point #2).  But the resampling procedures we use are a better means of dealing with the relatively small sample sizes we have in imaging.  There is a hefty literature on the use of resampling statistics with small samples in other disciplines. However, there is a lower limit to this.  I would not be comfortable with a group size below about 6.  This is an issue in patient work, and there really isn't a clear solution.   One thing I advise everyone to do is carefully check the descriptive statistics for your sample, to see whether there is wide heterogeneity, before you start resampling.  This will help you better understand your data.
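As a rough illustration of how sample size caps the permutation test (hypothetical two-group design, not PLSgui code):

    % The number of distinct relabelings bounds the smallest achievable
    % p-value at 1/nchoosek(n1+n2, n1) for a two-group comparison.
    for n = [4 6 10]                      % subjects per group
        nPerms = nchoosek(2*n, n);        % distinct group relabelings
        fprintf('n = %2d per group: %6d relabelings, min p = %.5f\n', ...
                n, nPerms, 1/nPerms);
    end

With 4 subjects per group there are only 70 relabelings (minimum p of about 0.014), which is part of why very small groups are uncomfortable.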





Untitled Post

jshen
Posted on 07/18/09 12:26:56
Number of posts: 291
jshen replies:


Here is a procedure to deal with raw images containing NaN values (a rough sketch follows below):
  1. Check the NaN areas of all raw images and create a binary NaN mask image.
  2. If you are using "Define brain region automatically", skip this step; otherwise, your "Predefined brain region", i.e. the binary brain mask image, should exclude any NaN voxels.
  3. Based on the NaN mask image, set the voxel intensity in the NaN areas to the minimum value across all raw images.
Now there are no NaN values in your source images, and the previously NaN areas will not enter into any computation.
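A rough MATLAB sketch of those three steps (hypothetical variable names; imgs is nSubj x nVox with one flattened raw image per row, and brain_roi is the binary mask):

    imgs = randn(5, 100);  imgs(:, 3) = NaN;      % synthetic images with NaNs
    brain_roi = true(1, 100);                     % predefined brain region
    nan_mask  = any(isnan(imgs), 1);              % 1. binary NaN mask
    brain_roi = brain_roi & ~nan_mask;            % 2. exclude NaN voxels
    imgs(:, nan_mask) = min(imgs(~isnan(imgs)));  % 3. fill with minimum value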



Untitled Post

jshen
Posted on 07/19/09 07:15:03
Number of posts: 291
jshen replies:


Disregard my procedure above for dealing with raw images containing NaN values. The PLSgui program will be modified in the next version to handle this automatically: the brain region will be adjusted to exclude any area containing NaN values.



Untitled Post
alenarto
Posted on 07/19/09 18:12:51
Number of posts: 41
alenarto replies:

thank you both for the replies.

randy, thanks very much for a very thorough reply. i think i've got it. i'll read up on the bootstrap for small samples - for some reason i need convincing on this one (probably b/c i am weak on the theory here). the z-score point is noted (i apologize for asking about this again - i believe you addressed it previously).

the missing values issue came up because i had to run quick sanity-check analyses on EEG data that had missing electrodes, differing in location across subjects and conditions (and the lab had not interpolated to fill in these values). we're working on correcting the data now.

i agree with you that addressing missing values might be dangerous, and i would be wary of a button that makes the missing values "go away" - but it would have been nice (time-saving) for the informed user to have either a quick way to interpolate or a way to exclude these from means (meannan.m). thanks jimmy for considering.
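for reference, the NaN-excluding mean i had in mind is only a couple of lines in MATLAB (an illustration, not the actual meannan.m):

    x  = [1 NaN 3; 4 5 NaN];        % toy data with missing entries
    ok = ~isnan(x);                 % which entries are present
    x(~ok) = 0;                     % zero out the NaNs
    m  = sum(x, 1) ./ sum(ok, 1);   % per-column mean over non-NaN entries
    % m = [2.5  5  3]

(newer MATLAB releases also offer mean(x, 'omitnan'), and the Statistics Toolbox has nanmean.)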

best,
agatha



