Before answering this question, it is worth explaining what outliers are and how they occur.

Outliers are data points that differ extremely from the rest of the data. They can arise simply by chance, or from measurement or sampling error. The decision about whether to remove an outlier is generally left to the researcher. This is because an outlier might be caused by the instrumentation, e.g. a low battery in a scale when measuring weight. Such an outlier is the product of an error, not a true score that could be re-measured to give the same result. However, an outlier might also be a real data point, produced by an extremely intelligent (or relatively unintelligent) individual. Removing such an outlier would be dishonest, because it means removing a real data point, which is manipulating the data. A normal distribution contains extreme scores in both tails. This means that extreme scores are not necessarily an effect of poor concentration during the study; they can represent the real score for that particular individual (1).

 

Detecting outliers can be difficult and time-consuming, especially when you have a data set containing e.g. 80 000 scores from 150 participants. Rousseeuw and Leroy (1996) describe ways of detecting outliers (2).
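To give a feel for what automated detection can look like, here is a minimal sketch of one common rule of thumb, the modified z-score based on the median absolute deviation (MAD). It is not taken from Rousseeuw and Leroy. The function name `flag_outliers`, the sample weights and the cutoff of 3.5 are assumptions chosen purely for illustration.

```python
from statistics import median

def flag_outliers(scores, cutoff=3.5):
    """Return the scores whose modified z-score exceeds the cutoff.

    The cutoff of 3.5 and the constant 0.6745 are conventional choices,
    not values taken from the post or its references.
    """
    centre = median(scores)
    # Median absolute deviation: a robust measure of spread.
    mad = median(abs(x - centre) for x in scores)
    if mad == 0:
        # At least half the scores are identical; the rule cannot be applied.
        return []
    return [x for x in scores if 0.6745 * abs(x - centre) / mad > cutoff]

# Hypothetical weights in kg; 7.1 mimics a faulty reading from a scale.
weights_kg = [70.1, 68.4, 72.3, 69.8, 71.0, 7.1, 70.6]
print(flag_outliers(weights_kg))
```

A rule like this only flags candidates. Whether a flagged score is an error or a real data point is still the researcher's call.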

 

There are ways of dealing with outliers other than simply getting rid of them, which can cause problems because they might be real scores. One option is to use robust statistics such as the median instead of the mean. The median is not as sensitive to outliers as the mean, because it is the middle point of the ordered data and a single outlier only shifts it slightly, by about one place. The mean, by contrast, takes into account the value of every number, so a single outlier can strongly affect it (3). Another way of dealing with outliers is to use nonparametric tests (4). These do not require the assumptions of normality or homogeneity of variance, and they typically work with ranks or medians rather than means.
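The difference in sensitivity between the mean and the median is easy to check by hand. The short sketch below uses made-up IQ scores (an assumption for illustration only) and Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical IQ scores, before and after one extreme but possibly real score.
iq_scores = [95, 98, 100, 102, 105]
with_outlier = iq_scores + [160]

print("without outlier:", mean(iq_scores), median(iq_scores))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

Adding the single score of 160 moves the mean from 100 to 110, while the median only moves from 100 to 101.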

 

When carrying out research, it is very important to get valid results, and accurate data are needed to back them up. It is the researcher’s responsibility to judge each outlier and to decide whether to remove it or ‘work around’ it. It is not dishonest to remove an outlier as long as the researcher has some evidence to suspect that it is not a real data point.

 

Further reading:


(1) http://stattrek.com/help/glossary.aspx?target=normal_distribution

(2) Rousseeuw, P., & Leroy, A. (1996). Robust Regression and Outlier Detection, 3rd edition. John Wiley & Sons.

(3) http://www.ltcconline.net/greenl/courses/201/descstat/mean.htm

(4) http://www.une.edu.au/WebStat/unit_materials/c6_common_statistical_tests/nonparametric_test.html

(5) http://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide-2.php

Comments on: "Is it dishonest to remove outliers from your data?" (3)

  1. I love your way of describing what an outlier is and how it’s dealt with. It flows nicely, making it a lovely read. On the subject of outliers, I would just like to point out once again the applied importance of outliers in certain industries. In pharmaceuticals, for example, outliers in data can be extremely important because they can point to a possible side effect of a drug; in this case it would be very irresponsible to disregard them. From a psychological point of view, if you are looking at how effective a therapy is and there are outliers, it could mean that there is a problem. Investigating why there is a problem could lead to a discovery that could help a lot of people, which is why outliers can be important.

    References
    Side effects – http://www.drugs.com/sfx/
    Evaluating Therapy – http://counsellingresource.com/lib/therapy/types/effectiveness/

    • I totally agree with you. Disregarding an outlier in e.g. the pharmaceutical industry, as you mentioned, WOULD BE VERY irresponsible. However, pharmacists wouldn’t simply delete an outlier or deal with it in some other way. They would have to invite the individuals whose scores were significantly different from those of the other participants for further tests, to try to determine the cause of the ‘different’ score. They cannot possibly release a drug knowing that it might cause unknown side effects.

      http://www.drugs.com/sfx/
