As part of our quality control efforts, when we ingest new sources of station precipitation time series, we compare each new station to stations previously ingested into our database. We want to weed out stations that are statistically and/or visually different from their neighbors. To this end, we have developed software that calculates correlation and difference statistics and combines them into an index that we use to rank each station's goodness of fit with its neighboring stations. We also generate graphics that display the station's time series together with its neighbors' time series, so that we can visually inspect stations whose statistics suggest a poor fit. We examine the plots individually and flag those that we suspect contain erroneous data. The plots include information about elevation, climatology, geographic location and station source so that we can determine whether the differences we see can be explained by these other variables. See the example plot below.
The algorithm selects a station from our precipitation database for a given source and searches for neighbor stations within an iteratively increasing radius around the station until a minimum number of "quality" stations is found. This number is usually three, but it can be adjusted if stations are sparse in the area of interest. A quality station is one with enough observations (5-15, depending on the data set's temporal length) within the time period of interest. For CHIRPS, that period is 1981 to present. At each search iteration, if the minimum number of quality stations is not found, the search radius is increased, up to a maximum of 150 km. If the minimum is not met at the maximum search radius, the station is skipped and noted in a log file. When enough quality neighbors are found, a distance-weighted mean (DWM) of the neighbors is calculated, and the correlation coefficient (R) and mean standard error (MSE) between the station and the DWM are calculated. These values are used to compute a composite index of the overall "badness" of the station compared to its neighbors' DWM. The badness index (BI) is calculated with:
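The sketch below covers only the steps leading up to the index: the neighbor search, the DWM, and the two statistics. It is a minimal Python illustration, not the original software. The station dictionaries, the function names, the 25 km starting radius and step, the inverse-distance weights, and the use of a mean squared difference as the error term are all assumptions made for illustration. The text specifies only the minimum neighbor count, the observation threshold, the 150 km maximum radius, and that R and an error statistic are computed against the DWM. The BI equation itself is not part of the sketch.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a spherical Earth (radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def find_neighbors(target, candidates, min_neighbors=3, min_obs=5,
                   start_km=25.0, step_km=25.0, max_km=150.0):
    """Widen the search radius until enough "quality" neighbors are found.

    Returns [(distance_km, station), ...] sorted by distance, or None when
    the minimum is not met at max_km (the caller would log and skip).
    start_km and step_km are assumed values, not taken from the text.
    """
    radius = start_km
    while radius <= max_km:
        found = []
        for st in candidates:
            d = haversine_km(target["lat"], target["lon"], st["lat"], st["lon"])
            n_obs = sum(v is not None for v in st["series"].values())
            if d <= radius and n_obs >= min_obs:
                found.append((d, st))
        if len(found) >= min_neighbors:
            return sorted(found, key=lambda pair: pair[0])
        radius += step_km
    return None


def distance_weighted_mean(neighbors, year):
    """DWM for one year, using inverse-distance weights (an assumption)."""
    num = den = 0.0
    for d, st in neighbors:
        v = st["series"].get(year)
        if v is None:
            continue
        w = 1.0 / max(d, 1.0)  # floor at 1 km to avoid division by zero
        num += w * v
        den += w
    return num / den if den else None


def pearson_r(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else float("nan")


def compare(target, neighbors):
    """Return (R, error) between the station and its neighbors' DWM.

    The error here is a mean squared difference, used as a stand-in
    because the text does not define its MSE term precisely.
    """
    pairs = []
    for year, v in sorted(target["series"].items()):
        dwm = distance_weighted_mean(neighbors, year)
        if v is not None and dwm is not None:
            pairs.append((v, dwm))
    if not pairs:
        return None
    xs, ys = zip(*pairs)
    err = sum((x - y) ** 2 for x, y in pairs) / len(pairs)
    return pearson_r(xs, ys), err
```

Each station is assumed to be a dictionary with `lat`, `lon` and a `series` mapping of year to the monthly value for the month being examined, with `None` for missing observations.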
The BI is calculated for each of the three wettest months at the station location, and the three values are added together to produce a "total badness index" (TBI) for each station. For each of the three wettest months, the time series of the station (blue) and its neighbors (distance-weighted shades of grey) are plotted along with the DWM (green). Each graphic contains the latitude, longitude and elevation of the station and its neighbors, as well as their names and station identification numbers. A map of the station locations is drawn in the upper right corner of the graph, with color-coded symbols depicting each neighbor's difference in elevation relative to the station. The climatological precipitation value for the given month, at the station and at each nearby station location, is plotted along the right side of the plot as a reference for judging station validity. The names of the three graphics files are then prepended with the TBI so that the operating system's file system can sort them numerically. Typically, only stations with a TBI greater than 100 are saved for viewing, since lower TBI values indicate stations in very good agreement with their neighbors.
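The bookkeeping described above (choosing the three wettest months, summing their BI values, and naming the plot files) might look something like the following Python sketch. The function names, the file name pattern, and the zero-padding of the TBI are assumptions for illustration. Zero-padding is one way to make a plain alphabetical file listing match numerical order; the text does not say how the original software formats the prefix.

```python
def wettest_months(climatology, n=3):
    """Return the n wettest months from {month_number: mean precipitation}."""
    return sorted(climatology, key=climatology.get, reverse=True)[:n]


def total_badness(bi_by_month, months):
    """Sum the per-month badness index values into the TBI."""
    return sum(bi_by_month[m] for m in months)


def plot_names(station_id, months, tbi, threshold=100):
    """Build TBI-prefixed plot file names (hypothetical naming pattern).

    Stations at or below the threshold get no plots, mirroring the text's
    practice of saving only stations with TBI greater than 100.
    """
    if tbi <= threshold:
        return []
    prefix = f"{round(tbi):05d}"  # zero-padded so text sort equals numeric sort
    return [f"{prefix}_{station_id}_m{m:02d}.png" for m in months]
```

Sorting the resulting file names in descending order would list the worst-fitting stations first.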
The plot above shows a station in Kenya, Majimazurifores, with an R value of 0.59 and an MSE of 32 for the month of May. The combined badness index is low, at 36, indicating a good fit between the station and its neighboring stations. The station's time series is plotted in blue and the DWM is plotted in green. All other stations are plotted in shades of grey assigned by distance from the station under examination, with lighter shades signifying greater distances. Green and blue dots mark the values that were used in the statistical calculations. Ancillary information about the station and its neighbors is printed below the plot and is used to help explain differences between the station and its neighbors. All of the neighboring station information (elevations, distances, station database sequence numbers (seqnums), source seqnums and station names) is listed in a consistent order to assist in cross-referencing. A future version will order these values by station distance.
The station locations are plotted in the upper left corner of the graph. The station under examination is plotted in the middle as a triangle, and the neighbors are plotted at their geographic positions within a box that encompasses the 150 km maximum radius used in the search algorithm. The neighbor stations are symbolized according to the magnitude of their difference in elevation relative to the station, and color indicates the direction of that difference. In this case, the station has one very close neighbor, 7 km away. Another nearby station is between 100 and 500 meters higher than it, and four other neighbors are between 100 and 500 meters below it.
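The elevation symbology can be pictured as a simple classification of each neighbor's elevation difference. The Python sketch below is illustrative only: the 100 m and 500 m breakpoints come from the ranges mentioned in the example, but the function name, the three size classes, and the direction labels are assumptions, not the plotting code's actual scheme.

```python
def elevation_symbol(station_elev_m, neighbor_elev_m, breaks=(100, 500)):
    """Classify a neighbor by elevation difference relative to the station.

    Returns (size_class, direction): size_class counts how many breakpoints
    the absolute difference reaches (0, 1 or 2 with the default breaks);
    direction says whether the neighbor is higher, lower or level.
    """
    diff = neighbor_elev_m - station_elev_m
    size_class = sum(abs(diff) >= b for b in breaks)
    if diff > 0:
        direction = "higher"
    elif diff < 0:
        direction = "lower"
    else:
        direction = "same"
    return size_class, direction
```

A plotting routine could then map `size_class` to a marker size and `direction` to a marker color.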
Finally, the CHG CHPClim value at the station is plotted as a blue bar to the right of the graph, along with color-coded bars for each neighbor. These bars show the climatology at each location for the given month and serve as a reality check to help filter out erroneous stations.
Through the use of these graphics, we have been able to remove station data that did not compare well with neighboring stations and/or the climatology of the location. Below is an example of a station that did not compare well with either; its three-month total badness index summed to 300. As can be seen, the station values are two to three times those of the climatology and its neighbors. The close proximity of the neighbors and their similar elevations suggest that the station should have similar rainfall totals, but it does not, so we have removed this station from use in the CHIRPS data product.
Since CHIRPS is a long-term data product that is used for trend analysis, it is important to maintain consistency throughout the data record. In the examples below, a shift in precipitation values appears in the late 1990s in all three of the wettest months. This inhomogeneity was detected upon visual inspection, and the station was removed from the data stream used in CHIRPS.