Impact of Different Data Processing Methods

Cornhusker Economics, April 10, 2019

The Impact of Different Data Processing Methods on Site-specific Management Recommendation

By Taro Mieno, Joe Luck and Zhengzheng Gao


Precision agriculture has the potential to enhance farming profitability substantially via site-specific management of fields. One of the promising ways of generating a profitability-enhancing input use recommendation map is on-farm randomized trials. The process of generating an input (say nitrogen) use recommendation map typically involves the following steps:

  1. Design and implement a randomized input use trial
  2. Collect yield data along with other field characteristics (slope, electrical conductivity, and organic matter)
  3. Process the data for statistical analysis
  4. Conduct regression analysis to estimate the production function (how the input affects crop yield)
  5. For each of the management units, find the input rate that maximizes profit for that unit
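The last step above can be illustrated with a minimal sketch. It assumes a quadratic yield response to the input with an electrical conductivity (EC) interaction, which is one plausible functional form and not necessarily the one estimated for any particular field. The function name `profit_maximizing_rate`, all coefficient values, and the prices are hypothetical, and clipping the result to the range of rates tried in the experiment is an added assumption.

```python
def profit_maximizing_rate(b_n, b_nsq, b_ecn, ec, crop_price, input_price,
                           min_rate, max_rate):
    """Rate that maximizes crop_price * yield - input_price * rate for one unit.

    Assumed response: yield = const + (b_n + b_ecn * ec) * rate + b_nsq * rate**2
    """
    if b_nsq >= 0:
        raise ValueError("a concave response (b_nsq < 0) is assumed")
    slope = b_n + b_ecn * ec
    # First-order condition: crop_price * (slope + 2 * b_nsq * rate) = input_price
    rate = (input_price / crop_price - slope) / (2 * b_nsq)
    # Keep the recommendation inside the range of rates tried in the experiment
    return min(max(rate, min_rate), max_rate)


# Hypothetical coefficients and prices, for illustration only
for ec in (10.0, 30.0):
    rate = profit_maximizing_rate(b_n=8.0, b_nsq=-0.2, b_ecn=0.05, ec=ec,
                                  crop_price=4.0, input_price=2.0,
                                  min_rate=8.37, max_rate=33.47)
    print(f"EC = {ec}: recommended rate = {rate:.2f}")
```

When the interaction coefficient `b_ecn` is zero, every management unit gets the same rate, which is the uniform-rate case.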

The major focus of this blog post is on step 3: we will examine the sensitivity of regression analysis (step 4) and the resulting recommendation map (step 5) to the way experimental data is processed. Specifically, we will examine how the way you define analysis units affects steps 4 and 5:

  • Method 1: use experimental trial units as regression analysis units
  • Method 2: divide each of the experimental trial units into sub-units, and use the sub-units as regression analysis units
  • Method 3: use yield monitor yield data points as regression analysis units

Figures below illustrate the three different types of data aggregation approaches.

Figure: plot, subplot, and point methods of data aggregation.
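The three ways of defining analysis units can be sketched in a few lines of code. This is only an illustration: the record fields (`plot_id`, `point_id`, `dist_ft`, `yield`, `ec`), the 70-foot sub-unit length, and the data values are all made up, and real yield monitor data would be assigned to units using their spatial coordinates.

```python
from collections import defaultdict
from statistics import fmean

PLOT_LENGTH_FT = 280.0      # plot length reported for the trial in this post
SUBPLOT_LENGTH_FT = 70.0    # hypothetical sub-unit length (4 sub-units per plot)


def analysis_unit(point, method):
    """Return the key of the analysis unit a yield point belongs to."""
    if method == "plot":        # Method 1: whole experimental trial unit
        return (point["plot_id"],)
    if method == "subplot":     # Method 2: sub-unit along the plot length
        last = int(PLOT_LENGTH_FT // SUBPLOT_LENGTH_FT) - 1
        index = min(int(point["dist_ft"] // SUBPLOT_LENGTH_FT), last)
        return (point["plot_id"], index)
    if method == "point":       # Method 3: each yield point is its own unit
        return (point["plot_id"], point["point_id"])
    raise ValueError(f"unknown method: {method}")


def aggregate(points, method):
    """Average yield and EC over all points falling in each analysis unit."""
    groups = defaultdict(list)
    for p in points:
        groups[analysis_unit(p, method)].append(p)
    return {
        key: {"yield": fmean(p["yield"] for p in members),
              "ec": fmean(p["ec"] for p in members),
              "n_points": len(members)}
        for key, members in groups.items()
    }


# Made-up yield monitor points for a single plot
points = [
    {"plot_id": 1, "point_id": i, "dist_ft": d, "yield": y, "ec": e}
    for i, (d, y, e) in enumerate([(10, 210, 20), (60, 190, 22),
                                   (100, 230, 35), (200, 205, 40)])
]
for method in ("plot", "subplot", "point"):
    print(method, len(aggregate(points, method)), "analysis units")
```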

In academic research studies, all three data processing methods are used. However, to the authors’ knowledge, the consequences of using different data processing methods are not well understood in the context of on-farm field trials. Indeed, there is no consensus among practitioners and researchers about the best way to define the analysis unit.

It is well known that yield data from a yield monitor contain measurement errors, and that these errors average out more when more yield data points are used to compute a mean. Thus, Method 1 produces analysis units with the least measurement error, and Method 3 has the most because it uses yield monitor data points as the unit of analysis without any averaging. However, data aggregation also masks important information. Suppose you suspect that electrical conductivity (EC) is an important soil characteristic that affects economically optimal nitrogen rates. EC can vary quite a lot within an experimental unit. Method 1 requires that all the EC values within an experimental unit be averaged, masking the potentially heterogeneous impact of nitrogen on crop yield depending on the level of EC. On the other hand, Method 2 allows researchers to elicit more granular interactive impacts of nitrogen and EC, because aggregating at a finer spatial resolution lets EC take different values within an experimental plot. In Method 3, each yield data point is matched with nearby EC values (there are different ways of matching). Therefore, Method 3 discards the least information for statistically identifying the interactive impacts of nitrogen and EC. Given these trade-offs, how the data aggregation method affects the final outcome is an important empirical question.
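The averaging effect can be illustrated with a small simulation. It assumes independent, normally distributed measurement errors, in which case the spread of a unit mean shrinks roughly in proportion to one over the square root of the number of points averaged. Real yield monitor errors are often spatially correlated, so the actual reduction may be smaller. The noise level and the number of points per unit below are hypothetical.

```python
import random
from statistics import fmean, pstdev


def spread_of_unit_means(points_per_unit, noise_sd=20.0, true_yield=200.0,
                         n_units=2000, seed=0):
    """Standard deviation of unit-level mean yields when only noise varies."""
    rng = random.Random(seed)
    means = [
        fmean(rng.gauss(true_yield, noise_sd) for _ in range(points_per_unit))
        for _ in range(n_units)
    ]
    return pstdev(means)


# Hypothetical numbers of yield points per analysis unit
for label, n in (("point (Method 3)", 1),
                 ("subplot (Method 2)", 10),
                 ("plot (Method 1)", 40)):
    spread = spread_of_unit_means(n)
    print(f"{label}: {n} points per unit, spread of means = {spread:.1f}")
```

The simulation shows only the measurement-error side of the trade-off; it says nothing about the information on within-plot EC variation that averaging discards.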


For each data set created using the data aggregation methods mentioned above, we run a Least Absolute Shrinkage and Selection Operator (LASSO) regression analysis, a statistical method that allows one to identify factors that do not contribute to explaining yield variation. Factors included as explanatory variables are seed rate (seed), nitrogen rate (NH3), soil electrical conductivity (EC), seed rate squared (seedsq), nitrogen rate squared (NH3sq), EC interacted with nitrogen rate (ECNH3), and EC interacted with seed rate (ECseed). The variables of particular interest are ECNH3 and ECseed. If they are selected (that is, LASSO chooses to keep them in the model), site-specific nitrogen or seed rates should be adjusted based on the value of EC. On the other hand, if they are left out of the model, no site-specific nitrogen or seed rate application is necessary.
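To make the variable selection idea concrete, here is a minimal LASSO sketch using coordinate descent with soft-thresholding on standardized variables. It is a teaching example under stated assumptions, not the implementation used for the analysis in this post, and in practice an established library routine would normally be used. The penalty value `lam`, the function names, and the synthetic data are all hypothetical, and the sketch assumes every column actually varies.

```python
from itertools import product
from statistics import fmean, pstdev


def build_features(seed, nh3, ec):
    """The seven explanatory variables named in the text, for one observation."""
    return {"seed": seed, "NH3": nh3, "EC": ec,
            "seedsq": seed ** 2, "NH3sq": nh3 ** 2,
            "ECNH3": ec * nh3, "ECseed": ec * seed}


def standardize(column):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mu, sd = fmean(column), pstdev(column)
    return [(v - mu) / sd for v in column]


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso(columns, y, lam, n_sweeps=200):
    """Minimize (1/2n) * ||y - mean(y) - Z b||^2 + lam * ||b||_1 over b,
    where Z holds the standardized columns. Returns {name: coefficient}."""
    names = list(columns)
    z = {k: standardize(columns[k]) for k in names}
    n = len(y)
    y_mean = fmean(y)
    resid = [v - y_mean for v in y]
    beta = {k: 0.0 for k in names}
    for _ in range(n_sweeps):
        for k in names:
            # Covariance of column k with the residual that excludes k's own fit
            rho = sum(zk * r for zk, r in zip(z[k], resid)) / n + beta[k]
            new = soft_threshold(rho, lam)
            delta = new - beta[k]
            if delta != 0.0:
                resid = [r - delta * zk for r, zk in zip(resid, z[k])]
                beta[k] = new
    return beta


# Made-up grid of trial values and a made-up yield response, for illustration only
rows = [build_features(s, n, e)
        for s, n, e in product((28.0, 30.5, 33.0, 36.0),     # thousand seeds/acre
                               (8.37, 16.74, 25.10, 33.47),  # gallons/acre
                               (15.0, 25.0, 40.0))]          # hypothetical EC values
columns = {name: [row[name] for row in rows] for name in rows[0]}
y = [150 + 5 * r["NH3"] - 0.1 * r["NH3sq"] + 0.02 * r["ECNH3"] for r in rows]
beta = lasso(columns, y, lam=0.5)
print("kept by LASSO:", [name for name, b in beta.items() if b != 0.0])
```

Variables whose coefficients are shrunk exactly to zero are the ones LASSO drops. Which variables survive depends on the penalty `lam`, which is typically chosen by cross-validation.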


Here, we use data obtained from nitrogen and seed experiments for corn production run in 2017 on a 70-acre field in Hamilton County, Nebraska. Figure 2 below shows the experimental design, with each plot spanning 280 feet × 60 feet. The target nitrogen rates were 8.37, 16.74, 25.10, and 33.47 gallons per acre. The target seed rates were 28,000, 30,500, 33,000, and 36,000 seeds per acre. For this field, data on yield, as-applied nitrogen and seed rates, and soil electrical conductivity were collected.


The figure below shows the results of statistical analysis on what factors matter in explaining yield variation depending on the way data is processed and analyzed.

Figure: variable selection results by data processing method.

Red indicates that the factor was excluded from the statistical model because it is considered irrelevant, while blue indicates that it is important to keep it in the model. For example, none of the variables relating to EC deep (EC_DP) are kept in the model when the plot-level data is used to estimate the yield function. As can be seen in the figure, the results from 2017 suggest that different data processing methods had strong effects on variable selection results and functional forms. One of the most important findings here is that whether EC deep should be kept in the model varies depending on the way data is processed and analyzed. Plot-level data suggests that one should simply apply uniform nitrogen and seed rates, ignoring EC deep, while point-wise data suggests that one should consider site-specific nitrogen and seed rates that vary with EC deep. This illustrates how sensitive the final recommendation about nitrogen and seed application rates is to the way we process and analyze data. Unfortunately, we are far from understanding which data processing methods work best, and more research needs to be done on this front. This is an important topic for practitioners (farmers and consultants) because they may draw the wrong conclusions about how they should be managing their input use. It is entirely possible that uniform rates are wrongly considered better than site-specific rates, and vice versa.
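The link from variable selection to the type of recommendation can be written as a simple rule. The sketch below is only illustrative: the function name `recommendation_type` is hypothetical, the interaction term names follow the variable list given earlier in this post, and the selection results are placeholders, not the results from the 2017 trial.

```python
def recommendation_type(kept_variables, interaction_terms=("ECNH3", "ECseed")):
    """'site-specific' if any EC interaction survives selection, else 'uniform'."""
    kept = set(kept_variables)
    return "site-specific" if kept & set(interaction_terms) else "uniform"


# Placeholder selection results, NOT the results from the 2017 trial
selected = {
    "plot": ["seed", "NH3", "NH3sq"],
    "point": ["seed", "NH3", "NH3sq", "ECNH3"],
}
for method, kept in selected.items():
    print(method, "->", recommendation_type(kept))
```

Because the rule depends only on which interaction terms survive, any change in selection caused by the data processing method carries straight through to the recommendation.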


Neither practitioners nor researchers have reached a consensus on how to define the statistical analysis unit after on-farm experiments are conducted. We need to be aware of how sensitive the final input use recommendation is to this choice, as demonstrated here. More research is needed to understand which data processing methods work better than others, and the answer may vary from context to context.



Taro Mieno
Assistant Professor
Department of Agricultural Economics
University of Nebraska-Lincoln

Joe Luck
Associate Professor
Department of Biological Systems Engineering
University of Nebraska-Lincoln

Zhengzheng Gao
Department of Agricultural Economics
University of Nebraska-Lincoln