Image-Based Vision Model:
Vision models take different forms depending on the purpose for which they were developed. For example, most existing spatial vision models take visual stimulus properties, such as the spatial frequency, contrast, and location of well-defined spatial patterns, as the model inputs. These models have served their purpose well in probing the visual mechanisms underlying the processing of spatial patterns. In addition, there are image-based vision models (e.g., Watson and Ahumada, 2005), whose inputs are images or pixel-based distributions. Such models can be used in real industrial applications to handle arbitrary target shapes.
The following describes our work on an image-based vision model, which simulates biological visual image processing in the visual system.
Visual Image
Processing:
We developed a framework to simulate and compute human visual performance based on the ideas of implicit masking, nonlinear processing, and other well-known properties of the visual system that have been used in many models. The basic functional components of this model are a front-end low-pass filter, a retinal nonlinearity, a cortical frequency representation with a frequency-dependent nonlinear process, and finally a decision stage.
Low-Pass Filtering: When the light-modulated information of an image enters the human eye, it passes through the optical lens and is captured by photoreceptors in the retina. One function of the photoreceptors is to sample the continuous spatial variation of the image discretely. The cone signals are further processed through horizontal cells, bipolar cells, amacrine cells, and ganglion cells, with some resampling. From an image-processing point of view, the combined effect of the optical lens, sampling, and resampling in the retinal mosaic is low-pass filtering.
We estimate the front-end filter from psychophysical experiments. It has been shown that visual behavior at high spatial frequencies follows an exponential curve. Yang et al. (1995) extrapolated this relationship to low spatial frequencies to describe the whole front-end filter with an exponential function of spatial frequency:
LPF(f) = Exp(-a f),    (1)
where a is a parameter specifying the rate of attenuation for a specific viewing condition. Yang and Stevenson (1997) modified the formula to account for the variation of a with the mean luminance of the image:
a = a_{0} + d L_{0}^{0.5},    (2)
where a_{0} and d are two parameters and L_{0} is the mean luminance of the image.
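As a concrete sketch, Eqs. 1 and 2 can be combined in a few lines of NumPy. The exponent is written as -a·f so that the filter attenuates high frequencies, consistent with a being a rate of attenuation; the parameter values a0 and d below are illustrative assumptions, not fitted values:

```python
import numpy as np

def front_end_lpf(f, L0, a0=0.05, d=0.02):
    """Front-end low-pass filter of Eqs. 1-2.

    a0 and d are illustrative placeholder values; in practice they are
    fit to psychophysical data. f is spatial frequency (cycles/deg)
    and L0 is the mean luminance of the image.
    """
    a = a0 + d * np.sqrt(L0)           # Eq. 2: attenuation rate grows with luminance
    return np.exp(-a * np.asarray(f))  # Eq. 1: exponential fall-off with frequency

# Higher mean luminance gives a larger a and hence stronger attenuation
# of high spatial frequencies.
gain_dim = front_end_lpf(10.0, L0=1.0)
gain_bright = front_end_lpf(10.0, L0=100.0)
```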
Retinal Compressive Nonlinearity:
In the retina, there are several major layers of cells, starting from the photoreceptors, including rods and three types of cones, through horizontal cells, bipolar cells, and amacrine cells, and finally to the ganglion cells, from which the information is transmitted out of the retina via optic nerve fibers to the central brain. Retinal processes include light adaptation, whereby the retina becomes less sensitive if continuously exposed to bright light. The adaptation effects are spatially localized. In the current model, the adaptation pools are assumed to be constrained by ganglion cells with an aperture window:
W_{g}(x, y) = Exp[-(x^{2} + y^{2})/(2r_{g}^{2})]/(2πr_{g}^{2}),    (3)
where r_{g} is the standard deviation of the aperture. The adaptation signal at the level of ganglion cells, I_{g}, is the convolution of the low-pass-filtered input image I_{c} with the window function W_{g}. In this algorithm, the window profile is approximated as spatially invariant by considering only foveal vision. The retinal signal I_{R} is the output of a compressive nonlinearity. The form of this nonlinear function is assumed here to be the Naka-Rushton equation, which has been widely used in models of retinal light adaptation. One major difference here is that the adaptation signal I_{g} in the denominator is a pooled signal, which makes the operation similar to a divisive normalization process:
I_{R} = w_{0} (1 + I_{0}^{n}) I_{c}^{n}/(I_{g}^{n} + I_{0}^{n}w_{0}^{n}),    (4)
where n and I_{0} are parameters that represent the exponent and the semi-saturation constant of the Naka-Rushton equation, respectively, and w_{0} is a reference luminance value. When I_{c} and I_{g} are both equal to w_{0}, the retinal output signal equals the input signal strength.
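The retinal stage of Eqs. 3 and 4 can be sketched as follows. The convolution with the normalized Gaussian aperture W_g is implemented here with SciPy's gaussian_filter, which is mathematically equivalent; all parameter values are illustrative placeholders, not fitted ones:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinal_stage(I_c, r_g=2.0, n=0.75, I0=1.0, w0=1.0):
    """Retinal adaptation pool and compressive nonlinearity (Eqs. 3-4).

    I_c is the low-pass-filtered input image as a 2-D array; r_g is the
    aperture's standard deviation (in pixels here, for illustration).
    """
    I_c = np.asarray(I_c, dtype=float)
    # Eq. 3: convolving with the normalized Gaussian window W_g is
    # equivalent to Gaussian smoothing with sigma = r_g.
    I_g = gaussian_filter(I_c, sigma=r_g)
    # Eq. 4: Naka-Rushton compression with the pooled adaptation signal
    # I_g in the denominator (a divisive-normalization-like term).
    return w0 * (1 + I0**n) * I_c**n / (I_g**n + I0**n * w0**n)
```

For a uniform image at the reference level (I_c = I_g = w_0), the output equals the input, matching the condition stated above.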
Cortical Compressive Nonlinearity:
Simple cells and complex cells in the visual striate cortex usually respond to stimuli within limited ranges of spatial frequency and orientation. To capture this frequency- and orientation-specific nonlinearity, one can transform the image I_{R} from the spatial domain to a frequency-domain representation T(f_{x}, f_{y}) via a Fourier transform, divided by n_{x} and n_{y} to normalize the amplitude in the frequency domain. Here f_{x} and f_{y} are the spatial frequencies in the x and y directions, respectively, and n_{x} by n_{y} is the number of image pixels.
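This transform-and-normalize step can be written directly with NumPy's FFT (the function name is illustrative):

```python
import numpy as np

def to_frequency_domain(I_R):
    """Fourier transform of the retinal image, normalized by the pixel
    count n_x * n_y so that component amplitudes do not scale with
    image size."""
    ny, nx = I_R.shape
    return np.fft.fft2(I_R) / (nx * ny)
```

With this normalization, the DC component T(0, 0) equals the mean value of the image, independent of n_x and n_y.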
These cells also exhibit nonlinear properties: their firing rate does not increase until the stimulus strength exceeds a threshold level, and it saturates when the stimulus strength is very strong. In the model calculation, the signal in the frequency domain passes through the same type of nonlinear compressive transform as in the retinal processing. Following the concept of frequency spread in implicit masking (see Fig. 2), one major step here is to compute the frequency spreading that contributes to the masking signal in the denominator of the nonlinear formula. In this model, the signal strength in the masking pool, T_{m}(f_{x}, f_{y}), is the convolution of the absolute signal amplitude |T(f_{x}, f_{y})| with an exponential window function:
W_{c}(f_{x}, f_{y}) = Exp[-(f_{x}^{2} + f_{y}^{2})^{0.5}/s],    (5)
where s correlates with the extent of the frequency spreading and the bandwidth of the frequency channels. Since the bandwidth of the frequency channels increases with spatial frequency, one should expect the value of s to increase with spatial frequency as well. To simplify the computation, however, this value is approximated as fixed in the current algorithm. Applying the same form of compressive nonlinearity as in the retina, the cortical signal in the frequency domain is expressed as:
T_{c} = sign(T) w_{0} (1 + T_{0}^{v}) |T|^{v}/(T_{m}^{v} + T_{0}^{v}w_{0}^{v}),    (6)
where v and T_{0} are parameters that represent the exponent and the semi-saturation constant of the Naka-Rushton equation for the cortical nonlinear compression, respectively. The term T_{m} in the denominator includes the energy spread of the DC component (i.e., at 0 cpd) of the spatial pattern. This component is processed under Eq. 6 in the same way as any other frequency maskers. Thus, the concept of implicit masking is naturally implemented in this image-processing framework.
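Under the simplifying assumptions above (a fixed s, and a real-valued T for simplicity), the cortical stage of Eqs. 5 and 6 can be sketched as below. The use of SciPy's fftconvolve for the frequency-plane pooling and all parameter values are illustrative choices, not part of the original specification; the sign(T)·|T|^v form preserves the sign of each frequency component:

```python
import numpy as np
from scipy.signal import fftconvolve

def cortical_stage(T, s=2.0, v=0.6, T0=1.0, w0=1.0):
    """Cortical masking pool and compressive nonlinearity (Eqs. 5-6)."""
    ny, nx = T.shape
    # Frequency coordinates in cycles per image, with the DC term at
    # index (0, 0) to match numpy's FFT layout.
    fx = np.fft.fftfreq(nx) * nx
    fy = np.fft.fftfreq(ny) * ny
    FX, FY = np.meshgrid(fx, fy)
    # Eq. 5: exponential spread window over spatial frequency.
    W_c = np.exp(-np.sqrt(FX**2 + FY**2) / s)
    # Masking pool T_m: convolve |T| with W_c over the frequency plane;
    # the pooled DC energy is what implements implicit masking.
    # (fftshift centers the kernel peak; the clip removes tiny negative
    # FFT round-off so T_m**v stays real.)
    T_m = np.maximum(fftconvolve(np.abs(T), np.fft.fftshift(W_c),
                                 mode='same'), 0.0)
    # Eq. 6: same compressive form as the retina, per frequency component.
    return (np.sign(T) * w0 * (1 + T0**v) * np.abs(T)**v
            / (T_m**v + T0**v * w0**v))
```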
In summary, the major process in the cortex is modeled by a compressive nonlinearity applied to the spatial frequency and orientation components. The cortical image representation in the frequency domain is given by the function T_{c}. This function can be used to calculate visual responses, to simulate visual performance in detecting spatial patterns, and to estimate perceived brightness.
