## Fundamentals of the DFT (fft) Algorithms

In this article, a physical explanation of the fundamentals of the DFT (fft) algorithms is presented in terms of waveform decomposition. After reading the article and trying the examples, the reader is expected to gain a clear understanding of the basics of the mysterious DFT (fft) algorithms.

## Optimization of Synthesis Oversampled Complex Filter Banks

An important issue with oversampled FIR analysis filter banks (FBs) is to determine inverse synthesis FBs, when they exist. Given any complex oversampled FIR analysis FB, we first provide an algorithm to determine whether there exists an inverse FIR synthesis system. We also provide a method to ensure the Hermitian symmetry property on the synthesis side, which is serviceable to processing real-valued signals. As an invertible analysis scheme corresponds to a redundant decomposition, there is no unique inverse FB. Given a particular solution, we parameterize the whole family of inverses through a null space projection. The resulting reduced parameter set simplifies design procedures, since the perfect reconstruction constrained optimization problem is recast as an unconstrained optimization problem. The design of optimized synthesis FBs based on time or frequency localization criteria is then investigated, using a simple yet efficient gradient algorithm.

## Hybrid Floating Point Technique Yields 1.2 Gigasample Per Second 32 to 2048 point Floating Point FFT in a single FPGA

Hardware Digital Signal Processing, especially hardware targeted to FPGAs, has traditionally been done using fixed point arithmetic, mainly due to the high cost associated with implementing floating point arithmetic. That cost comes in the form of increased circuit complexity. The increase circuit complexity usually also degrades maximum clock performance. Certain applications demand the dynamic range offered by floating point hardware, and yet require the speeds and circuit density usually associated with fixed point hardware. The Fourier transform is one DSP building block that frequently requires floating point dynamic range. Textbook construction of a pipelined floating point FFT engine capable of continuous input entails dozens of floating point adders and multipliers. The complexity of those circuits quickly exceeds the resources available on a single FPGA. This paper describes a technique that is a hybrid of fixed point and floating point operations designed to significantly reduce the overhead for floating point. The results are illustrated with an FFT processor that performs 32, 64, 128, 256, 512, 1024 and 2048 point Fourier transforms with IEEE single precision floating point inputs and outputs. The design achieves sufficient density to realize a continuous complex data rate of 1.2 Gigasamples per second data throughput using a single Virtex4-SX55-10 device.

## Introduction to Digital Signal Processing

●2 commentsNice slides introducing Digital Signal Processing.

## Efficient Digital Fiilters

●1 commentWhat would you do in the following situation? Let ’ s say you are diagnosing a DSP system problem in the field. You have your trusty laptop with your development system and an emulator. You figure out that there was a problem with the system specifications and a symmetric FIR filter in the software won ’ t do the job; it needs reduced passband ripple, or maybe more stopband attenuation. You then realize you don ’ t have any filter design software on the laptop, and the customer is getting angry. The answer is easy: You can take the existing filter and sharpen it. Simply stated, filter sharpening is a technique for creating a new filter from an old one [1] – [3] . While the technique is almost 30 years old, it is not generally known by DSP engineers nor is it mentioned in most DSP textbooks.

## An application of neural networks to adaptive playout delay in VoIP

The statistical nature of data traffic and the dynamic routing techniques employed in IP networks results in a varying network delay (jitter) experienced by the individual IP packets which form a VoIP flow. As a result voice packets generated at successive and periodic intervals at a source will typically be buffered at the receiver prior to playback in order to smooth out the jitter. However, the additional delay introduced by the playout buffer degrades the quality of service. Thus, the ability to forecast the jitter is an integral part of selecting an appropriate buffer size. This paper compares several neural network based models for adaptive playout buffer selection and in particular a novel combined wavelet transform/neural network approach is proposed. The effectiveness of these algorithms is evaluated using recorded VoIP traces by comparing the buffering delay and the packet loss ratios for each technique. In addition, an output speech signal is reconstructed based on the packet loss information for each algorithm and the perceptual quality of the speech is then estimated using the PESQ MOS algorithm. Simulation results indicate that proposed Haar-Wavelets-Packet MLP and Statistical-Model MLP adaptive scheduling schemes offer superior performance.

## HIERARCHICAL MOTION ESTIMATION FOR EMBEDDED OBJECT TRACKING

This paper presents an algorithm developed to provide automatic motion detection and object tracking embedded within intelligent CCTV systems. The algorithm development focuses on techniques which provide an efficient embedded systems implementation with the ability to target both FPGA and DSP devices. During algorithm development constraints on hardware implementation have been fully considered resulting in an algorithm which, when targeted at current FPGA devices, will take full advantage of the DSP resource commonly provided in such devices. The hierarchical structure of the proposed algorithm provides the system with a multi-level motion estimation process allowing low resolution estimation for motion detection and further higher resolution stages for motion estimation. An initial MATLAB prototype has demonstrated this algorithm capable of object motion estimation while compensating for camera motion, allowing a moving object to be tracked by a moving camera.

## An FPGA Implementation of Hierarchical Motion Estimation for Embedded Oject Tracking

This paper presents the hardware implementation of an algorithm developed to provide automatic motion detection and object tracking functionality embedded within intelligent CCTV systems. The implementation is targeted at an Altera Stratix FPGA making full use of the dedicated DSP resource. The Altera Nios embedded processor provides a platform for the tracking control loop and generic Pan Tilt Zoom camera interface. This paper details the explicit functional stages of the algorithm that lend themselves to an optimised pipelined hardware implementation. This implementation provides maximum data throughput, providing real-time operation of the described algorithm, and enables a moving camera to track a moving object in real time.

## Hidden Markov Model based recognition of musical pattern in South Indian Classical Music

Automatic recognition of musical patterns plays a crucial part in Musicological and Ethno musicological research and can become an indispensable tool for the search and comparison of music extracts within a large multimedia database. This paper finds an efficient method for recognizing isolated musical patterns in a monophonic environment, using Hidden Markov Model. Each pattern, to be recognized, is converted into a sequence of frequency jumps by means of a fundamental frequency tracking algorithm, followed by a quantizer. The resulting sequence of frequency jumps is presented to the input of the recognizer which use Hidden Markov Model. The main characteristic of Hidden Markov Model is that it utilizes the stochastic information from the musical frame to recognize the pattern. The methodology is tested in the context of South Indian Classical Music, which exhibits certain characteristics that make the classification task harder, when compared with Western musical tradition. Recognition of 100% has been obtained for the six typical music pattern used in practise. South Indian classical instrument, flute is used for the whole experiment.

## Design and implementation of odd-order wave digital lattice lowpass filters, from specifications to Motorol DSP56307EVM module

This thesis is dedicated to applying and developing explicit formulas for the design and implementation of odd-order lattice Lowpass wave digital filters (WDFs) on a Digital Signal Processor (DSP), such as a Motorola DSP56307EVM (Evaluation Module). The direct design method of Gazsi for filter types such as Butterworfh, Chebyshev, inverse Chebyshev, and Cauer (Elliptic) provides a straightforward method for calculating the coefficients without an extensive knowledge of digital signal processing. A program package to design and implement odd-order WDFs, including detailed procedures and examples, is presented in this thesis and includes not only the calculations of the coefficients, but also the simulation on a MATLAB platform and an implementation on a Motorola DSP56307EVM board. It is very quick, effective and convenient to obtain the coefficients when the user enters a few parameters according to the general specifications; to verify the characteristics of the designed filter; to simulate the filter on the MATLAB platform; to implement the filter on the DSP board; and to compare the results between the simulation and the implementation.

## Adaptive distributed noise reduction for speech enhancement in wireless acoustic sensor networks

An adaptive distributed noise reduction algorithm for speech enhancement is considered, which operates in a wireless acoustic sensor network where each node collects multiple microphone signals. In previous work, it was shown theoretically that for a stationary scenario, the algorithm provides the same signal estimators as the centralized multi-channel Wiener filter, while significantly compressing the data that is transmitted between the nodes. Here, we present simulation results of a fully adaptive implementation of the algorithm, in a non-stationary acoustic scenario with a moving speaker and two babble noise sources. The algorithm is implemented using a weighted overlap-add technique to reduce the overall input-output delay. It is demonstrated that good results can be obtained by estimating the required signal statistics with a long-term forgetting factor without downdating, even though the signal statistics change along with the iterative filter updates. It is also demonstrated that simultaneous node updating provides a significantly smoother and faster tracking performance compared to sequential node updating.

## BLAS Comparison on FPGA, CPU and GPU

High Performance Computing (HPC) or scientific codes are being executed across a wide variety of computing platforms from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, CPU and GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot product and Gaxpy or matrix-vector multiplication. In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating point values. To support scalability to large data-sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.

## Algorithm Adaptation and Optimization of a Novel DSP Vector Co-processor

The Division of Computer Engineering at Linköping's university is currently researching the possibility to create a highly parallel DSP platform, that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture is called ePUMA and it combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor will act as the main processor for general tasks and execution control, while the co-processors will accelerate computing intensive and parallel DSP kernels.This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse, that run on a single co-processor. The kernels will then be evaluated to find possible problems with the co-processors' microarchitecture and suggest solutions to the problems that might exist. The evaluation shows that the performance potential is very good, but a few problems have been identified, that causes significant overhead in the kernels. Pipeline mismatches, that occurs due to different pipeline lengths for different instructions, causes pipeline hazards and the current solution to this, doesn't allow effective use of the pipeline. In some cases, the single port memories will cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slow.

## Auditory Component Analysis Using Perceptual Pattern Recognition to Identify and Extract Independent Components From an Auditory Scene

The cocktail party effect, our ability to separate a sound source from a multitude of other sources, has been researched in detail over the past few decades, and many investigators have tried to model this on computers. Two of the major research areas currently being evaluated for the so-called sound source separation problem are Auditory Scene Analysis (Bregman 1990) and a class of statistical analysis techniques known as Independent Component Analysis (Hyvärinen 2001). This paper presents a methodology for combining these two techniques. It suggests a framework that first separates sounds by analyzing the incoming audio for patterns and synthesizing or filtering them accordingly, measures features of the resulting tracks, and finally separates sounds statistically by matching feature sets and making the output streams statistically independent. Artificial and acoustical mixes of sounds are used to evaluate the signal-to-noise ratio where the signal is the desired source and the noise is comprised of all other sources. The proposed system is found to successfully separate audio streams. The amount of separation is inversely proportional to the amount of reverberation present.

## HIERARCHICAL MOTION ESTIMATION FOR EMBEDDED OBJECT TRACKING

This paper presents an algorithm developed to provide automatic motion detection and object tracking embedded within intelligent CCTV systems. The algorithm development focuses on techniques which provide an efficient embedded systems implementation with the ability to target both FPGA and DSP devices. During algorithm development constraints on hardware implementation have been fully considered resulting in an algorithm which, when targeted at current FPGA devices, will take full advantage of the DSP resource commonly provided in such devices. The hierarchical structure of the proposed algorithm provides the system with a multi-level motion estimation process allowing low resolution estimation for motion detection and further higher resolution stages for motion estimation. An initial MATLAB prototype has demonstrated this algorithm capable of object motion estimation while compensating for camera motion, allowing a moving object to be tracked by a moving camera.

## A DSP Implementation of OFDM Acoustic Modem

●1 commentThe success of multicarrier modulation in the form of OFDM in radio channels illuminates a path one could take towards high-rate underwater acoustic communications, and recently there are intensive investigations on underwater OFDM. In this paper, we implement the acoustic OFDM transmitter and receiver design of [4, 5] on a TMS320C6713 DSP board. We analyze the workload and identify the most time-consuming operations. Based on the workload analysis, we tune the algorithms and optimize the code to substantially reduce the synchronization time to 0.2 seconds and the processing time of one OFDM block to 1.7 seconds on a DSP processor at 225 MHz. This experimentation provides guidelines on our future work to reduce the per-block processing time to be less than the block duration of 0.23 seconds for real time operations.

## Novel Method of Showing Frequency Transients in the Fourier Transform and it’s Application in Time-Frequency Analysis

Fourier Transform in the frequency domain is modified to also analyse frequency transients i.e. changes in the frequency spectrum with time variable of any order. This is analytically, a very useful tool as there are many problems where frequency variation with time has to be analyzed e.g. Doppler shift, Light through different mediums in time and space. Numerical calculations are usually done for such problems when needed. Here, Fourier transform is analyzed to incorporate more variables that simultaneously do the Time lag-Frequency Analysis (TLFA) from Fourier Transform by changing the Fourier Operator. Also, the Frequency Derivative Analysis (FDA) of any order can be analyzed from Fourier Transform. Validity of the operator is examined using Eigen value analysis and operator algebra.

## A NEW PARALLEL IMPLEMENTATION FOR PARTICLE FILTERS AND ITS APPLICATION TO ADAPTIVE WAVEFORM DESIGN

Sequential Monte Carlo particle ﬁlters (PFs) are useful for estimating nonlinear non-Gaussian dynamic system parameters. As these algorithms are recursive, their real-time implementation can be computationally complex. In this paper, we analyze the bottlenecks in existing parallel PF algorithms, and we propose a new approach that integrates parallel PFs with independent Metropolis-Hastings (PPF-IMH) algorithms to improve root mean-squared estimation error performance. We implement the new PPF-IMH algorithm on a Xilinx Virtex-5 ﬁeld programmable gate array (FPGA) platform. For a onedimensional problem and using 1,000 particles, the PPF-IMH architecture with four processing elements utilizes less than 5% Virtex-5 FPGA resources and takes 5.85 μs for one iteration. The algorithm performance is also demonstrated when designing the waveform for an agile sensing application.

## EFFICIENT MAPPING OF ADVANCED SIGNAL PROCESSING ALGORITHMS ON MULTI-PROCESSOR ARCHITECTURES

●2 commentsModern microprocessor technology is migrating from simply increasing clock speeds on a single processor to placing multiple processors on a die to increase throughput and power performance in every generation. To utilize the potential of such a system, signal processing algorithms have to be efficiently parallelized so that the load can be distributed evenly among the multiple processing units. In this paper, we study several advanced deterministic and stochastic signal processing algorithms and their computation using multiple processing units. Specifically, we consider two commonly used time-frequency signal representations, the short-time Fourier transform and the Wigner distribution, and we demonstrate their parallelization with low communication overhead. We also consider sequential Monte Carlo estimation techniques such as particle filtering, and we demonstrate that its multiple processor implementation requires large data exchanges and thus a high communication overhead. We propose a modified mapping scheme that reduces this overhead at the expense of a slight loss in accuracy, and we evaluate the performance of the scheme for a state estimation problem with respect to accuracy and scalability.

## Closing the gap: CPU and FPGA Trends in sustainable floating-point BLAS performance

Field programmable gate arrays (FPGAs) have long been an attractive alternative to microprocessors for computing tasks — as long as floating-point arithmetic is not required. Fueled by the advance of Moore’s Law, FPGAs are rapidly reaching sufficient densities to enhance peak floating-point performance as well. The question, however, is how much of this peak performance can be sustained. This paper examines three of the basic linear algebra subroutine (BLAS) functions: vector dot product, matrix-vector multiply, and matrix multiply. A comparison of microprocessors, FPGAs, and Reconfigurable Computing platforms is performed for each operation. The analysis highlights the amount of memory bandwidth and internal storage needed to sustain peak performance with FPGAs. This analysis considers the historical context of the last six years and is extrapolated for the next six years.