march wb, xiao b, yu cd, biros g
askit: an efficient parallel library for high-dimensional kernel summations
siam journal on scientific computing, 2016 (in print)

Kernel-based methods are a powerful tool in machine learning and computational statistics. A key bottleneck in these methods is computation involving the kernel matrix, whose cost scales quadratically with the problem size. Previously, we introduced ASKIT, an efficient, scalable, kernel-independent method for approximately evaluating kernel matrix-vector products. ASKIT is based on a novel, randomized method for efficiently factoring off-diagonal blocks of the kernel matrix using approximate nearest neighbor information. In this context, ASKIT can be viewed as an algebraic fast multipole method for arbitrary dimensions. In this paper, we introduce our open-source implementation of ASKIT. Features of our ASKIT library include: linear dependence on the input dimension of the data, the ability to approximate kernel functions with no prior information on the kernel, and scalability to tens of thousands of compute cores and data with billions of points or hundreds of dimensions. We also introduce some new extensions and improvements of ASKIT, included in our library. We introduce a new method for adaptively selecting approximation ranks and a method for correctly partitioning the nearest neighbor information, both of which improve the performance of ASKIT over our previous implementation. We describe the ASKIT algorithm in detail, and collect and summarize our previous theoretical complexity and error bounds in one place. We present a brief selection of experimental results illustrating the accuracy and scalability of ASKIT. We then provide some details and guidance for users of ASKIT.

xiao b,  biros g
 parallel algorithms for nearest neighbor search problems in high dimensions
siam journal on scientific computing, 2016 (in print)

The nearest neighbor search problem in general dimensions finds application in computational geometry, computational statistics, pattern recognition, and machine learning. Although there is a significant body of work on theory and algorithms, surprisingly little work has been done on algorithms for high-end computing platforms and no open-source library exists that can scale efficiently to thousands of cores. In this paper, we present algorithms and a library built on top of Message Passing Interface (MPI) and OpenMP that enable nearest neighbor searches on hundreds of thousands of cores for arbitrary dimensional datasets.
The library supports both exact and approximate nearest neighbor searches. The latter is based on iterative, randomized, and greedy KD-tree searches. We describe novel algorithms for the construction of the KD-tree, give complexity analysis, and provide experimental evidence for the scalability of the method. In our largest runs, we were able to perform an all-neighbors query search on a 13 TB synthetic dataset of 0.8 billion points in 2,048 dimensions on 131K cores of Oak Ridge’s XK6 “Jaguar” system. These results represent several orders of magnitude improvement over current state-of-the-art methods. Also, we apply our method to non-synthetic data from machine learning data repositories. For example, we perform an all-nearest-neighbor search on a variant of the “MNIST” handwritten digit dataset with 8 million points in 784 dimensions on 16,384 cores of the “Stampede” system at the Texas Advanced Computing Center, achieving less than one second per RKDT iteration.

yu c, huang j, austin w, xiao b,  biros g
performance optimization of the k-nearest neighbors kernel on x86 architectures
acm/ieee SC'15, austin tx, november 2015

Nearest neighbor search is a cornerstone problem in computational geometry, non-parametric statistics, and machine learning. Using exhaustive search to find all the pairwise k nearest-neighbors for a set of N points in d dimensions requires quadratic work. Fast algorithms can reduce the complexity to nearly linear work (although in high dimensions the searches are approximate). Such fast algorithms require the solution of many small-size nearest-neighbor problems exactly using exhaustive search. We term these small-problem-size exact searches as the “kNN kernel” or simply “kNN”. We propose an efficient implementation of kNN and its performance analysis on x86 architectures. We use multithreading, blocking, vectorization, and highly optimized assembly code for the most critical part of the algorithm. The key insight is that by fusing the distance calculation with the neighbor selection we can significantly improve memory throughput. We present and validate a performance model of the algorithm and we use it for parameter tuning. We perform an experimental study in which we vary the number of points N, the dimension d, and the number of nearest neighbors k. Overall we observe significant speedups. For example, when searching for 16 neighbors in a point dataset with 1.6 million points in 64 dimensions, our kernel is over four times faster than existing methods.
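As a point of reference for the exhaustive search described above, here is a minimal pure-Python sketch. It is not the optimized kernel from the paper: the function name `knn_exhaustive`, the squared-norm expansion of the distance, and the use of `heapq` for selection are all assumptions made for illustration.

```python
import heapq

def knn_exhaustive(queries, refs, k):
    """For each query, return the k nearest reference points as
    (squared_distance, reference_index) pairs, nearest first.
    Hypothetical helper for illustration only."""
    # Squared norms of the reference points are computed once and reused.
    ref_norms = [sum(v * v for v in r) for r in refs]
    result = []
    for q in queries:
        q_norm = sum(v * v for v in q)
        dists = []
        for j, r in enumerate(refs):
            dot = sum(a * b for a, b in zip(q, r))
            # ||q - r||^2 = ||q||^2 - 2 q.r + ||r||^2 (clamped against round-off)
            d2 = max(q_norm - 2.0 * dot + ref_norms[j], 0.0)
            dists.append((d2, j))
        # Neighbor selection: keep the k smallest distances.
        result.append(heapq.nsmallest(k, dists))
    return result

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
neighbors = knn_exhaustive(points, points, 2)
```

The work is proportional to the number of queries times the number of reference points times d, which is the quadratic cost the abstract refers to when every point queries every other point.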

march b, xiao b, tharakan s, yu c, biros g
a kernel independent FMM in general dimensions 
acm/ieee SC'15, austin tx, november 2015

We introduce a general-dimensional, kernel-independent, algebraic fast multipole method and apply it to kernel regression. The motivation for this work is the approximation of kernel matrices, which appear in mathematical physics, approximation theory, non-parametric statistics, and machine learning. Existing fast multipole methods are asymptotically optimal, but the underlying constants scale quite badly with the ambient space dimension. We introduce a method that mitigates this shortcoming; it only requires kernel evaluations and scales well with the problem size, the number of processors, and the ambient dimension—as long as the intrinsic dimension of the dataset is small. We test the performance of our method on several synthetic datasets. As a highlight, our largest run was on an image dataset with 10 million points in 246 dimensions.

march b, xiao b, tharakan s, yu c, biros g
robust treecode approximation for kernel machines       
Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD15), ACM, Sydney, Australia, August 2015

Since exact evaluation of a kernel matrix requires $O(N^2)$ work, scalable learning algorithms using kernels must approximate the kernel matrix. This approximation must be robust to the kernel parameters, for example the bandwidth for the Gaussian kernel. We consider two approximation methods: Nystrom and an algebraic treecode developed in our group. Nystrom methods construct a global low-rank approximation of the kernel matrix. Treecodes approximate just the off-diagonal blocks, typically using a hierarchical decomposition. We present a theoretical error analysis of our treecode and relate it to the error of Nystrom methods. Our analysis reveals how the block-rank structure of the kernel matrix controls the performance of the treecode. We evaluate our treecode by comparing it to the classical Nystrom method and a state-of-the-art fast approximate Nystrom method. We test the kernel matrix approximation accuracy for several different bandwidths and datasets. On the MNIST2M dataset (2M points in 784 dimensions) for a Gaussian kernel with bandwidth $h=1$, the Nystrom methods' error is over 90\% whereas our treecode delivers error less than 1\%. We also test the performance of the three methods on binary classification using two models: a Bayes classifier and kernel ridge regression. Our evaluation reveals the existence of bandwidth values that should be examined in cross-validation but whose corresponding kernel matrices cannot be approximated well by Nystrom methods. In contrast, the treecode scheme performs much better for these values.
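To make the quadratic cost and the role of the bandwidth concrete, the following is a small sketch of the exact (direct) Gaussian kernel summation that the approximation schemes above are meant to replace. The function names and the convention exp(-||x-y||^2 / (2 h^2)) for the Gaussian kernel are assumptions for illustration; they are not taken from the paper.

```python
import math

def gaussian_kernel(x, y, h):
    """Gaussian kernel with bandwidth h (assumed convention)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * h * h))

def direct_kernel_summation(points, weights, h):
    """Exact kernel matrix-vector product u_i = sum_j K(x_i, x_j) w_j.
    Every pair of points is visited, so the work grows as N^2."""
    return [
        sum(gaussian_kernel(x, y, h) * w for y, w in zip(points, weights))
        for x in points
    ]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
weights = [1.0, 1.0, 1.0]

# Moderate bandwidth: all points interact.
potentials = direct_kernel_summation(points, weights, h=1.0)

# Very small bandwidth: off-diagonal entries vanish and the kernel
# matrix is close to the identity, so it has no useful global low rank.
sharp = direct_kernel_summation(points, weights, h=0.01)
```

The second call hints at why bandwidth matters for approximation: as h shrinks the matrix approaches the identity, which a global low-rank factorization cannot capture.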

march b, xiao b, tharakan s, yu c, biros g
An algebraic parallel treecode in arbitrary dimensions       
Proceedings of the 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS15), Hyderabad, India, May 2015

We present a parallel treecode for fast kernel summation in high dimensions—a common problem in data analysis and computational statistics. Fast kernel summations can be viewed as approximation schemes for dense kernel matrices. Treecode algorithms (or simply treecodes) construct low-rank approximations of certain off-diagonal blocks of the kernel matrix. These blocks are identified with the help of spatial data structures, typically trees. There is extensive work on treecodes and their parallelization for kernel summations in three dimensions, but there is little work on high-dimensional problems. Recently, we introduced a novel treecode, ASKIT, which resolves most of the shortcomings of existing methods.
We introduce novel parallel algorithms for ASKIT, derive complexity estimates, and demonstrate scalability on synthetic, scientific, and image datasets. In particular, we introduce a local essential tree construction that extends to arbitrary dimensions in a scalable manner. We introduce data transformations for memory locality and use GPU acceleration. We report results on the “Maverick” and “Stampede” systems at the Texas Advanced Computing Center. Our largest computations involve two billion points in 64 dimensions on 32,768 x86 cores and 8 million points in 784 dimensions on 16,384 x86 cores.

gholami a, sundar h, malhotra d, biros g
FFT, FMM, or Multigrid? A comparative study of state-of-the-art Poisson solvers
Submitted for publication.

We discuss the fast solution of the Poisson problem on a unit cube. We benchmark the performance of the most scalable methods for the Poisson problem: the Fast Fourier Transform (FFT), the Fast Multipole Method (FMM), geometric multigrid (GMG), and algebraic multigrid (AMG). The GMG and FMM are novel parallel schemes using high-order approximation for Poisson problems developed in our group. The FFT code is from the P3DFFT library and the AMG code is from the ML Trilinos library. We examine and report results for weak scaling, strong scaling, and time to solution for uniform and highly refined grids. We present results on the Stampede system at the Texas Advanced Computing Center and on the Titan system at the Oak Ridge National Laboratory. In our largest test case, we solved a problem with 600 billion unknowns on 229,379 cores of Titan. Overall, all methods scale quite well to these problem sizes. We have tested all of the methods with different source distributions. Our results show that FFT is the method of choice for smooth source functions that can be resolved with a uniform mesh. However, it loses its performance in the presence of highly localized features in the source function. FMM and GMG considerably outperform FFT for those cases.

malhotra d, biros g
a distributed memory fast multipole method for volume potentials
Submitted for publication.

 The solution of a constant-coefficient elliptic partial differential equation (PDE) can be computed using an integral transform: a convolution with the fundamental solution of the PDE, also known as a volume potential. We present a Fast Multipole Method (FMM) for computing volume potentials and use them to construct spatially-adaptive solvers for the Poisson, Stokes and Helmholtz problems. Conventional N-body methods apply to discrete particle interactions. With volume potentials, one replaces the sums with volume integrals. Particle N-body methods can be used to accelerate such integrals but it is more efficient to develop a special FMM. In this paper, we discuss the efficient implementation of such an FMM. We use high-order piecewise Chebyshev polynomials and an octree data structure to represent the input and output fields and to enable spectrally accurate approximation of the near field, and we use the kernel-independent FMM (KIFMM) for the far-field approximation. For distributed-memory parallelism, we use space-filling curves, locally essential trees, and a hypercube-like communication scheme developed previously in our group. We present new near and far interaction traversals which optimize cache usage. Also, unlike particle N-body codes, we need a 2:1 balanced tree to allow for precomputations. We present a fast scheme for 2:1 balancing. Finally, we use vectorization, including the AVX instruction set on the Intel Sandy Bridge architecture, to get over 50% of peak floating point performance. We use task parallelism to employ the Xeon Phi on the Stampede platform at the Texas Advanced Computing Center (TACC). We achieve about 600 GFlop/s of double precision performance on a single node. Our largest run on Stampede took 3.5s on 16K cores for a problem with 18E+9 unknowns for a highly nonuniform particle distribution (corresponding to an effective resolution exceeding 2E+23 unknowns since we used 23 levels in our octree).

moon l, long d, joshi s, tripathi v, xiao b, biros g
parallel algorithms for clustering and nearest neighbor search problems in high dimensions
acm/ieee scxy conference series, poster 

 Clustering and nearest neighbor searches in high dimensions are fundamental components of computational geometry, computational statistics, and pattern recognition. Despite the widespread need to analyze massive datasets, no MPI-based implementations are available to allow this analysis to be scaled to modern highly parallel platforms. We seek to develop a set of algorithms that will provide scalability and performance for these fundamental problems.

rahimian a, lashuk i, veerapaneni s. k, aparna c, malhotra d, moon l, sampath r, shringarpure a, vetter j, vuduc r, zorin d, biros g
petascale direct numerical simulation of blood flow on 200K cores and heterogeneous architectures
acm/ieee scxy conference series, pp. 1–11, 2010, (Gordon Bell Prize)

 We present a fast, petaflop-scalable algorithm for Stokesian particulate flows. Our goal is the direct simulation of blood, which we model as a mixture of a Stokesian fluid (plasma) and red blood cells (RBCs). Directly simulating blood is a challenging multiscale, multiphysics problem. We report simulations with up to 260 million deformable RBCs. The largest simulation amounts to 90 billion unknowns in space. In terms of the number of cells, we improve the state of the art by several orders of magnitude: the previous largest simulation, at the same physical fidelity as ours, resolved the flow of O(1,000-10,000) RBCs. Overall, the code has scaled on 256 CPU-GPUs on Teragrid's Lincoln cluster, on 327 CPU-GPUs on the Keeneland cluster, and on 200,000 AMD cores of the Oak Ridge National Laboratory's Jaguar PF system. In our largest simulation, we have achieved 0.7 Petaflop/s of sustained performance on Jaguar.

veerapaneni sk, rahimian a, biros g, zorin d
a fast algorithm for simulating vesicle flows in three dimensions
in review, pp. 1–40, 2010

 Vesicles are locally-inextensible fluid membranes that can sustain bending. In this paper, we present a fast algorithm for simulating the dynamics of vesicles suspended in viscous fluids. Spatial quantities are discretized using spherical harmonics, and quadrature rules for singular surface integrals need to be adapted to this case; an algorithm for surface reparameterization is needed to ensure sufficient stability of the time-stepping scheme, and spectral filtering is introduced to maintain reasonable accuracy while minimizing computational costs. We obtain a time-stepping scheme that, in our numerical experiments, is unconditionally stable. We present results to analyze the cost and convergence rates of the overall scheme. To illustrate the applicability of the new method, we consider a few vesicle-flow interaction problems: a single vesicle in relaxation, sedimentation, shear flows, and many-vesicle flows.

chaillat s, biros g
FaIMS: a fast algorithm for the inverse medium problem with multiple frequencies and multiple sources for the scalar Helmholtz equation
Journal of Computational Physics, 231(20), pp. 4403 - 4421, 2012

We consider the inverse medium problem for the time-harmonic wave equation with broadband and multi-point illumination in the low-frequency regime. This model finds many applications in science and engineering (e.g., seismic imaging, non-destructive evaluation, and optical tomography). We formulate the problem using a Lippmann-Schwinger formulation, which we discretize using a quadrature method. We consider small perturbations of the background medium and we invert the Born approximation. To solve this inverse problem, we use a least squares formulation that is regularized with the truncated Singular Value Decomposition (SVD). We have developed an approximate SVD method that reduces the cost of the factorization and provides orders of magnitude improvements over a black-box dense SVD. We provide numerical results that demonstrate the scalability of the method.
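For readers unfamiliar with truncated-SVD regularization, the standard form of the regularized least squares solution can be written as follows. The symbols here are assumptions introduced only for this illustration: $A$ is the discretized (Born) forward operator, $b$ the data, and $k$ the truncation rank. The approximate SVD developed in the paper is not reproduced here.

$$
A = \sum_{i} \sigma_i\, u_i v_i^{*}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0, \qquad
x_k = \sum_{i=1}^{k} \frac{u_i^{*} b}{\sigma_i}\, v_i .
$$

Discarding the terms with $i > k$ removes the components divided by the smallest singular values, which are the ones that amplify noise in $b$ the most.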

ghigliotti g, rahimian a, biros g, misbah c
vesicle migration and spatial organization driven by flow line curvature
physical review letters, in press, pp. 1–4, 2010

 Cross-streamline migration of deformable entities is essential in many problems such as industrial particulate flows, DNA sorting, and blood rheology. Using numerical experiments, we have discovered that vesicles suspended in a flow with curved flow lines migrate towards regions of high flow-line curvature, which are regions of high shear rates. The migration velocity of a vesicle is found to be a universal function of the normal stress difference and the flow curvature. This finding quantitatively demonstrates a direct coupling between a microscopic quantity (migration) and a macroscopic one (normal stress difference). Furthermore, simulations with multiple vesicles revealed a self-organization, which corresponds to segregation, in a rim closer to the inner cylinder, resulting from a subtle interaction among vesicles. Such segregation effects could have a significant impact on the rheology of vesicle flows.

gooya a, biros g, davatzikos c
deformable registration of glioma images using an EM algorithm and diffusion-reaction modeling 
IEEE Transactions on Medical Imaging, in press, pp. 1–15, 2010

 We investigate the problem of atlas registration of brain images with gliomas. Multi-parametric imaging modalities (T1, T1-CE, T2, and FLAIR) are first utilized for segmentations of different tissues, and to compute the posterior probability map (PBM) of membership to each tissue class, using supervised learning. Similar maps are generated in the initially normal atlas by modeling the tumor growth using a reaction-diffusion equation. Deformable registration using a demons-like algorithm is used to register the patient images with the tumor-bearing atlas. Joint estimation of the simulated tumor parameters (e.g., location, mass effect, and degree of infiltration) and the spatial transformation is achieved by maximization of the log-likelihood of observation. The proposed method has been evaluated on five simulated data sets created by Statistically Simulated Deformations (SSD), and fifteen real multichannel glioma data sets.

sampath rs, biros g
parallel elastic registration using a multigrid preconditioned Gauss-Newton-Krylov solver, grid continuation and octrees 
in review, pp. 1–30, 2010

 We present a parallel algorithm for intensity-based elastic image registration. This algorithm integrates several components: parallel octrees, multigrid preconditioning, a Gauss-Newton-Krylov solver, and grid continuation. We use a non-parametric deformation model based on trilinear finite element shape functions defined on octree meshes. Our C++ based implementation uses the Message Passing Interface (MPI) standard and is built on top of the Dendro and PETSc libraries. We demonstrate the performance of our method on synthetic and medical images. We demonstrate the scalability of our implementation on up to 4096 processors on the Sun Constellation Linux Cluster "Ranger" at the Texas Advanced Computing Center (TACC).

adavani s, biros g
fast algorithms for inverse problems with parabolic PDE constraints
in review, pp. 1–15, 2010

 We present optimal complexity algorithms to solve the inverse problem with parabolic PDE constraints on the 2D unit box where the temporal variation of a source function is known but the spatial variation is unknown. We consider measurements on a single boundary and two opposite boundaries. The problem is formulated as a PDE-constrained optimization problem. We use a reduced space approach in which we eliminate the state and adjoint variables and we iterate in the inversion parameter space using the Conjugate Gradients algorithm. We derive analytical expressions for the entries of the reduced Hessian and propose preconditioners based on a low-rank approximation of the Hessian. We also propose preconditioners for problems with non-constant coefficient PDE constraints. We observed mesh-independent and noise-independent convergence of CG with the preconditioner.

rahimian a, veerapaneni sk, biros g
dynamic simulation of locally inextensible vesicles suspended in an arbitrary two-dimensional domain, a boundary integral method
journal of computational physics, pp. 6466–6484, (229) 2010

 We consider numerical algorithms for the simulation of hydrodynamics of two-dimensional vesicles suspended in a viscous Stokesian fluid. The motion of vesicles is governed by the interplay between hydrodynamic and elastic forces. Continuum models of vesicles use a two-phase fluid system with interfacial forces that include tension (to maintain local inextensibility) and bending. Previously, a semi-implicit time-marching scheme based on a boundary integral formulation of the Stokes problem for vesicles in an unbounded medium was proposed. In this paper, we consider confined flows within arbitrary-shaped stationary/moving geometries and flows in which the interior (to the vesicle) and exterior fluids have different viscosity. Overall, our method achieves several orders of magnitude speed-up compared to standard explicit schemes.

sampath rs, biros g
a parallel geometric multigrid method for finite elements on octree meshes
siam journal on scientific computing, 32(3), pp. 1361–1392, 2010

 We present a parallel geometric multigrid algorithm for solving variable-coefficient elliptic partial differential equations on the unit box (with Dirichlet or Neumann boundary conditions) using highly nonuniform, octree-based, conforming finite element discretizations. Our octrees are 2:1 balanced, that is, we allow no more than one octree-level difference between octants that share a face, edge, or vertex. We describe a parallel algorithm whose input is an arbitrary 2:1 balanced fine-grid octree and whose output is a set of coarser 2:1 balanced octrees that are used in the multigrid scheme. Also, we derive matrix-free schemes for the discretized finite element operators and the intergrid transfer operations. The overall scheme is second-order accurate for sufficiently smooth right-hand sides and material properties; its complexity for nearly uniform trees is O((N/np) log(N/np)) + O(np log np), where N is the number of octree nodes and np is the number of processors. Our implementation uses the Message Passing Interface standard. We present numerical experiments for the Laplace and Navier (linear elasticity) operators that demonstrate the scalability of our method. Our largest run was a highly nonuniform, 8-billion-unknown, elasticity calculation using 32,000 processors on the Teragrid system, “Ranger,” at the Texas Advanced Computing Center. Our implementation is publicly available in the Dendro library, which is built on top of the PETSc library from Argonne National Laboratory.
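The 2:1 balance condition defined above can be illustrated with a small checker. This is only a sketch under stated assumptions: it uses a two-dimensional quadtree as a stand-in for the octree, represents each leaf as a hypothetical `(level, i, j)` triple, and compares all pairs of leaves, which is far from the parallel algorithms in Dendro.

```python
def is_2to1_balanced(leaves):
    """Check the 2:1 balance condition on quadtree leaves.

    Each leaf is (level, i, j) and covers the square
    [i, i+1] x [j, j+1] scaled by 2**-level inside the unit box.
    Representation and function name are assumptions for illustration.
    """
    max_level = max(level for level, _, _ in leaves)
    boxes = []
    for level, i, j in leaves:
        # Express every leaf in integer units of the finest level.
        size = 1 << (max_level - level)
        boxes.append((level, i * size, j * size, size))
    for a in range(len(boxes)):
        la, xa, ya, sa = boxes[a]
        for b in range(a + 1, len(boxes)):
            lb, xb, yb, sb = boxes[b]
            # Closed squares touch if they share a face or a vertex.
            touching = (xa <= xb + sb and xb <= xa + sa and
                        ya <= yb + sb and yb <= ya + sa)
            if touching and abs(la - lb) > 1:
                return False
    return True

# Unit square refined once, then its lower-left child refined once more.
balanced = [(2, 0, 0), (2, 1, 0), (2, 0, 1), (2, 1, 1),
            (1, 1, 0), (1, 0, 1), (1, 1, 1)]

# Refining leaf (2, 1, 1) again puts level-3 leaves next to level-1 leaves.
unbalanced = [(2, 0, 0), (2, 1, 0), (2, 0, 1),
              (3, 2, 2), (3, 3, 2), (3, 2, 3), (3, 3, 3),
              (1, 1, 0), (1, 0, 1), (1, 1, 1)]

balanced_ok = is_2to1_balanced(balanced)
unbalanced_ok = is_2to1_balanced(unbalanced)
```

In the second tree the leaf `(3, 3, 3)` touches the level-1 leaf `(1, 1, 1)` at the center of the domain, a two-level jump, so the check fails.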

aparna c, williams s, oliker l, lashuk i, biros g, vuduc r
optimizing and tuning the fast multipole method for state-of-the-art multicore architectures
ieee proceedings of ipdps, pp. 1–15, 2010

 This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25X on Intel's quad-core Nehalem, 9.4X on AMD's quad-core Barcelona, and 37.6X on Sun's Victoria Falls (dual-sockets on all systems). We also compare our single-precision code against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA's most advanced GPU architecture.

kaoui b, biros g, misbah c
why do red blood cells have asymmetric shapes even in a symmetric flow?
physical review letters, (103), 2009

 Understanding why red blood cells (RBCs) move with an asymmetric shape (slipperlike shape) in small blood vessels is a long-standing puzzle in blood circulatory research. By considering a vesicle (a model system for RBCs), we discovered that the slipper shape results from a loss in stability of the symmetric shape. It is shown that the adoption of a slipper shape causes a significant decrease in the velocity difference between the cell and the imposed flow, thus providing higher flow efficiency for RBCs. Higher membrane rigidity leads to a dramatic change in the slipper morphology, thus offering a potential diagnostic tool for cell pathologies.

lashuk i, aparna c, langston h, nguyen ta, sampath ra, shringarpure a, vuduc r, ying l, zorin d, biros g
a massively parallel adaptive fast-multipole method on heterogeneous architectures
acm/ieee scxy conference series, pp. 1–11, 2009

 We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC ’03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.


sampath r.s, adavani s.s, sundar h, lashuk i, and biros g
dendro: parallel algorithms for multigrid and amr methods on 2:1 balanced octrees
acm/ieee scxy conference series, 2008

veerapaneni s.k, raj r, biros g, and purohit p.k
analytical and numerical solutions for shapes of quiescent 2D vesicles
international journal of nonlinear mechanics, 2008

sundar h, davatzikos c, and biros g
biomechanically-constrained 4D estimation of myocardial motion
submitted for publication, 2008

veerapaneni s.k, gueyffier d, zorin d, and biros g
a boundary integral method for simulating the dynamics of inextensible vesicles suspended in a viscous fluid in 2D
journal of computational physics, 2009

biros g and dogan g
a multilevel algorithm for inverse problems with elliptic PDE constraints
inverse problems, 2008

veerapaneni s.k and biros g
the Chebyshev fast Gauss and nonuniform fast Fourier transforms and their application to the evaluation of distributed heat potentials
journal of computational physics, 2008

sundar h, sampath r.s, adavani s.s, davatzikos c, and biros g
low-constant parallel algorithms for finite element simulations using linear octrees
acm/ieee scxy conference series, 2007

sundar h, sampath r.s, and biros g
bottom-up construction and 2:1 balance refinement of linear octrees in parallel
siam journal on scientific computing, 2007

adavani s.s and biros g
multigrid algorithms for inverse problems with linear parabolic pde constraints
siam journal on scientific computing, 2007

hogea c, davatzikos c, and biros g
an image-driven parameter estimation problem for a reaction-diffusion glioma growth model with mass effects
journal of mathematical biology, 2007

veerapaneni s.k. and biros g
a fast high-order integral equation solver for the heat equation with moving boundaries in 1d
siam journal on scientific computing, 2007

ying l, biros g, and zorin d
a high-order 3d boundary integral equation solver for elliptic PDEs with smooth boundaries
journal of computational physics, 2006

akcelik v, biros g, draganescu a, ghattas o, hill j, and van bloemen waanders b
dynamic data-driven inversion for terascale simulations: real-time identification of airborne contaminants
acm/ieee scxy conference series, 2005

ying l, biros g, zorin d, and harper l
a new parallel kernel-independent fast multipole method
acm/ieee scxy conference series, 2003

ying l, biros g, and zorin d
a kernel-independent adaptive fast multipole method in two and three dimensions
journal of computational physics, 2004

akcelik v, bielak j, biros g, epanomeritakis i, fernandez a, ghattas o, kim e.j., o'hallaron d, and tu t
high-resolution forward and inverse earthquake modeling on terascale computers
acm/ieee scxy conference series, 2003

akcelik v, biros g, and ghattas o.
parallel multiscale Gauss-Newton-Krylov methods for inverse wave propagation
acm/ieee scxy conference series, 2002

biros g, ying l, and zorin d.
an embedded boundary integral equation solver for the unsteady incompressible Navier-Stokes equations
submitted 2004

akcelik v, biros g, ghattas o, long k. r., and van bloemen waanders b
a variational finite element method for source inversion for convective-diffusive transport
finite elements in analysis and design, 2002

biros g, ying l. and zorin d.
an embedded boundary integral equation solver for the Stokes equations
journal of computational physics, 2004

biros g, ying l. and zorin d.
the embedded boundary integral equation solver for the incompressible Navier-Stokes equations
international association for boundary element methods symposium, 2002

biros g. and ghattas o
inexactness issues in LNKS algorithms for PDE-constrained optimization
springer's lecture notes in computational science and engineering, 2002

biros g. and ghattas o.
parallel Lagrange-Newton-Krylov-Schur methods for PDE-constrained optimization.
part i: the Krylov-Schur solver
siam journal on scientific computing, 2005

biros g. and ghattas o.
parallel Lagrange-Newton-Krylov-Schur methods for PDE-constrained optimization.
part ii: the Lagrange-Newton solver, and its application to optimal control of steady viscous flows
siam journal on scientific computing, 2005

biros g. and ghattas o.
a Lagrange-Newton-Krylov-Schur method for PDE-constrained optimization
SIAG/OPT news and views, 2000

biros g. and ghattas o
parallel domain decomposition methods for optimal control of viscous incompressible flows
parallel computational fluid mechanics, 1999

biros g. and ghattas o
parallel Newton-Krylov algorithms for PDE-constrained optimization
acm/ieee scxy conference series, 1999

biros g, kallivokas l. f, ghattas o, jaramaz b
direct ct-scan to finite element modeling using a 3d fictitious domain method with an application to biomechanics
fourth u.s national congress on computational mechanics, 1997

biros g.
2d contour smoothing and surface reconstruction of tubular anatomical structures
ms thesis (biomedical engineering), 1996

biros g, dimitropoulos d, and polymenidis p
expert system development for machinery maintenance using temperature and vibration monitoring
diploma thesis (mechanical engineering)
aristotle university, 1995