Publications

2025

Gaussian Graphical Model Selection Using Graph Compression

ABSTRACT

Conditional independence between variables in Gaussian graphical models (also known as Gaussian Markov random fields) is represented by the conditional independence graph, 𝐺. Most approaches for inferring conditional independence graph rely on the penalized log-likelihood, where a regularization hyperparameter, 𝜆, controls the preference for either a sparsely or densely connected solution. In this article, we present a method for selecting 𝜆 based on the minimum description length (MDL) principle. Our approach improves upon previous methods by better accounting for 𝐺 using our novel graph coders. Experiments on known Gaussian graphical models demonstrate that our approach has a higher F1 score in recovering the true conditional independence graph than existing methods, especially when the number of observations is small compared to the number of variables. We also applied our method to a real-world electrocardiogram (ECG) dataset to investigate the inferred conditional independence graph in healthy subjects versus a group of subjects with Kawasaki disease. Finally, we used the learned conditional independence graphs for the classification of healthy subjects versus those with Kawasaki disease.

ACM Transactions on Probabilistic Machine Learning

M. Abolfazli, A. Høst-Madsen, J. Zhang, and A. Bratincsak

ACM

mRNA2vec: mRNA Embedding with Language Model in the 5′ UTR-CDS for mRNA Design

ABSTRACT

Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks in UTR compared to SOTA methods such as UTR-LM. It also gives a competitive performance in mRNA stability and protein production level tasks in CDS such as CodonBERT.

Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence, Philadelphia, Pennsylvania

H. Zhang, X. Gao, J. Zhang, L. Lai

arXiv

REAL: Response Embedding-based Alignment for LLMs

ABSTRACT

Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, which usually involve training on supervised datasets. Popular algorithms such as Direct Preference Optimization (DPO) rely on pairs of AI-generated responses ranked according to human annotation. The response pair annotation process might bring human bias. Building a correct preference dataset is the costly part of the alignment pipeline. To improve annotation efficiency and quality in the LLMs alignment, we propose REAL: Response Embedding-based Alignment for LLMs, a strategy for constructing a high-quality training dataset that focuses on acquiring the less ambiguous preference pairs for labeling out of a set of response candidates. Our selection process is based on the similarity of embedding responses independently of prompts, which guarantees the selection process in an off-policy setting, avoiding adaptively measuring the similarity during the training. Experimental results on real-world dataset SHP2 and synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. The model aligned with dissimilar response pairs obtained a better margin and win rate on the dialogue task. Our findings suggest that focusing on distinct pairs can reduce the label error and improve LLM alignment efficiency, saving up to 65% of annotators' work.

Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada

H. Zhang, X. Zhao, I. Molybog, J. Zhang

arXiv

Out-of-Distribution Detection Using Maximum Entropy Coding and Generative Networks

ABSTRACT

Given a default distribution P and a set of test data xM= {x1, x2, . . . , xM } this paper seeks to answer the question if it was likely that xM was generated by P . For discrete distributions, the definitive answer is in principle given by Kolmogorov-Martin-L¨ of randomness. In this paper we seek to generalize this to continuous distributions. We consider a set of statistics T1(xM ), T2(xM ), . . .. To each statistic we associate its maximum entropy distribution and with this a universal source coder. The maximum entropy distributions are subsequently combined to give a total codelength, which is compared with −log P (xM ). We show that this approach satisfied a number of theoretical properties. For real world data P usually is unknown. We transform data into a standard distribution in the latent space using a bidirectional generate network and use maximum entropy coding there. We compare the resulting method to other methods that also used generative neural networks to detect anomalies. In most cases, our results show better performance.

Proceedings of the 2024 International Symposium on Information Theory and Its Applications (ISITA), Taipei, Taiwan

M. Abolfazli, M. Z. Amirani, A. Høst-Madsen, J. Zhang and A. Bratincsak

arXiv

2024

HaSa: Hardness and Structure-Aware Contrastive Knowledge Graph Embedding

ABSTRACT

We consider a contrastive learning approach to knowledge graph embedding (KGE) via InfoNCE. For KGE, efficient learning relies on augmenting the training data with negative triples. However, most KGE works overlook the bias from generating the negative triples— false negative triples (factual triples missing from the knowledge graph). We argue that the generation of high-quality (i.e., hard) neg- ative triples might lead to an increase in false negative triples. To mitigate the impact of false negative triples during the generation of hard negative triples, we propose the Hardness and Structure-aware (HaSa) contrastive KGE method, which alleviate the effect of false negative triples while generating the hard negative triples. Experi- ments show that HaSa improves the performance of InfoNCE-based KGE approaches and achieves state-of-the-art results in several met- rics for WN18RR datasets and competitive results for FB15k-237 datasets compared to both classic and pre-trained LM-based KGE methods.

Proceedings of the ACM Web Conference 2024 (WWW), Singapore, Singapore

H.Zhang, J.Zhang, I.Molybog

arXiv

2022

Searching for Explanations: Testing Social Scientific Methods in Synthetic Ground-Truthed Worlds

ABSTRACT

A scientific model’s usefulness relies on its ability to explain phenomena, predict how such phenomena will be impacted by future interventions, and prescribe actions to achieve desired outcomes. We study methods for learning causal models that explain the behaviors of simulated “human” populations. Through the Ground Truth project, we solved a series of Challenges where our explanations, predictions and prescriptions were scored against ground truth information. We describe the processes that emerged for applying causal discovery, network analysis, agent-based modeling and other analytical methods to inform solutions to Challenge tasks. We present our team’s overall performance results on these Challenges and discuss implications for future efforts to validate social scientific research using simulation-based challenges.

Computational and Mathematical Organization Theory

A. C. Schmidt et. al.

2021

Graph Coding for Model Selection and Anomaly Detection in Gaussian Graphical Models

ABSTRACT

A classic application of description length is for model selection with the minimum description length (MDL) principle. The focus of this paper is to extend description length for data analysis beyond simple model selection and sequences of scalars. More specifically, we extend the description length for data analysis in Gaussian graphical models. These are powerful tools to model interactions among variables in a sequence of i.i.d Gaussian data in the form of a graph. Our method uses universal graph coding methods to accurately account for model complexity, and therefore provide a more rigorous approach for graph model selection. The developed method is tested with synthetic and electrocardiogram (ECG) data to find the graph model and anomaly in Gaussian graphical models. The experiments show that our method gives better performance compared to commonly used methods.

Proceedings of the International Symposium on Information Theory 2021 (ISIT)

M.Abolfazli, A.Høst-Madsen,J.Zhang, and A.Bratincsak

arXiv

2020

Differential Description Length for Hyperparameter Selection in Supervised Learning

ABSTRACT

Minimum description length (MDL) is an established method for model selection. For supervised learning problems, cross-validation is often used for model selection in practice. Reasons are 1) MDL is difficult to apply directly to data; 2) MDL may make restrictive statistical assumptions that decrease performance; and 3) MDL does not directly aim to minimize generalization error. In this paper, we introduce a modification to MDL, which we call differential description length (DDL). DDL partitions the data so that the codelength(s) it computes, reflects the conditional probability of seeing ‘new’ data given ‘old’ data. This differential codelength is what allows DDL to estimate generalization error like cross-validation. DDL is also better than cross-validation because it allows the learning algorithm to use the entire data without having to withhold subsets for validation and testing. Compared with MDL, DDL has both better performance (in finding models with smaller generalization error) and is easier to compute. Experiments with linear regression and deep neural networks show that DDL also outperforms cross-validation.

Proceedings of the International Symposium on Information Theory and Its Applications 2020 (ISITA)

M.Abolfazli, A.Høst-Madsen, and J.Zhang

arXiv

2019

User Empathy in Smart Energy Management

ABSTRACT

This paper investigates interaction among residential electricity users and utility company in a distribution network with the capability of two-way communication provided by smart grid. The energy consumption scheduling of electricity users is formulated as a game-theoretic problem within a group where all players are not totally selfish. Considering empathetic behavior of human decision-making, empathetic player action to other players actions can be influenced by recognizing the well being of others. The proposed model captures the empathy of electricity users in energy scheduling and demonstrates how this behavior can decrease energy consumption and electricity prices. Furthermore, it is shown that selfish users cannot make full use of energy saving by empathetic users.

Proceedings of the IEEE Power & Energy Society General Meeting 2019 (PESGM)

M. Abolfazli, J. Zhang, and A. Kuh

2018

Coding of graphs with application to graph anomaly detection

ABSTRACT

This paper has dual aims. First is to develop practical universal coding methods for unlabeled graphs. Second is to use these for graph anomaly detection. The paper develops two coding methods for unlabeled graphs: one based on the degree distribution, the second based on the triangle distribution. It is shown that these are efficient for different types of random graphs, and on real-world graphs. These coding methods is then used for detecting anomalous graphs, based on structure alone. It is shown that anomalous graphs can be detected with high probability.

Proceedings of IEEE International Symposium on Information Theory 2018 (ISIT)

A. Høst-Madsen and J. Zhang

arXiv

Greedy algorithm with approximation ratio for sampling noisy graph signals

ABSTRACT

We study the optimal sampling set selection problem in sampling a noisy $k$-bandlimited graph signal. To minimize the effect of noise when trying to reconstruct a $k$-bandlimited graph signal from $m$ samples, the optimal sampling set selection problem has been shown to be equivalent to finding a $m \times k$ submatrix with the maximum smallest singular value, $\sigma_{\min}$ \cite{chen2015discrete}. As the problem is NP-hard, we present a greedy algorithm inspired by a similar submatrix selection problem known in computer science and to which we add a local search refinement. We show that 1) in experiments, our algorithm finds a submatrix with larger $\sigma_{\min}$ than prior greedy algorithm \cite{chen2015discrete}, and 2) has a proven worst-case approximation ratio of $1/(1+\epsilon)k$, where $\epsilon$ is a constant.

Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 2018 (ICASSP)

C. Wu, W. Chen, and J. Zhang,

Who is more at risk in heterogeneous networks?

ABSTRACT

Network-based epidemics models try to characterize the impact of network topology, which represents contagion pathways, on the spread of infection. Although these models explicitly consider the dynamics of individuals in the given network (i.e., the state of the system is $\mathbf{x}(t)= [x_1(t), x_2(t), \ldots, x_N(t)]^T$), analysis has focused on characterizing the vulnerability of the \emph{entire} population rather than the vulnerability of the individuals in the population. We focus on characterizing the vulnerability of the $i$th individual in the network by studying the marginal probability of infection, $P(x_i = 1)$, of the scaled SIS process. Studying the vulnerability of individuals is important because it may be tempting to assume that $P(x_i =1)$ is related to the degree of the $i$th node. Since infection rate is usually assumed to be dependent on the number of infected neighbors, then it seems reasonable that nodes with more connections (i.e., higher degree) would be more at risk. We show that this is \emph{not} always true. Further, with a closed-form approximation of $P(x_i =1)$, as solving for the exact probability requires the summation of $2^N$ terms, we characterize the conditions for when degree distribution is a good indicator of how susceptible an individual is to infection.

Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

J. Zhang and J.M.F. Moura,

2017

Transmissibility of intra-host hepatitis C virus variants

ABSTRACT

Intra-host hepatitis C virus (HCV) population is genetically heterogeneous and organized in subpopulations. With exception of blood transfusion, transmission of HCV occurs via a small number of genetic variants, the effect of which is frequently described as a bottleneck. Stochasticity of transmission associated with the bottleneck is usually used to explain genetic differences among HCV populations identified in the source and recipient cases, which may be further exacerbated by intra-host HCV evolution and differential biological capacity of HCV variants to successfully establish population in a new host.

BMC Genomics, vol. 18, no. 10, 2017
D.S. Campo, J. Zhang, S. Ramachandran, Y. Khudyakov

paper

Cascading edge failures: a dynamic network process

ABSTRACT

This paper studies a network process that can potentially be used to model cascading failures in networks. The Dynamic Bond Percolation (DBP) process models, through stochastic local rules, the failure or recovery of an edge $(i,j)$ in a network. The probability that a working link fails or a failed link recovers may be independent of the state of other links \emph{or} may be dependent locally on the state of neighboring links as described by a cascade function $f$. In applications, this means that failures or recovery of links may have a regional preference, or, alternatively, relationships between neighbors in the network can lead to changes in the links between neighbors of neighbors. This paper shows that the dynamic evolution of $P(\mathbf{A},t)$, the probability that the network is in some state $\mathbf{A}$, describing the collective states of all the links, at time $t$ converges show that it converges to a stationary distribution. We use this distribution to study the emergence of global behaviors like consensus (i.e., catastrophic failure or full recovery of all the edges) or mixed (i.e., some failed and some working substructures). In particular, we show that, depending on the local dynamical rule, different network substructures, such as hub or triangle subgraphs, are more prone to failure.

IEEE Transactions on Network Science and Engineering DOI, 2017
J. Zhang and J.M.F. Moura

paper

Contact process with exogenous infection and the scaled SIS process

ABSTRACT

Propagation of contagion in networks depends on the graph topology. This paper is concerned with studying the time-asymptotic behavior of the extended contact processes on static, undirected, finite-size networks. This is a contact process with nonzero exogenous infection rate (also known as the $\epsilon$-SIS model). The only known analytical characterization of the equilibrium distribution of this process is for complete networks. For large networks with arbitrary topology, it is infeasible to numerically solve for the equilibrium distribution since it requires solving the eigenvalue-eigenvector problem of a matrix that is exponential in $N$, the size of the network. We derive a condition on the infection rates under which, depending on the degree distribution of the network, the equilibrium distribution of extended contact processes on \emph{arbitrary}, finite-size networks is well approximated by a closed-form formulation. We confirm the goodness of the approximation with small networks answering inference questions like the distribution of the percentage of infected individuals and the most-probable equilibrium configuration. We then use the approximation to analyze the equilibrium distribution of the extended contact process on the 4941-node US Western power grid.

Journal of Complex Networks, vol. 5, no. 5, Oct 2017
J. Zhang and J.M.F. Moura

paper arXiv

2016

Spectral radius and network processes with spontaneous infection/failure rate

ABSTRACT

We relate the time-limiting behavior of a network epidemics process to the spectral radius of the underlying network. The process we study is the scaled SIS network process, a continuous-time Markov process on a static network. Our analysis differs from previous work in that the scaled SIS process accounts for the possibility that a healthy individual has a nonzero probability of becoming infected even when all of its neighbors are healthy. For example, the source of infection may be outside a human only contact network for diseases with animal to human transmissions such as Ebola. We show that the sufficient condition for infection to become extinct not only depends on the ratio of infection and healing rates but also on $N$, the size of the network, whereas previous models, assuming no exogenous infection, showed dependency only on the infection and healing rates.

Proceedings of 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
J. Zhang and J.M.F. Moura

paper

Finding unique dense communities

ABSTRACT

Finding densely connected subgraphs, also called communities, in networks are of interest for many applications. In previous work, we showed an optimization method for efficiently finding subgraphs denser than the overall network. This result is derived from our studies of network processes, dynamical processes that model interactions between individual agents in networks (i.e., spread of infection or cascading failures). In this paper, we prove that these subgraphs are also unique in the sense that there are no other subgraphs in the network isomorphic to these subgraphs.

Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
J. Zhang and J.M.F. Moura

paper

Dr. June Zhang

Associate Professor @ University of Hawaii at Manoa

orcid: 0000-0002-4578-5759

Publications

2025

Gaussian Graphical Model Selection Using Graph Compression

mRNA2vec: mRNA Embedding with Language Model in the 5′ UTR-CDS for mRNA Design

REAL: Response Embedding-based Alignment for LLMs

Out-of-Distribution Detection Using Maximum Entropy Coding and Generative Networks

2024

HaSa: Hardness and Structure-Aware Contrastive Knowledge Graph Embedding

2022

Searching for Explanations: Testing Social Scientific Methods in Synthetic Ground-Truthed Worlds

2021

Graph Coding for Model Selection and Anomaly Detection in Gaussian Graphical Models

2020

Differential Description Length for Hyperparameter Selection in Supervised Learning

2019

User Empathy in Smart Energy Management

2018

Coding of graphs with application to graph anomaly detection

Greedy algorithm with approximation ratio for sampling noisy graph signals

Who is more at risk in heterogeneous networks?

2017

Transmissibility of intra-host hepatitis C virus variants

Cascading edge failures: a dynamic network process

Contact process with exogenous infection and the scaled SIS process

2016

Spectral radius and network processes with spontaneous infection/failure rate

Finding unique dense communities