HaSa: Hardness and Structure-Aware Contrastive Knowledge Graph Embedding
ABSTRACT
We consider a contrastive learning approach to knowledge graph embedding (KGE) via InfoNCE. For KGE, efficient learning relies on augmenting the training data with negative triples. However, most KGE works overlook the bias from generating the negative triples— false negative triples (factual triples missing from the knowledge graph). We argue that the generation of high-quality (i.e., hard) neg- ative triples might lead to an increase in false negative triples. To mitigate the impact of false negative triples during the generation of hard negative triples, we propose the Hardness and Structure-aware (HaSa) contrastive KGE method, which alleviate the effect of false negative triples while generating the hard negative triples. Experi- ments show that HaSa improves the performance of InfoNCE-based KGE approaches and achieves state-of-the-art results in several met- rics for WN18RR datasets and competitive results for FB15k-237 datasets compared to both classic and pre-trained LM-based KGE methods.
Proceedings of the ACM Web Conference 2024 (WWW), Singapore, Singapore
H.Zhang, J.Zhang, I.Molybog
arXiv
H.Zhang, J.Zhang, I.Molybog
Searching for Explanations: Testing Social Scientific Methods in Synthetic Ground-Truthed Worlds
ABSTRACT
A scientific model’s usefulness relies on its ability to explain phenomena, predict how such phenomena will be impacted by future interventions, and prescribe actions to achieve desired outcomes. We study methods for learning causal models that explain the behaviors of simulated “human” populations. Through the Ground Truth project, we solved a series of Challenges where our explanations, predictions and prescriptions were scored against ground truth information. We describe the processes that emerged for applying causal discovery, network analysis, agent-based modeling and other analytical methods to inform solutions to Challenge tasks. We present our team’s overall performance results on these Challenges and discuss implications for future efforts to validate social scientific research using simulation-based challenges.
Computational and Mathematical Organization Theory
A. C. Schmidt et. al.
A. C. Schmidt et. al.
Graph Coding for Model Selection and Anomaly Detection in Gaussian Graphical Models
ABSTRACT
A classic application of description length is for model selection with the minimum description length (MDL) principle. The focus of this paper is to extend description length for data analysis beyond simple model selection and sequences of scalars. More specifically, we extend the description length for data analysis in Gaussian graphical models. These are powerful tools to model interactions among variables in a sequence of i.i.d Gaussian data in the form of a graph. Our method uses universal graph coding methods to accurately account for model complexity, and therefore provide a more rigorous approach for graph model selection. The developed method is tested with synthetic and electrocardiogram (ECG) data to find the graph model and anomaly in Gaussian graphical models. The experiments show that our method gives better performance compared to commonly used methods.
Proceedings of the International Symposium on Information Theory 2021 (ISIT)
M.Abolfazli, A.Høst-Madsen,J.Zhang, and A.Bratincsak
arXiv
M.Abolfazli, A.Høst-Madsen,J.Zhang, and A.Bratincsak
Differential Description Length for Hyperparameter Selection in Supervised Learning
ABSTRACT
Minimum description length (MDL) is an established method for model selection. For supervised learning problems, cross-validation is often used for model selection in practice. Reasons are 1) MDL is difficult to apply directly to data; 2) MDL may make restrictive statistical assumptions that decrease performance; and 3) MDL does not directly aim to minimize generalization error. In this paper, we introduce a modification to MDL, which we call differential description length (DDL). DDL partitions the data so that the codelength(s) it computes, reflects the conditional probability of seeing ‘new’ data given ‘old’ data. This differential codelength is what allows DDL to estimate generalization error like cross-validation. DDL is also better than cross-validation because it allows the learning algorithm to use the entire data without having to withhold subsets for validation and testing. Compared with MDL, DDL has both better performance (in finding models with smaller generalization error) and is easier to compute. Experiments with linear regression and deep neural networks show that DDL also outperforms cross-validation.
Proceedings of the International Symposium on Information Theory and Its Applications 2020 (ISITA)
M.Abolfazli, A.Høst-Madsen, and J.Zhang
arXiv
M.Abolfazli, A.Høst-Madsen, and J.Zhang
User Empathy in Smart Energy Management
ABSTRACT
This paper investigates interaction among residential electricity users and utility company in a distribution network with the capability of two-way communication provided by smart grid. The energy consumption scheduling of electricity users is formulated as a game-theoretic problem within a group where all players are not totally selfish. Considering empathetic behavior of human decision-making, empathetic player action to other players actions can be influenced by recognizing the well being of others. The proposed model captures the empathy of electricity users in energy scheduling and demonstrates how this behavior can decrease energy consumption and electricity prices. Furthermore, it is shown that selfish users cannot make full use of energy saving by empathetic users.
Proceedings of the IEEE Power & Energy Society General Meeting 2019 (PESGM)
M. Abolfazli, J. Zhang, and A. Kuh
M. Abolfazli, J. Zhang, and A. Kuh
Coding of graphs with application to graph anomaly detection
ABSTRACT
This paper has dual aims. First is to develop practical universal coding methods for unlabeled graphs. Second is to use these for graph anomaly detection. The paper develops two coding methods for unlabeled graphs: one based on the degree distribution, the second based on the triangle distribution. It is shown that these are efficient for different types of random graphs, and on real-world graphs. These coding methods is then used for detecting anomalous graphs, based on structure alone. It is shown that anomalous graphs can be detected with high probability.
Proceedings of IEEE International Symposium on Information Theory 2018 (ISIT)
A. Høst-Madsen and J. Zhang
arXiv
A. Høst-Madsen and J. Zhang
Greedy algorithm with approximation ratio for sampling noisy graph signals
ABSTRACT
We study the optimal sampling set selection problem in sampling a noisy $k$-bandlimited graph signal. To minimize the effect of noise when trying to reconstruct a $k$-bandlimited graph signal from $m$ samples, the optimal sampling set selection problem has been shown to be equivalent to finding a $m \times k$ submatrix with the maximum smallest singular value, $\sigma_{\min}$ \cite{chen2015discrete}. As the problem is NP-hard, we present a greedy algorithm inspired by a similar submatrix selection problem known in computer science and to which we add a local search refinement. We show that 1) in experiments, our algorithm finds a submatrix with larger $\sigma_{\min}$ than prior greedy algorithm \cite{chen2015discrete}, and 2) has a proven worst-case approximation ratio of $1/(1+\epsilon)k$, where $\epsilon$ is a constant.
Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 2018 (ICASSP)
C. Wu, W. Chen, and J. Zhang,
C. Wu, W. Chen, and J. Zhang,
Who is more at risk in heterogeneous networks?
ABSTRACT
Network-based epidemics models try to characterize the impact of network topology, which represents contagion pathways, on the spread of infection. Although these models explicitly consider the dynamics of individuals in the given network (i.e., the state of the system is $\mathbf{x}(t)= [x_1(t), x_2(t), \ldots, x_N(t)]^T$), analysis has focused on characterizing the vulnerability of the \emph{entire} population rather than the vulnerability of the individuals in the population. We focus on characterizing the vulnerability of the $i$th individual in the network by studying the marginal probability of infection, $P(x_i = 1)$, of the scaled SIS process. Studying the vulnerability of individuals is important because it may be tempting to assume that $P(x_i =1)$ is related to the degree of the $i$th node. Since infection rate is usually assumed to be dependent on the number of infected neighbors, then it seems reasonable that nodes with more connections (i.e., higher degree) would be more at risk. We show that this is \emph{not} always true. Further, with a closed-form approximation of $P(x_i =1)$, as solving for the exact probability requires the summation of $2^N$ terms, we characterize the conditions for when degree distribution is a good indicator of how susceptible an individual is to infection.
Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
J. Zhang and J.M.F. Moura,
J. Zhang and J.M.F. Moura,
Transmissibility of intra-host hepatitis C virus variants
ABSTRACT
Intra-host hepatitis C virus (HCV) population is genetically heterogeneous and organized in subpopulations. With exception of blood transfusion, transmission of HCV occurs via a small number of genetic variants, the effect of which is frequently described as a bottleneck. Stochasticity of transmission associated with the bottleneck is usually used to explain genetic differences among HCV populations identified in the source and recipient cases, which may be further exacerbated by intra-host HCV evolution and differential biological capacity of HCV variants to successfully establish population in a new host.
BMC Genomics, vol. 18, no. 10, 2017
D.S. Campo, J. Zhang, S. Ramachandran, Y. Khudyakov
paper
D.S. Campo, J. Zhang, S. Ramachandran, Y. Khudyakov
Cascading edge failures: a dynamic network process
ABSTRACT
This paper studies a network process that can potentially be used to model cascading failures in networks. The Dynamic Bond Percolation (DBP) process models, through stochastic local rules, the failure or recovery of an edge $(i,j)$ in a network. The probability that a working link fails or a failed link recovers may be independent of the state of other links \emph{or} may be dependent locally on the state of neighboring links as described by a cascade function $f$. In applications, this means that failures or recovery of links may have a regional preference, or, alternatively, relationships between neighbors in the network can lead to changes in the links between neighbors of neighbors. This paper shows that the dynamic evolution of $P(\mathbf{A},t)$, the probability that the network is in some state $\mathbf{A}$, describing the collective states of all the links, at time $t$ converges show that it converges to a stationary distribution. We use this distribution to study the emergence of global behaviors like consensus (i.e., catastrophic failure or full recovery of all the edges) or mixed (i.e., some failed and some working substructures). In particular, we show that, depending on the local dynamical rule, different network substructures, such as hub or triangle subgraphs, are more prone to failure.
IEEE Transactions on Network Science and Engineering DOI, 2017
J. Zhang and J.M.F. Moura
paper
J. Zhang and J.M.F. Moura
Contact process with exogenous infection and the scaled SIS process
ABSTRACT
Propagation of contagion in networks depends on the graph topology. This paper is concerned with studying the time-asymptotic behavior of the extended contact processes on static, undirected, finite-size networks. This is a contact process with nonzero exogenous infection rate (also known as the $\epsilon$-SIS model). The only known analytical characterization of the equilibrium distribution of this process is for complete networks. For large networks with arbitrary topology, it is infeasible to numerically solve for the equilibrium distribution since it requires solving the eigenvalue-eigenvector problem of a matrix that is exponential in $N$, the size of the network. We derive a condition on the infection rates under which, depending on the degree distribution of the network, the equilibrium distribution of extended contact processes on \emph{arbitrary}, finite-size networks is well approximated by a closed-form formulation. We confirm the goodness of the approximation with small networks answering inference questions like the distribution of the percentage of infected individuals and the most-probable equilibrium configuration. We then use the approximation to analyze the equilibrium distribution of the extended contact process on the 4941-node US Western power grid.
Journal of Complex Networks, vol. 5, no. 5, Oct 2017
J. Zhang and J.M.F. Moura
paper
arXiv
J. Zhang and J.M.F. Moura
Spectral radius and network processes with spontaneous infection/failure rate
ABSTRACT
We relate the time-limiting behavior of a network epidemics process to the spectral radius of the underlying network. The process we study is the scaled SIS network process, a continuous-time Markov process on a static network. Our analysis differs from previous work in that the scaled SIS process accounts for the possibility that a healthy individual has a nonzero probability of becoming infected even when all of its neighbors are healthy. For example, the source of infection may be outside a human only contact network for diseases with animal to human transmissions such as Ebola. We show that the sufficient condition for infection to become extinct not only depends on the ratio of infection and healing rates but also on $N$, the size of the network, whereas previous models, assuming no exogenous infection, showed dependency only on the infection and healing rates.
Proceedings of 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
J. Zhang and J.M.F. Moura
paper
J. Zhang and J.M.F. Moura
Finding unique dense communities
ABSTRACT
Finding densely connected subgraphs, also called communities, in networks are of interest for many applications. In previous work, we showed an optimization method for efficiently finding subgraphs denser than the overall network. This result is derived from our studies of network processes, dynamical processes that model interactions between individual agents in networks (i.e., spread of infection or cascading failures). In this paper, we prove that these subgraphs are also unique in the sense that there are no other subgraphs in the network isomorphic to these subgraphs.
Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
J. Zhang and J.M.F. Moura
paper
J. Zhang and J.M.F. Moura