Social and Information Networks
Showing new listings for Friday, 1 November 2024
- [1] arXiv:2410.23638 [pdf, html, other]
Title: Unearthing a Billion Telegram Posts about the 2024 U.S. Presidential Election: Development of a Public Dataset
Comments: HUMANS Lab -- Working Paper No. 2024.5 -- The 2024 Election Integrity Initiative -- University of Southern California
Subjects: Social and Information Networks (cs.SI)
With its lenient moderation policies and long-standing associations with potentially unlawful activities, Telegram has become an incubator for problematic content, frequently featuring conspiratorial, hyper-partisan, and fringe narratives. In the political sphere, these concerns are amplified by reports of Telegram channels being used to organize violent acts, such as those that occurred during the Capitol Hill attack on January 6, 2021. As the 2024 U.S. election approaches, Telegram remains a focal arena for societal and political discourse, warranting close attention from the research community, regulators, and the media. Based on these premises, we introduce and release a Telegram dataset focused on the 2024 U.S. Presidential Election, featuring over 30,000 chats and half a billion messages, including chat details, profile pictures, messages, and user information. We constructed a network of chats and analyzed the 500 most central ones, examining their shared messages. This resource represents the largest public Telegram dataset to date, offering an unprecedented opportunity to study political discussion on Telegram in the lead-up to the 2024 U.S. election. We will continue to collect data until the end of 2024, and routinely update the dataset released at: this https URL
New submissions (showing 1 of 1 entries)
- [2] arXiv:2410.23432 (cross-list from cs.CY) [pdf, html, other]
Title: Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations
Subjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Scientists across disciplines often use data from the internet to conduct research, generating valuable insights about human behavior. However, as generative AI relying on massive text corpora becomes increasingly valuable, platforms have greatly restricted access to data through official channels. As a result, researchers will likely engage in more web scraping to collect data, introducing new challenges and concerns for researchers. This paper proposes a comprehensive framework for web scraping in social science research for U.S.-based researchers, examining the legal, ethical, institutional, and scientific factors that researchers should consider when scraping the web. We present an overview of the current regulatory environment impacting when and how researchers can access, collect, store, and share data via scraping. We then provide researchers with recommendations to conduct scraping in a scientifically legitimate and ethical manner. We aim to equip researchers with the relevant information to mitigate risks and maximize the impact of their research amidst this evolving data access landscape.
- [3] arXiv:2410.23799 (cross-list from cs.DM) [pdf, html, other]
Title: Clustering Coefficient Reflecting Pairwise Relationships within Hyperedges
Subjects: Discrete Mathematics (cs.DM); Social and Information Networks (cs.SI)
Hypergraphs are generalizations of simple graphs that allow for the representation of complex group interactions beyond pairwise relationships. Clustering coefficients, which quantify the local link density in networks, have been widely studied even for hypergraphs. However, existing definitions of clustering coefficients for hypergraphs do not fully capture the pairwise relationships within hyperedges. In this study, we propose a novel clustering coefficient for hypergraphs that addresses this limitation by transforming the hypergraph into a weighted graph and calculating the clustering coefficient on the resulting graph. Our definition reflects the local link density more accurately than existing definitions. We demonstrate through theoretical evaluation on higher-order motifs that the proposed definition is consistent with the clustering coefficient for simple graphs and effectively captures relationships within hyperedges missed by existing definitions. Empirical evaluation on real-world hypergraph datasets shows that our definition exhibits similar overall clustering tendencies as existing definitions while providing more precise measurements, especially for hypergraphs with larger hyperedges. The proposed clustering coefficient has the potential to reveal structural characteristics in complex hypergraphs that are not detected by existing definitions, leading to a deeper understanding of the underlying interaction patterns in complex hypergraphs.
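The abstract above describes transforming a hypergraph into a weighted graph and computing a clustering coefficient on the result, but it does not give the exact weighting or formula. The following is a minimal sketch under stated assumptions, not the paper's definition: it uses a clique expansion in which the weight of a pair is the number of hyperedges containing both nodes, and an Onnela-style weighted local clustering coefficient (geometric mean of normalized triangle weights). The function names `clique_expansion` and `weighted_clustering` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def clique_expansion(hyperedges):
    """Project hyperedges onto a weighted graph (assumed weighting:
    w[u][v] = number of hyperedges that contain both u and v)."""
    weights = defaultdict(dict)
    for edge in hyperedges:
        for u, v in combinations(sorted(set(edge)), 2):
            weights[u][v] = weights[u].get(v, 0) + 1
            weights[v][u] = weights[v].get(u, 0) + 1
    return weights


def weighted_clustering(weights, node):
    """Onnela-style weighted local clustering coefficient of `node`.

    Each triangle contributes the geometric mean of its three edge
    weights, normalized by the largest weight in the graph.
    """
    neighbors = list(weights.get(node, {}))
    k = len(neighbors)
    if k < 2:
        return 0.0
    w_max = max(w for nbrs in weights.values() for w in nbrs.values())
    total = 0.0
    for j, l in combinations(neighbors, 2):
        w_jl = weights[j].get(l)
        if w_jl is None:
            continue  # j and l are not adjacent, so no triangle closes here
        product = weights[node][j] * weights[node][l] * w_jl
        total += (product / w_max ** 3) ** (1.0 / 3.0)
    # Factor 2 because combinations() visits each unordered pair once.
    return 2.0 * total / (k * (k - 1))
```

When every projected edge has weight 1, this reduces to the usual local clustering coefficient of a simple graph, which is the kind of consistency property the abstract mentions; whether the paper's own definition behaves the same way on larger hyperedges is not something this sketch can confirm.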
- [4] arXiv:2410.23855 (cross-list from cs.LG) [pdf, other]
Title: RAGraph: A General Retrieval-Augmented Graph Learning Framework
Authors: Xinke Jiang, Rihong Qiu, Yongxin Xu, Wentao Zhang, Yichen Zhu, Ruizhe Zhang, Yuchen Fang, Xu Chu, Junfeng Zhao, Yasha Wang
Comments: NeurIPS 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Graph Neural Networks (GNNs) have become essential in interpreting relational data across various domains, yet they often struggle to generalize to unseen graph data that differs markedly from training instances. In this paper, we introduce a novel framework called General Retrieval-Augmented Graph Learning (RAGraph), which brings external graph data into the general graph foundation model to improve model generalization on unseen scenarios. On the top of our framework is a toy graph vector library that we established, which captures key attributes, such as features and task-specific label information. During inference, RAGraph adeptly retrieves similar toy graphs based on key similarities in downstream tasks, integrating the retrieved data to enrich the learning context via the message-passing prompting mechanism. Our extensive experimental evaluations demonstrate that RAGraph significantly outperforms state-of-the-art graph learning methods in multiple tasks such as node classification, link prediction, and graph classification across both dynamic and static datasets. Furthermore, extensive testing confirms that RAGraph consistently maintains high performance without the need for task-specific fine-tuning, highlighting its adaptability, robustness, and broad applicability.
Cross submissions (showing 3 of 3 entries)
- [5] arXiv:2301.11486 (replaced) [pdf, html, other]
Title: Sub-Standards and Mal-Practices: Misinformation's Role in Insular, Polarized, and Toxic Interactions on Reddit
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
In this work, we examine the influence of unreliable information on political incivility and toxicity on the social media platform Reddit. We show that comments on articles from unreliable news websites are posted more often in right-leaning subreddits and that within individual subreddits, comments, on average, are 32% more likely to be toxic compared to comments on reliable news articles. Using a regression model, we show that these results hold after accounting for partisanship and baseline toxicity rates within individual subreddits. Utilizing a zero-inflated negative binomial regression, we further show that as the toxicity of subreddits increases, users are more likely to comment on posts from known unreliable websites. Finally, modeling user interactions with an exponential random graph model, we show that when reacting to a Reddit submission that links to a website known for spreading unreliable information, users are more likely to be toxic to users of different political beliefs. Our results collectively illustrate that low-quality/unreliable information not only predicts increased toxicity but also polarizing interactions between users of different political orientations.
- [6] arXiv:2305.16590 (replaced) [pdf, html, other]
Title: Seeding with Differentially Private Network Information
Subjects: Social and Information Networks (cs.SI); Computational Complexity (cs.CC); Multiagent Systems (cs.MA); Probability (math.PR); Applications (stat.AP)
In public health interventions such as the distribution of preexposure prophylaxis (PrEP) for HIV prevention, decision makers rely on seeding algorithms to identify key individuals who can amplify the impact of their interventions. In such cases, building a complete sexual activity network is often infeasible due to privacy concerns. Instead, contact tracing can provide influence samples, that is, sequences of sexual contacts without requiring complete network information. This presents two challenges: protecting individual privacy in contact data and adapting seeding algorithms to work effectively with incomplete network information. To solve these two problems, we study privacy guarantees for influence maximization algorithms when the social network is unknown and the inputs are samples of prior influence cascades that are collected at random and need privacy protection. Building on recent results that address seeding with costly network information, our privacy-preserving algorithms introduce randomization in the collected data or the algorithm output and can bound the privacy loss of each node (or group of nodes) in deciding to include their data in the algorithm input. We provide theoretical guarantees of seeding performance with a limited sample size subject to differential privacy budgets in both central and local privacy regimes. Simulations on synthetic random graphs and empirically grounded sexual contacts of men who have sex with men reveal the diminishing value of network information with decreasing privacy budget in both regimes and graceful decrease in performance with decreasing privacy budget in the central regime. Achieving good performance with local privacy guarantees requires relatively higher privacy budgets that confirm our theoretical expectations.
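The abstract above mentions introducing randomization into the collected data under a local privacy regime, without naming a mechanism. One standard primitive for local differential privacy is randomized response, sketched below for illustration only; it is not claimed to be the paper's algorithm. Representing an influence sample as a list of 0/1 indicators (1 if a node appears in the cascade) is an assumption made for this example, and the function names are hypothetical.

```python
import math
import random


def flip_probability(epsilon):
    """Probability of flipping each bit so that one bit is epsilon-locally private."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomized_response(bits, epsilon, rng):
    """Flip each 0/1 entry independently with probability 1 / (1 + e^epsilon)."""
    q = flip_probability(epsilon)
    return [1 - b if rng.random() < q else b for b in bits]


def debiased_mean(noisy_bits, epsilon):
    """Unbiased estimate of the true fraction of ones from the noisy bits."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive to invert the noise")
    q = flip_probability(epsilon)
    observed = sum(noisy_bits) / len(noisy_bits)
    # E[observed] = p * (1 - q) + (1 - p) * q, solved for p.
    return (observed - q) / (1.0 - 2.0 * q)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [1, 0, 0, 1, 1, 0, 1, 0] * 500  # hypothetical indicator data
    noisy = randomized_response(sample, epsilon=1.0, rng=rng)
    print(debiased_mean(noisy, epsilon=1.0))
```

A smaller `epsilon` pushes the flip probability toward 1/2, so the debiased estimate becomes noisier for a fixed sample size. This is in line with the abstract's observation that good performance under local privacy requires relatively higher privacy budgets.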
- [7] arXiv:2307.10349 (replaced) [pdf, html, other]
Title: Twits, Toxic Tweets, and Tribal Tendencies: Trends in Politically Polarized Posts on Twitter
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
Social media platforms are often blamed for exacerbating political polarization and worsening public dialogue. Many claim that hyperpartisan users post pernicious content, slanted to their political views, inciting contentious and toxic conversations. However, what factors are actually associated with increased online toxicity and negative interactions? In this work, we explore the role that partisanship and affective polarization play in contributing to toxicity both on an individual user level and a topic level on Twitter/X. To do this, we train and open-source a DeBERTa-based toxicity detector with a contrastive objective that outperforms the Google Jigsaw Perspective Toxicity detector on the Civil Comments test dataset. Then, after collecting 89.6 million tweets from 43,151 Twitter/X users, we determine how several account-level characteristics, including partisanship along the US left-right political spectrum and account age, predict how often users post toxic content. Fitting a Generalized Additive Model to our data, we find that the diversity of views and the toxicity of the other accounts with which a user engages have a more marked effect on that user's own toxicity. Namely, toxic comments are correlated with users who engage with a wider array of political views. Performing topic analysis on the toxic content posted by these accounts using the large language model MPNet and a version of the DP-Means clustering algorithm, we find similar behavior across 5,288 individual topics, with users becoming more toxic as they engage with a wider diversity of politically charged topics.
- [8] arXiv:2410.18492 (replaced) [pdf, other]
Title: Improving Information Diffusion Prediction by Tackling Noise and Sparsity Challenges
Comments: Equation 20 in section 4.4.2 contains an error; the calculation method is incorrect
Subjects: Social and Information Networks (cs.SI)
With the widespread use of online social media platforms, information diffusion has become a prevalent phenomenon, making Information Diffusion Prediction (IDP) increasingly important for various applications. Despite significant advancements in IDP research, existing methods often overlook issues of noise and sparsity in information diffusion data. User behaviors are frequently influenced by external factors, introducing noise into the data and hindering models' understanding of true diffusion patterns. Additionally, many users have limited interaction data, leading to data sparsity and restricting models' ability to effectively capture user preferences. To address these challenges, we propose a novel framework called DDiff, which tackles noise and sparsity issues through denoising diffusion and cross-domain contrastive learning. First, we introduce a graph learning encoder module that captures the social homophily of users through their relationships and higher-order connections via information diffusion hypergraphs (IDH). Next, a cross-domain contrastive learning module is designed to facilitate effective knowledge transfer between the information and social domains, addressing the sparsity problem. Furthermore, we propose a denoising diffusion module with IDH to effectively mitigate noise issues by introducing random noise in the forward process and iteratively recovering the corrupted embeddings in the reverse process. Finally, we implement a prediction module to determine the likelihood of subsequent users becoming infected. Experimental results demonstrate that DDiff significantly outperforms state-of-the-art methods in the information diffusion prediction task.
- [9] arXiv:2410.18742 (replaced) [pdf, other]
Title: Continuous Dynamic Modeling via Neural ODEs for Popularity Trajectory Prediction
Comments: The time complexity analysis in section 4.4 contains an error; we overlooked the impact of the memory module
Subjects: Social and Information Networks (cs.SI)
Popularity prediction for information cascades has significant applications across various domains, including opinion monitoring and advertising recommendations. While most existing methods consider this as a discrete problem, popularity actually evolves continuously, exhibiting rich dynamic properties such as change rates and growth patterns. In this paper, we argue that popularity trajectory prediction is more practical, as it aims to forecast the entire trajectory of how popularity unfolds over arbitrary future time. This approach offers insights into both instantaneous popularity and the underlying dynamic properties. However, traditional methods for popularity trajectory prediction primarily rely on specific diffusion mechanism assumptions, which may not align well with real-world dynamics and compromise their performance. To address these limitations, we propose NODEPT, a novel approach based on neural ordinary differential equations (ODEs) for popularity trajectory prediction. NODEPT models the continuous dynamics of the underlying diffusion system using neural ODEs. We first employ an encoder to initialize the latent state representations of information cascades, consisting of two representation learning modules that capture the co-evolution structural characteristics and temporal patterns of cascades from different perspectives. More importantly, we then introduce an ODE-based generative module that learns the dynamics of the diffusion system in the latent space. Finally, a decoder transforms the latent state into the prediction of the future popularity trajectory. Our experimental results on three real-world datasets demonstrate the superiority and rationality of the proposed NODEPT method.
- [10] arXiv:2402.02518 (replaced) [pdf, html, other]
Title: Unifying Generation and Prediction on Graphs with Latent Graph Diffusion
Comments: Accepted to NeurIPS 2024
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
In this paper, we propose the first framework that enables solving graph learning tasks of all levels (node, edge and graph) and all types (generation, regression and classification) using one formulation. We first formulate prediction tasks including regression and classification into a generic (conditional) generation framework, which enables diffusion models to perform deterministic tasks with provable guarantees. We then propose Latent Graph Diffusion (LGD), a generative model that can generate node, edge, and graph-level features of all categories simultaneously. We achieve this goal by embedding the graph structures and features into a latent space leveraging a powerful encoder and decoder, then training a diffusion model in the latent space. LGD is also capable of conditional generation through a specifically designed cross-attention mechanism. Leveraging LGD and the "all tasks as generation" formulation, our framework is capable of solving graph tasks of various levels and types. We verify the effectiveness of our framework with extensive experiments, where our models achieve state-of-the-art or highly competitive results across a wide range of generation and regression tasks.
- [11] arXiv:2402.11821 (replaced) [pdf, html, other]
Title: Microstructures and Accuracy of Graph Recall by Large Language Models
Comments: Accepted at NeurIPS 2024; Code available at: this https URL
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Graph data is crucial for many applications, and much of it exists in the relations described in textual format. As a result, being able to accurately recall and encode a graph described in earlier text is a basic yet pivotal ability that LLMs need to demonstrate if they are to perform reasoning tasks that involve graph-structured information. Human performance at graph recall has been studied by cognitive scientists for decades, and has been found to often exhibit certain structural patterns of bias that align with human handling of social relationships. To date, however, we know little about how LLMs behave in analogous graph recall tasks: do their recalled graphs also exhibit certain biased patterns, and if so, how do they compare with humans and affect other graph reasoning tasks? In this work, we perform the first systematic study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall. We find that LLMs not only often underperform in graph recall, but also tend to favor more triangles and alternating 2-paths. Moreover, we find that more advanced LLMs have a striking dependence on the domain that a real-world graph comes from -- by yielding the best recall accuracy when the graph is narrated in a language style consistent with its original domain.