Soft Prompt Learning with Internal Knowledge Expansion for Clickbait Detection

doi:10.16451/j.cnki.issn1003-6059.202409004

Abstract: The main purpose of clickbait is to increase page views and advertising revenue by enticing users to click on links. Clickbait content is often low-quality, misleading, or false, and can adversely affect users. Existing prompt learning methods based on pre-trained language models rely on external open knowledge bases to detect clickbait. As a result, their performance is limited by the quality and availability of those knowledge bases, and they inevitably incur query and response latency. To address this issue, a soft prompt learning method with internal knowledge expansion for clickbait detection (SPCD_IE) is proposed in this paper. Expansion words are extracted from the training dataset itself, and hierarchical clustering and optimization strategies are employed to fine-tune the obtained expansion words during prompt learning, so that no knowledge needs to be retrieved from external knowledge bases. Moreover, soft prompt learning is used to obtain the best prompts for specific text types, avoiding the bias introduced by manual templates. In few-shot scenarios, although SPCD_IE expands knowledge from internal sources only, experimental results on three public clickbait datasets show that it achieves better detection performance in less time.

Key words: Clickbait Detection, Soft Prompt, Internal Knowledge Expansion, Prompt Learning
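The abstract describes extracting expansion words from the training data and grouping them by hierarchical clustering, but it gives no scoring rule or distance measure. The following is only a minimal sketch of that general idea, not the authors' SPCD_IE implementation. The function names `extract_expansion_words` and `agglomerative_clusters`, the smoothed log frequency ratio used for ranking, the average-linkage rule, and the toy `SAMPLES` data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math

# Toy labelled headlines (hypothetical data, not from the paper's datasets).
SAMPLES = [
    ("you won't believe what happened next", "clickbait"),
    ("you won't believe these secrets", "clickbait"),
    ("government announces new budget", "news"),
    ("government passes new law", "news"),
]


def extract_expansion_words(samples, top_k=3, smoothing=1.0):
    """Rank words per label using only the training samples themselves.

    Assumed scoring rule: smoothed log ratio of a word's frequency in
    its own class to its frequency in all other classes.
    """
    counts = {}
    for text, label in samples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set().union(*counts.values())

    result = {}
    for label, own in counts.items():
        other = Counter()
        for other_label, other_counts in counts.items():
            if other_label != label:
                other.update(other_counts)
        own_total = sum(own.values()) + smoothing * len(vocab)
        other_total = sum(other.values()) + smoothing * len(vocab)

        def score(word):
            return (math.log((own[word] + smoothing) / own_total)
                    - math.log((other[word] + smoothing) / other_total))

        # Highest score first; ties are broken alphabetically.
        ranked = sorted(own, key=lambda w: (-score(w), w))
        result[label] = ranked[:top_k]
    return result


def agglomerative_clusters(items, distance, n_clusters):
    """Bottom-up (agglomerative) clustering with average linkage.

    `distance` is any symmetric function on two items; for words it
    would typically compare embeddings (an assumption, not shown here).
    """
    clusters = [[item] for item in items]
    while len(clusters) > n_clusters:
        def linkage(pair):
            a, b = pair
            total = sum(distance(x, y)
                        for x in clusters[a] for y in clusters[b])
            return total / (len(clusters[a]) * len(clusters[b]))

        # Merge the two closest clusters; a < b, so deleting b is safe.
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


expansion_words = extract_expansion_words(SAMPLES, top_k=3)
```

In this sketch the words kept for each label come only from the labelled samples, so no external knowledge base is queried. The clustering helper takes the distance as a parameter because the abstract does not state which word representation is used.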

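Received: 2024-05-08

CLC number (ZTFLH): TP 391

Funding: Supported by the National Natural Science Foundation of China (No. 62120106008)

Corresponding author: WU Xindong, Ph.D., professor. Research interests: data mining, big data analytics, and knowledge-based systems. E-mail: xwu@hfut.edu.cn.

About the author: DONG Bingbing, Ph.D. candidate. Research interests: data mining. E-mail: blingdong@mail.hfut.edu.cn.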

Cite this article:

DONG Bingbing, WU Xindong. Soft Prompt Learning with Internal Knowledge Expansion for Clickbait Detection[J]. Pattern Recognition and Artificial Intelligence, 2024, 37(9): 798-810.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202409004 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2024/V37/I9/798
