7.1 Basic Concepts of Frequent Itemsets and Association Rules
What Is Frequent Pattern Analysis?
(1) An Application:
Market Basket Analysis
(2) Goal: Identify items that are bought together by sufficiently many customers
(3) Approach: Process the sales data collected with barcode scanners to find dependencies among items
A classic rule:
If someone buys diapers and milk, then he/she is likely to buy beer
Finding Association Rules
● Step 1: find all frequent itemsets
● Step 2: generate strong rules from the frequent itemsets
● Step 2 is easier, so most work on association rule mining focuses on Step 1
● A solution for Step 2:
● For each frequent itemset l, generate all non-empty proper subsets of l; for every such subset s, output the rule "s → (l − s)"
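Step 2 can be sketched in a few lines of Python. This is a minimal illustration, not a full Apriori implementation: it assumes Step 1 has already produced the support count of every frequent itemset, and it treats a rule as "strong" when its confidence, support(l) / support(s), reaches a threshold. The function name `generate_rules`, the `min_conf` parameter, and the support counts are all invented for this example.

```python
from itertools import combinations


def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for every strong rule.

    support_counts maps each frequent itemset (a frozenset) to its support
    count and must also contain every subset of every frequent itemset.
    """
    rules = []
    for itemset, count in support_counts.items():
        # Proper non-empty subsets only: sizes 1 .. len(itemset) - 1.
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                s = frozenset(subset)
                confidence = count / support_counts[s]
                if confidence >= min_conf:
                    rules.append((s, itemset - s, confidence))
    return rules


# Hypothetical support counts, invented for illustration only.
support_counts = {
    frozenset({"diaper"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"beer"}): 3,
    frozenset({"diaper", "milk"}): 3,
    frozenset({"diaper", "beer"}): 3,
    frozenset({"milk", "beer"}): 2,
    frozenset({"diaper", "milk", "beer"}): 2,
}

rules = generate_rules(support_counts, min_conf=0.7)
for s, rest, conf in rules:
    print(sorted(s), "->", sorted(rest), round(conf, 2))
```

With these made-up counts, the rule {milk, beer} → {diaper} has confidence 2/2 = 1.0 and is kept, while {diaper} → {milk, beer} has confidence 2/4 = 0.5 and is dropped.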
7.2 Basic Concepts of Clustering
1. Cluster: A collection of data objects
- similar (or related) to one another within the same group
- dissimilar (or unrelated) to the objects in other groups
2. Clustering (or cluster analysis, data segmentation, ...)
- Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
3. Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
4. Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
Partitioning Algorithms: Basic Concept
1. Partitioning method: Partitioning a database D of n objects into a set of k clusters such that the sum of squared distances E = Σ_{i=1..k} Σ_{p ∈ Ci} d(p, ci)² is minimized (where ci is the centroid or medoid of cluster Ci)
2. Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
(1) Global optimum: exhaustively enumerate all partitions
(2) Heuristic methods: k-means and k-medoids algorithms
(3) k-means (MacQueen'67, Lloyd'57/'82): Each cluster is represented by the center of the cluster
(4) k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
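The k-means heuristic can be sketched as follows. This is a minimal pure-Python version for illustration, assuming points are tuples of floats and, as a simplification, taking the first k points as initial centers (real implementations usually choose the initial centers more carefully). The helper names and the sample points are invented for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iters=100):
    centers = list(points[:k])  # assumption: first k points as initial centers
    for _ in range(max_iters):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, assign(points, centers)


def sse(centers, clusters):
    """Sum of squared distances of all points to their cluster center."""
    return sum(sq_dist(p, c)
               for c, cluster in zip(centers, clusters) for p in cluster)


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = kmeans(points, k=2)
total_sse = sse(centers, clusters)
print(centers, total_sse)
```

Each iteration alternates between assigning points to the nearest center and recomputing the centers, so the sum of squared distances never increases; the result is a local optimum, not necessarily the global one.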
What Is the Problem of the K-Means Method?
(1) The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially distort the distribution of the data
(2) K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
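A small one-dimensional example makes the difference concrete. The sketch below assumes the medoid is the object with the smallest total absolute distance to all other objects in the cluster; the values are invented, with 100 playing the role of the outlier.

```python
def medoid(values):
    """Return the object with the smallest total distance to all others."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


# Hypothetical cluster with one extreme value.
values = [1, 2, 3, 4, 100]

mean_value = sum(values) / len(values)
medoid_value = medoid(values)

print(mean_value)    # 22.0, pulled far away from the bulk of the data
print(medoid_value)  # 3, still an actual object in the middle of the cluster
```

The mean is dragged toward the outlier, while the medoid stays at a real, centrally located object.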
7.3 Basic Concepts of Classification
Supervised vs. Unsupervised Learning
1. Supervised learning (classification)
(1) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
(2) New data is classified based on the training set
2. Unsupervised learning (clustering)
(1) The class labels of the training data are unknown
(2) Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification vs. Numeric Prediction
1. Classification
(1) Predicts categorical class labels (discrete or nominal)
(2) Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
2. Typical applications
(1) Credit/loan approval
(2) Medical diagnosis: if a tumor is cancerous or benign
(3) Fraud detection: if a transaction is fraudulent
(4) Web page categorization: which category it is
3. Numeric prediction
Models continuous-valued functions, i.e., predicts unknown or missing values
Process (1): Model Construction
Classification Algorithms → Classifier (Model) →
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction
Testing Data → Classifier → Unseen Data (Jeff, Professor, 4)
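The learned rule can be written directly as a function and applied to the unseen tuple. This is only a sketch: the function name `predict_tenured` is invented, the comparison is made case-insensitive so that "Professor" matches 'professor', and returning "no" when the rule does not fire is an assumption, since the rule above only states the 'yes' case.

```python
def predict_tenured(rank, years):
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"  # assumption: default class when the rule does not fire


# Unseen data from the example: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
prediction = predict_tenured(rank, years)
print(name, "tenured?", prediction)  # Jeff tenured? yes
```

Jeff has only 4 years, but his rank satisfies the first condition, so the model predicts 'yes'.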
Classification: A Two-Step Process
1. Model construction: describing a set of predetermined classes
(1) Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
(2) The set of tuples used for model construction is the training set
(3) The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: for classifying future or unknown objects
(1) Estimate the accuracy of the model
(2) The known label of each test sample is compared with the classified result from the model
(3) The accuracy rate is the percentage of test set samples that are correctly classified by the model
(4) The test set is independent of the training set (otherwise the estimate is inflated by overfitting)
(5) If the accuracy is acceptable, use the model to classify new data
3. Note: If the test set is used to select models, it is called a validation (test) set
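The accuracy rate described above is a simple ratio. The sketch below assumes the known labels and the model's outputs are given as two lists of equal length; the function name and the label values are invented for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted label matches the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test set: known labels vs. the classifier's output.
true_labels = ["yes", "no", "yes", "yes", "no"]
predicted_labels = ["yes", "no", "no", "yes", "no"]

test_accuracy = accuracy(true_labels, predicted_labels)
print(test_accuracy)  # 0.8, i.e. 4 of 5 test samples classified correctly
```

The figure is only meaningful when the test samples were not used during model construction.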
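7.4 Basic Concepts of Knowledge Graphs
General-Purpose Knowledge Graphs
(1) The knowledge graph proposed by Google is a general-purpose knowledge graph covering all domains.
(2) General-purpose knowledge graphs are mainly used in Internet-facing scenarios such as search, recommendation, and question answering.
(3) A general-purpose knowledge graph emphasizes breadth, so it focuses more on entities; it is difficult to build a complete, global ontology layer that is managed in a unified way.
Projects Related to General-Purpose Knowledge Graphs
1. Linguistic:
(1) WordNet
(2) MIT - the Chinese part of ConceptNet5
(3) Chinese Open Wordnet
2. Encyclopedic:
(1) DBpedia
(2) CN-DBpedia (a Chinese general-purpose encyclopedic knowledge graph)
(3) Zhishi.me
(4) PKU-PIE knowledge base
Projects Related to Industry Knowledge Graphs
Industry Knowledge Graphs
(1) An industry knowledge graph is a knowledge graph built for a specific domain.
(2) The target users include personnel at all levels of the industry; different personnel have different operations and business scenarios, so a certain depth and completeness is required.
(3) Industry knowledge graphs demand very high accuracy and are typically used to support complex analytical applications or decision support.
(4) They have a strict and rich data schema; entities in an industry knowledge graph usually have many attributes, each with industry-specific meaning.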
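Biomedicine: Watson-Assisted Diagnosis and Treatment
· MD Anderson Cancer Center partnered with IBM Watson on a mission to end cancer and has invested USD 62.1 million
General-Purpose vs. Industry Knowledge Graphs
(1) Covers general domains / covers one specific domain
(2) Mainly common-sense knowledge / built from industry data
(3) "Structured encyclopedic knowledge" / "an industry knowledge base built on semantic technology"
(4) Emphasizes breadth of knowledge / emphasizes depth of knowledge
(5) Users are the general public / potential users are industry personnel
Key Technologies for Knowledge Applications
01 Semantic search 02 Intelligent question answering 03 Visualization-assisted decision making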
Further Reading
(1) Survey
(2) Knowledge Graph Construction Techniques
(3)Review on Knowledge Graph Techniques
(4)Reviews on Knowledge Graph Research
(5)The Research Advances of Knowledge Graph
(6)A Survey on Knowledge Graphs: Representation, Acquisition and Applications (2020)
(7)Knowledge Graphs (2020)
7.5 Introduction to Web Information Retrieval
Information Retrieval (IR)
(1) The indexing and retrieval of textual documents.
(2) Searching for pages on the World Wide Web is the most recent "killer app."
(3) Concerned firstly with retrieving documents relevant to a query.
(4) Concerned secondly with retrieving from large sets of documents efficiently.
Relevance is a subjective judgment and may include:
- Being on the proper subject.
- Being timely (recent information).
- Being authoritative (from a trusted source).
- Satisfying the goals of the user and his/her intended use of the information (information need).
Problems with Keywords
(1) May not retrieve relevant documents that include synonymous terms.
- “restaurant” vs. “café”
- “PRC” vs. “China”
(2) May retrieve irrelevant documents that include ambiguous terms.
- “bat” (baseball vs. mammal)
- “Apple” (company vs. fruit)
- “bit” (unit of data vs. act of eating)
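Both problems show up even in a toy exact-match search. The sketch below assumes a query is a single keyword matched against whitespace-separated, lowercased words; the four documents and the function name `keyword_search` are invented for illustration.

```python
# Hypothetical mini-collection: document id -> text.
docs = {
    1: "a cozy café near the station",
    2: "the bat flew out of the cave at dusk",
    3: "he swung the bat and hit a home run",
    4: "the best restaurant in town",
}


def keyword_search(query, docs):
    """Return ids of documents that contain the query as an exact word."""
    term = query.lower()
    return [doc_id for doc_id, text in docs.items()
            if term in text.lower().split()]


results_restaurant = keyword_search("restaurant", docs)
results_bat = keyword_search("bat", docs)

print(results_restaurant)  # [4]: document 1 (café) is relevant but missed
print(results_bat)         # [2, 3]: mammal and baseball senses are mixed
```

The synonym case hurts recall (a relevant document is not found), while the ambiguity case hurts precision (an irrelevant document is returned).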
Other IR-Related Tasks
(1)Automated document categorization
(2)Information filtering (spam filtering)
(3)Information routing
(4)Automated document clustering
(5)Recommending information or products
(6)Information extraction
(7)Information integration
(8)Question answering
History of IR
1. 1960s-70s:
(1) Initial exploration of text retrieval systems for "small" corpora of scientific abstracts, and law and business documents.
(2) Development of the basic Boolean and vector-space models of retrieval.
(3) Prof. Salton and his students at Cornell University were the leading researchers in the area.
2. 1980s:
(1) Large document database systems, many run by companies:
- Lexis-Nexis, Dialog, MEDLINE
3. 1990s:
(1) Searching FTPable documents on the Internet
- Archie, WAIS
(2) Searching the World Wide Web
- Lycos, Yahoo, Altavista
4. 2000s:
(1) Link analysis for Web search: Google
(2) Automated information extraction: Whizbang, Fetch, Burning Glass
(3) Question answering
- TREC Q/A track
5. 2000s continued:
(1) Multimedia IR: image, video, audio and music
(2) Cross-language IR: DARPA Tides
(3) Document summarization
Related Areas
(1)Database Management
(2)Library and Information Science
(3)Artificial Intelligence
(4)Natural Language Processing
(5)Machine Learning
7.6 Introduction to Recommender Systems
Recommender systems are software agents that elicit the interests and preferences of individual consumers [...] and make recommendations accordingly.
They have the potential to support and improve the quality of the decisions consumers make while searching for and selecting products online
[Xiao & Benbasat, MISQ, 2007]
Movie Recommendation
I like "Drama" movies
Here are the movies you may like ...
1. The Shawshank Redemption
2. Another Happy Day
3.…
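The interaction above can be imitated with a simple genre filter. This is a deliberately naive sketch, not how a production recommender works: the catalogue, its genre tags, and the function name `recommend` are assumptions made for this example.

```python
# Hypothetical catalogue: title -> set of genre tags (tags are assumptions).
catalog = {
    "The Shawshank Redemption": {"Drama"},
    "Another Happy Day": {"Drama", "Comedy"},
    "Toy Story": {"Animation"},
}


def recommend(liked_genres, catalog):
    """Return titles that share at least one genre with the user's interests."""
    return [title for title, genres in catalog.items()
            if genres & liked_genres]


recommendations = recommend({"Drama"}, catalog)
for rank, title in enumerate(recommendations, start=1):
    print(f"{rank}. {title}")
```

The stated preference ("Drama") acts as the user model, and the genre tags act as item descriptions; matching the two yields the list.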
Problem Domain
(1) Recommendation systems (RS) help to match users with items
- Ease information overload
- Sales assistance (guidance, advisory, persuasion, …)
(2) Recommendation perspective
- Serendipity (unexpected discovery): identify items from the Long Tail
- Items whose existence users did not know about
The theory of the Long Tail is that our culture and economy is increasingly shifting away from a focus on a relatively small number of "hits" (mainstream products and markets) at the head of the demand curve and toward a huge number of niches in the tail. As the costs of production and distribution fall, especially online, there is now less need to lump products and consumers into one-size-fits-all containers. In an era without the constraints of physical shelf space and other bottlenecks of distribution, narrowly-targeted goods and services can be as economically attractive as mainstream fare.
Source: http://www.longtail.com/the_long_tail/about.html
When does an RS do its job well?
"Recommend widely unknown items that users might actually like!"
20% of items accumulate 74% of all positive ratings (items rated > 3 in the MovieLens 100K dataset)
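A concentration figure of this kind can be computed from per-item counts of positive ratings. The sketch below uses ten made-up counts chosen so that the result matches the 74% quoted above; they are not taken from MovieLens, and the function name `head_share` is invented.

```python
def head_share(positive_counts, head_fraction=0.2):
    """Share of all positive ratings held by the most popular items."""
    ranked = sorted(positive_counts, reverse=True)
    head_size = max(1, int(len(ranked) * head_fraction))
    return sum(ranked[:head_size]) / sum(ranked)


# Hypothetical positive-rating counts for 10 items (not real MovieLens data).
positive_counts = [50, 24, 8, 6, 4, 3, 2, 1, 1, 1]

share = head_share(positive_counts)
print(f"Top 20% of items hold {share:.0%} of positive ratings")
```

The remaining 80% of items form the long tail, which is where recommendations of widely unknown items have to come from.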
Recommender System
1. User model (e.g. ratings, preferences, demographics, situational context)
2. Transactions (e.g. 1-5 star rating of a user for an item)
3. Items (with or without description of item characteristics)
Paradigms of Recommender Systems:
Recommender systems reduce information overload by estimating relevance
Hybrid: combinations of various inputs and/or composition of different mechanisms