7. Large-Scale Internet Data Analysis Techniques

2023-02-06 20:02:27 Source: Bilibili

7.1 Basic Concepts of Frequent Itemsets and Association Rules

What Is Frequent Pattern Analysis?

（1）An Application: Market Basket Analysis

（2）Goal: Identify items that are bought together by sufficiently many customers

（3）Approach: Process the sales data collected with barcode scanners to find dependencies among items

A classic rule: If someone buys diapers and milk, then he/she is likely to buy beer

Finding Association Rules

● Step 1: find all frequent itemsets

● Step 2: generate strong rules from the frequent itemsets

● Step 2 is easier, so most work on association rule mining focuses on Step 1

● A solution for Step 2:

● For each frequent itemset l, generate all non-empty proper subsets of l; for every such subset s, output the rule "s → (l - s)" if its confidence, support(l) / support(s), meets the minimum confidence threshold
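The two steps can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an optimized miner such as Apriori or FP-growth. The transactions, the thresholds, and the function names `frequent_itemsets` and `strong_rules` are all hypothetical and chosen only for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 1: return {itemset: support count} for every itemset meeting min_support."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count >= min_support:
                frequent[cset] = count
                found = True
        if not found:
            break  # no frequent itemset of this size, so no larger one can be frequent
    return frequent

def strong_rules(frequent, min_confidence):
    """Step 2: for each frequent itemset l and proper subset s, emit s -> (l - s)."""
    rules = []
    for l, l_count in frequent.items():
        for size in range(1, len(l)):
            for subset in combinations(sorted(l), size):
                s = frozenset(subset)
                confidence = l_count / frequent[s]  # every subset of l is frequent too
                if confidence >= min_confidence:
                    rules.append((set(s), set(l - s), confidence))
    return rules

# Hypothetical market-basket data, for illustration only.
transactions = [
    frozenset({"diaper", "milk", "beer"}),
    frozenset({"diaper", "milk", "beer"}),
    frozenset({"diaper", "milk"}),
    frozenset({"milk", "bread"}),
    frozenset({"beer", "bread"}),
]
frequent = frequent_itemsets(transactions, min_support=2)
rules = strong_rules(frequent, min_confidence=0.6)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

Step 1 dominates the cost because the number of candidate itemsets grows exponentially with the number of items, which is why the notes say most research targets it.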

7.2 Basic Concepts of Clustering

1. Cluster: A collection of data objects

- similar (or related) to one another within the same group

- dissimilar (or unrelated) to the objects in other groups

2. Clustering (or cluster analysis, data segmentation, ...)

- Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

3. Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)

4. Typical applications

- As a stand-alone tool to get insight into data distribution

- As a preprocessing step for other algorithms

Partitioning Algorithms: Basic Concept

1. Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances E = Σ_{i=1..k} Σ_{p∈Ci} d(p, ci)² is minimized (where ci is the centroid or medoid of cluster Ci)

2. Given k, find a partition of k clusters that optimizes the chosen partitioning criterion

（1）Global optimal: exhaustively enumerate all partitions

（2）Heuristic methods: k-means and k-medoids algorithms

（3）k-means (MacQueen'67, Lloyd'57/'82): Each cluster is represented by the center of the cluster

（4）k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
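The k-means heuristic can be sketched as follows. This is a minimal pure-Python version of the usual alternate-assign-and-update loop (often called Lloyd's algorithm), assuming squared Euclidean distance. The sample points, the seed, and the helper names are hypothetical, and a production system would more likely use a library implementation.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centers: k distinct data points
    for _ in range(max_iters):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous center.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # centers stopped moving, so assignments are stable
        centroids = updated
    clusters = assign(points, centroids)
    # Partitioning criterion: sum of squared distances to the cluster centers.
    sse = sum(squared_distance(p, centroids[i])
              for i, c in enumerate(clusters) for p in c)
    return centroids, clusters, sse

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters, sse = k_means(points, k=2)
print(centroids, sse)
```

The result is a local optimum that can depend on the initial centers, which is why the method is a heuristic rather than the exhaustive global search listed under （1）.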

What Is the Problem of the K-Means Method?

（1）The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data

（2）K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
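A small numeric sketch shows why the medoid is more robust than the mean. The one-dimensional values below are hypothetical, and `medoid` is an illustrative helper that picks the member with the smallest total absolute distance to the others.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    """Return the member of `values` with the smallest total distance to all members."""
    return min(values, key=lambda v: sum(abs(v - other) for other in values))

cluster = [1, 2, 3, 4, 5]        # hypothetical cluster
with_outlier = cluster + [100]   # the same cluster plus one extreme value

print(mean(cluster), medoid(cluster))
print(mean(with_outlier), medoid(with_outlier))
```

The single outlier drags the mean far away from every ordinary member, while the medoid remains one of the original central objects.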

7.3 Basic Concepts of Classification

Supervised vs. Unsupervised Learning

1. Supervised learning (classification)

（1）Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

（2）New data is classified based on the training set

2. Unsupervised learning (clustering)

（1）The class labels of training data are unknown

（2）Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification vs. Numeric Prediction

1. Classification

（1）Predicts categorical class labels (discrete or nominal)

（2）Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

2. Typical applications

（1）Credit/loan approval

（2）Medical diagnosis: if a tumor is cancerous or benign

（3）Fraud detection: if a transaction is fraudulent

（4）Web page categorization: which category it is

3. Numeric Prediction

Models continuous-valued functions, i.e., predicts unknown or missing values

Process (1): Model Construction

Classification Algorithms → Classifier (Model) → IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction

Testing Data → Classifier → Unseen Data (Jeff, Professor, 4)
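The learned rule can be applied to the unseen tuple with a few lines of Python. This is only a sketch: the function name `predict_tenured` is hypothetical, and returning "no" when the rule does not fire is an assumption, since the notes state only the "yes" branch.

```python
def predict_tenured(rank, years):
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: any tuple not covered by the rule is classified as "no".
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

unseen = ("Jeff", "Professor", 4)  # the unseen tuple from the notes
name, rank, years = unseen
prediction = predict_tenured(rank, years)
print(name, "tenured?", prediction)
```

Jeff has only 4 years, but the rank condition alone is enough for the rule to fire.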

Classification—A Two-Step Process

1. Model construction: describing a set of predetermined classes

（1）Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

（2）The set of tuples used for model construction is the training set

（3）The model is represented as classification rules, decision trees, or mathematical formulae

2. Model usage: for classifying future or unknown objects

（1）Estimate the accuracy of the model

（2）The known label of each test sample is compared with the classified result from the model

（3）Accuracy rate is the percentage of test set samples that are correctly classified by the model

（4）The test set is independent of the training set (otherwise the estimate is inflated by overfitting)

（5）If the accuracy is acceptable, use the model to classify new data

3. Note: If the test set is used to select models, it is called a validation (test) set
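The accuracy estimate in step 2 amounts to comparing predictions with known labels on held-out data. The sketch below assumes a rule-based model of the kind shown earlier in this section. The test tuples and the names `accuracy` and `rule_model` are hypothetical and exist only for this illustration.

```python
def accuracy(model, test_set):
    """Fraction of test samples whose predicted label equals the known label."""
    correct = sum(1 for features, label in test_set if model(*features) == label)
    return correct / len(test_set)

def rule_model(rank, years):
    # Hypothetical model: IF rank = 'professor' OR years > 6 THEN 'yes', else 'no'.
    return "yes" if rank == "professor" or years > 6 else "no"

# Hypothetical test set, kept separate from whatever data produced the rule.
test_set = [
    (("assistant prof", 2), "no"),
    (("associate prof", 7), "no"),   # the rule predicts "yes", so this one is wrong
    (("professor", 5), "yes"),
    (("assistant prof", 7), "yes"),
]
test_accuracy = accuracy(rule_model, test_set)
print(test_accuracy)
```

If the same samples had also been used to build the rule, the measured accuracy would say little about how the model behaves on new data.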

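7.4 Basic Concepts of Knowledge Graphs

General-Purpose Knowledge Graphs

（1）The knowledge graph proposed by Google is a general-purpose knowledge graph that covers all domains.

（2）General-purpose knowledge graphs are mainly used in Internet-facing scenarios such as search, recommendation, and question answering.

（3）A general-purpose knowledge graph emphasizes breadth and therefore focuses more on entities; it is difficult to build a complete, global ontology layer for unified management.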

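Projects Related to General-Purpose Knowledge Graphs

1. Linguistic:

（1）WordNet

（2）MIT - the Chinese portion of ConceptNet5

（3）Chinese Open Wordnet

2. Encyclopedic:

（1）DBpedia

（2）CN-DBpedia (a general-purpose Chinese encyclopedic knowledge graph)

（3）Zhishi.me

（4）PKU-PIE knowledge base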

Projects Related to Industry Knowledge Graphs

Industry Knowledge Graphs

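（1）An industry knowledge graph is a knowledge graph oriented toward a specific domain.

（2）The target users include personnel at all levels of the industry; different personnel perform different operations and work in different business scenarios, so a certain depth and completeness is required.

（3）Industry knowledge graphs demand very high accuracy and are typically used to support complex analytical applications or decision support.

（4）They have strict and rich data schemas; entities in an industry knowledge graph usually have many attributes, and those attributes carry industry-specific meaning.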

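Biomedicine: Watson-Assisted Diagnosis and Treatment

· MD Anderson Cancer Center partnered with IBM Watson on its mission to end cancer and has already invested USD 62.1 million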

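General-Purpose vs. Industry Knowledge Graphs

（1）Oriented toward the general domain / oriented toward one specific domain

（2）Mainly common-sense knowledge / built from industry data

（3）"Structured encyclopedic knowledge" / "an industry knowledge base built on semantic technologies"

（4）Emphasizes breadth of knowledge / emphasizes depth of knowledge

（5）Users are the general public / potential users are industry personnel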

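Key Technologies for Knowledge Applications

01 Semantic search 02 Intelligent question answering 03 Visualization-assisted decision making

Further Reading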

（1）Survey

（2）Knowledge Graph Construction Techniques.

（3）Review on Knowledge Graph Techniques

（4）Reviews on Knowledge Graph Research

（5）The Research Advances of Knowledge Graph

（6）A Survey on Knowledge Graphs: Representation, Acquisition and Applications (2020)

（7）Knowledge Graphs (2020)

7.5 Introduction to Web Information Retrieval

Information Retrieval (IR)

（1）The indexing and retrieval of textual documents.

（2）Searching for pages on the World Wide Web is the most recent "killer app."

（3）Concerned firstly with retrieving relevant documents to a query.

（4）Concerned secondly with retrieving from large sets of documents efficiently.

· Relevance is a subjective judgment and may include:

- Being on the proper subject.

- Being timely (recent information).

- Being authoritative (from a trusted source).

- Satisfying the goals of the user and his/her intended use of the information (information need).

Problems with Keywords

（1）May not retrieve relevant documents that include synonymous terms.

- "restaurant" vs. "café"

- "PRC" vs. "China"

（2）May retrieve irrelevant documents that include ambiguous terms.

- "bat" (baseball vs. mammal)

- "Apple" (company vs. fruit)

- "bit" (unit of data vs. act of eating)
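Both problems show up even in a tiny exact-match search. The sketch below is a deliberately naive keyword matcher, not how a real search engine works. The documents and the function name `keyword_search` are hypothetical.

```python
def keyword_search(query, documents):
    """Return ids of documents that contain every query term as an exact token."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in documents.items()
            if all(term in text.lower().split() for term in terms)]

# Hypothetical mini-corpus, for illustration only.
documents = {
    "d1": "a cozy restaurant near the station",
    "d2": "a cozy café near the station",
    "d3": "the bat flew out of the cave at dusk",
    "d4": "he swung the bat and hit a home run",
}

# Synonym problem: d2 is relevant to someone looking for a place to eat,
# but it never mentions the literal word "restaurant".
synonym_hits = keyword_search("restaurant", documents)

# Ambiguity problem: both senses of "bat" match, whichever one the user meant.
ambiguous_hits = keyword_search("bat", documents)

print(synonym_hits, ambiguous_hits)
```

Addressing either problem requires going beyond literal term matching, for example by expanding the query with related terms or by using context to pick the intended sense.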

Other IR-Related Tasks

（1）Automated document categorization

（2）Information filtering (spam filtering)

（3）Information routing

（4）Automated document clustering

（5）Recommending information or products

（6）Information extraction

（7）Information integration

（8）Question answering

History of IR

1. 1960-70's:

（1）Initial exploration of text retrieval systems for "small" corpora of scientific abstracts, and law and business documents.

（2）Development of the basic Boolean and vector-space models of retrieval.

（3）Prof. Salton and his students at Cornell University are the leading researchers in the area.

2. 1980's:

· Large document database systems, many run by companies:

- Lexis-Nexis, Dialog, MEDLINE

3. 1990's:

（1）Searching FTPable documents on the Internet

- Archie, WAIS

（2）Searching the World Wide Web

- Lycos, Yahoo, Altavista

4. 2000's:

（1）Link analysis for Web search: Google

（2）Automated information extraction: Whizbang, Fetch, Burning Glass

（3）Question answering

- TREC Q/A track

5. 2000's continued:

（1）Multimedia IR: image, video, audio and music

（2）Cross-language IR: DARPA Tides

（3）Document summarization

（3）Document Summarization

Related Areas

（1）Database Management

（2）Library and Information Science

（3）Artificial Intelligence

（4）Natural Language Processing

（5）Machine Learning

7.6 Introduction to Recommender Systems

Recommender systems are software agents that elicit the interests and preferences of individual consumers [...] and make recommendations accordingly.

They have the potential to support and improve the quality of the decisions consumers make while searching for and selecting products online.

[Xiao & Benbasat, MISQ, 2007]

Movie Recommendation

I like "Drama" movies

Here are the movies you may like ...

1. The Shawshank Redemption

2. Another Happy Day

3.…

Problem Domain

（1）Recommendation systems (RS) help to match users with items

- Ease information overload

- Sales assistance (guidance, advisory, persuasion, …)

（2）Recommendation perspective

- Serendipity (unexpected discovery): identify items from the Long Tail that users did not know existed

The theory of the Long Tail is that our culture and economy is increasingly shifting away from a focus on a relatively small number of "hits" (mainstream products and markets) at the head of the demand curve and toward a huge number of niches in the tail. As the costs of production and distribution fall, especially online, there is now less need to lump products and consumers into one-size-fits-all containers. In an era without the constraints of physical shelf space and other bottlenecks of distribution, narrowly-targeted goods and services can be as economically attractive as mainstream fare.

Source: http://www.longtail.com/the_long_tail/about.html

When does an RS do its job well?

"Recommend widely unknown items that users might actually like!"

20% of items accumulate 74% of all positive ratings (items rated > 3 in the MovieLens 100K dataset)
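A concentration figure of this kind can be computed by counting positive ratings per item and summing the counts of the most popular fraction. The sketch below uses a tiny hypothetical rating list rather than MovieLens, so it does not reproduce the 74% figure. The function name and the (user, item, rating) tuple layout are assumptions made for this example.

```python
from collections import Counter

def positive_share_of_top_items(ratings, top_fraction=0.2, threshold=3):
    """Share of positive ratings (rating > threshold) held by the most-liked items."""
    positives = Counter(item for _, item, rating in ratings if rating > threshold)
    all_items = {item for _, item, _ in ratings}
    top_n = max(1, round(len(all_items) * top_fraction))
    top_counts = [count for _, count in positives.most_common(top_n)]
    return sum(top_counts) / sum(positives.values())

# Hypothetical (user, item, rating) triples: item "A" is a mainstream hit.
ratings = [
    ("u1", "A", 5), ("u2", "A", 4), ("u3", "A", 5),
    ("u4", "A", 4), ("u5", "A", 5), ("u6", "A", 4),
    ("u7", "A", 3),                    # not positive: rating must exceed 3
    ("u1", "B", 4), ("u2", "B", 4),
    ("u3", "C", 5),
    ("u4", "D", 4),
    ("u5", "E", 2),
]
share = positive_share_of_top_items(ratings)
print(share)
```

A recommender that only surfaces the head of this distribution tells users what they already know; the items outside the top fraction are where serendipitous recommendations come from.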

Recommender System

1. User model (e.g. ratings, preferences, demographics, situational context)

2. Transactions