Loading...

An improved advertising CTR prediction approach based on the fuzzy deep neural network Zilong Jiang1,2, Shu Gao1, Mingjiang Li2* 1 School of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei Province, China, 2 School of Computer and Information, Qiannan Normal University for Nationalities, Duyun, Guizhou Province, China

a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

OPEN ACCESS Citation: Jiang Z, Gao S, Li M (2018) An improved advertising CTR prediction approach based on the fuzzy deep neural network. PLoS ONE 13(5): e0190831. http://doi.org/10.1371/journal. pone.0190831 Editor: Yong Deng, Southwest University, CHINA

* [email protected]

Abstract Combining a deep neural network with fuzzy theory, this paper proposes an advertising click-through rate (CTR) prediction approach based on a fuzzy deep neural network (FDNN). In this approach, fuzzy Gaussian-Bernoulli restricted Boltzmann machine (FGBRBM) is first applied to input raw data from advertising datasets. Next, fuzzy restricted Boltzmann machine (FRBM) is used to construct the fuzzy deep belief network (FDBN) with the unsupervised method layer by layer. Finally, fuzzy logistic regression (FLR) is utilized for modeling the CTR. The experimental results show that the proposed FDNN model outperforms several baseline models in terms of both data representation capability and robustness in advertising click log datasets with noise.

Received: May 7, 2017 Accepted: December 20, 2017 Published: May 4, 2018 Copyright: © 2018 Jiang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability Statement: Data are available: kaggle-display-advertising-challenge-dataset (http://doi.org/10.6084/m9.figshare.5732310.v1); iPinYou Global RTB Bidding Algorithm Competition Dataset season2 (http://doi.org/10.6084/m9. figshare.5732328.v1). Funding: This work is supported by the Natural Science Research Project of Education Department of Guizhou Province of China (No.KY2017350), the Project Funds of the Digital Engineering Research Center for Local National Culture of Universities in Guizhou (No.KY2013227) and the Scientific Research Project of Qiannan Normal University for

Introduction Internet advertising is regarded as an effective advertising communication approach due to its strong targeted communication ability. Therefore, increasing numbers of researchers from industry and academia have investigated internet advertising, and it has become an important source of income for internet companies. The cost per click (CPC) model [1] is one of the most common payment models in internet advertising. Approximately 66% of advertising transactions depend on the CPC because the CPC can more accurately reveal the conversion rate compared with the other models [2]. In the CPC model, the click-through rate (CTR) is a significant index for measuring the effect of advertisement placement. To address this task, Chapelle O. et al. propose a machine learning framework that uses Maximum Entropy to implement a logistic regression model and can address billions of samples and hundreds of millions of parameters. A two-phase feature selection algorithm is provided to reduce the need for domain expertise: a generalized mutual information method is used to select the feature groups that are used in the model; next, feature hashing is used to regulate the size of the models [3]. McMahan et al. adopted a logistic regression model to solve advertising CTR problems for Google. Using multiple characteristics, including user

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

1 / 24

Advertising CTR prediction based on FDNN

Nationalities (No.qnsy2017006). The funder makes contributions on analysis, decision to publish. Competing interests: The authors have declared that no competing interests exist.

information, search keywords, advertising data, and relative metadata, with advertisements as the input of the model, these researchers proposed an online sparse learning algorithm to the train model [4]. These above methods are recently used to address CTR prediction through the logistic regression (LR) model in industry. These methods can appropriately address a large number of features in real advertising system in industry, and can be efficiently parallelized in a distributed computing system. However, they cannot learn complex mapping relations from data because of their limited data representation capability. T. A. Anh-Phuong presented an online learning algorithm for click-through rate prediction, known as Follow The Regularized Factorized Leader (FTRFL), which combines the Follow-The-Regularized-Leader Proximal (FTRL-P) algorithm with per-coordinate learning rates to obtain factorization machines. These researchers attempt to obtain both the sparsity provided by FTRL-Proximal and the ability to estimate higher-order features based on FM [5]. Z. Pan et al. proposed a Sparse Factorization Machine (SFM) model to solve the problem of the sparsity of the ad transaction dataset. In this model, the Laplace distribution is introduced to model the parameters because it can better fit the sparse data, which has a high ratio of zero elements. Furthermore, these researchers parallelized the SFM model on the Spark platform to support CTR prediction in a real advertising system [2]. These above methods are recently adopted to address CTR prediction through Factorization Machines (FMs) in academia. These methods can alleviate this problem of feature learning by compressing the sparse features into a dense vector space and using the second-order feature interactions. However, when facing a dataset with complex relations, they cannot sufficiently capture the higher-order non-linear representative features and complex inherent mapping relations among the users, contexts and ads in the advertising datasets. A. I. Baqapuri and I. Trofimov adopted Artificial Neural Networks (ANNs) to predict the advertising CTR and obtained a better effect than that of the logistic regression model [6]. Dave and other researchers proposed a model that inherits the click information of rare/new ads from frequent ads that are semantically related to calculate the CTR values for newly created ads or rare ads that do not have sufficient historical information. The semantic features are derived from the search ad click-through graphs and advertiser account information. Next, gradient boosted decision trees (GBDTs) is adopted to learn from these similarity features. Experiments demonstrate that this learned model, using these features, obtains good CTR prediction performance for new ads [7]. Y Juan et al. first used trees in GBDT to generate abstract nonlinear features and then combined the original features as input features, which were fed into the FM model for the advertising prediction [8]. These above methods are recently used to address CTR prediction through models ensemble technique, and they give state-of-the-art ideas for industry and academia. Zhang and other researchers used recurrent neural networks (RNNs) to predict the CTR of sponsored search advertising. Using back-propagation through time (BPTT) to train their model, these researchers obtained more accurate results for advertising CTR prediction than neural networks and logistic regression models [9]. Yu adopted the improved regression neural network to predict advertising CTR by using long short-term memory (LSTM) to modify RNN. This method effectively prevents the explosion or vanishing of the gradient [10]. Jiang et al. adopted the DBNLR model, which integrates a deep belief network (DBN) with logistic regression (LR) to address the problem of CTR prediction. A DBN stacked of RBMs is used to obtain abstract features from original advertisement datasets, and then a regression model is adopted to calculate the CTR prediction value [11]. Junxuan Chen et al. proposed a novel deep neural network that contains convolution layers to extract representative visual features, and fully connected layers learn the complex and effective nonlinear features based on the basic visual features. Finally, these features are fed into a logistic regression model to predict the

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

2 / 24

Advertising CTR prediction based on FDNN

CTR value [12]. These above methods are state-of-the-art solutions which are adopted to address CTR prediction task through deep learning technology. In these above methods, a deep architecture model is usually used to automatically extract the features of advertising click log data, and a regression model is subsequently trained to model the advertising CTR. Because a deep architecture is used to learn the features, there is no need for these methods to depend on prior knowledge or human labor; they can effectively fit complex nonlinear mapping relations in advertising datasets. However, when the dataset contains noisy data such as missing values and outliers, these methods may exhibit performance degradation. A common strategy for dealing with this problem is to identify and delete records with outliers during data preprocessing. However, deleting entire records results in the loss of much valuable and clean information. The other common strategy is to replace missing values and outliers with “0” or “NULL”; however, the records with outliers are preserved in the training set, which also affects the accuracy of CTR prediction. Uncertainties not only exist in the data themselves but occur at each phase of big data processing. For instance, the collected data may be created by faulty sensors or provided by not fully informed customers; the outputs of specific artificial intelligent algorithms also contain uncertainties. In these cases, fuzzy set techniques could be one of the most efficient tools to handle various types of uncertainties [13]. Because the parameters of these above deep architecture models are constants, and the learning processes of the parameters are constrained in a relatively small space, these methods have insufficient representation capability and relatively weak robustness when the training data have been interrupted by uncertainties. Meanwhile, fuzzy set techniques would be more efficient if they are used associated with other decision making techniques, such as probability, neural networks, etc., because each type of techniques exhibit their own strengths of representing and handling information granularity [13]. Due to the lack of a comprehensive understanding of the advertising datasets with noises, it is often difficult for these methods to ensure that the extracted features capture the optimal information for predicting the CTR. These deficiencies affect the accuracy and fitness of these methods. According to the advantages of fuzzy set techniques, this paper proposes a novel CTR prediction method based on the fuzzy deep neural network (FDNN) to address advertising datasets with noise. The main contributions of this paper can be summarized as follows: 1. This paper proposes an FDNN model in which FDBN is used to automatically extract abstract and complicated features from raw advertising datasets without any artificial intervention or prior knowledge and subsequently uses fuzzy logistic regression to model the CTR. 2. Fuzzy set techniques are introduced into the Gaussian-Bernoulli Restricted Boltzmann Machine (GBRBM), Restricted Boltzmann Machine (RBM) and logistic regression models to construct basic components of FDNN, and corresponding learning algorithms are presented in detail. 3. We conduct extensive experiments on real-world datasets to demonstrate the effectiveness and efficiency of the proposed FDNN model. The impacts of several baseline models are also discussed. Experiments results show the superiority of the proposed method. On the first dataset, the results show that the proposed method achieves competitive performances compared with the state-of-the-art models in terms of both data representation capability and robustness.

Materials and methods Relative theories of fuzzy numbers and fuzzy functions Fuzzy number. Definition 1. Triangular Fuzzy Number

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

3 / 24

Advertising CTR prediction based on FDNN

A fuzzy number y (fuzzy set) on a real domain can be defined as a triangular fuzzy number if its membership function my ðxÞ : R ! ½0; 1 is equal to [14], and it is shown in S1 Fig. 8 x l ; x 2 ½l; m > m l m l > > > > < my ðxÞ ¼ yðxÞ ¼ mx u mu u ; x 2 ½m; u ; ð1Þ > > > > > : 0; otherwise where l m u, l and u stand for the lower and upper values of the support of fuzzy number y (fuzzy set), respectively; and m is the most likely value (middle value), that is, when x = m, x belongs to y. The triangular fuzzy number can be denoted by (l, m, u). The support of y is the set of elements {x 2 R|l < x < u}. When l = m = u, it is a non-fuzzy number. Definition 2. α-cuts of a fuzzy set The α-cut of fuzzy set y, represented by y½a, is defined as y½a ¼ fx 2 Ojy ag ¼ ½yL ; yR , where 0 < α 1, θL is the lower value of y½a and θR is the upper value of y½a [15, 16]. Fuzzy function. Definition 3. Fuzzy Function Extended from a real-valued function f: Y = f(x, θ), the fuzzy function f can be defined as [16] Y ¼ f ðx; θÞ;

ð2Þ

where θ and θ are parameters of functions f and f [16, 17], Y is the fuzzy output set. The membership function Y ðyÞ can be expressed as Y ðyÞ ¼ supy fminðy 1 ðy1 Þ; :::; y n ðyn ÞÞjf ðx; θÞ ¼ yg;

ð3Þ

T

where θ = (θ1, . . ., θn)T, and θ ¼ ðy 1 ; :::; y n Þ Property 1. Real Domain Interval Arithmetic For two real domain intervals [a, b] and [c, d], the interval arithmetic can be defined as [16, 18] [a, b] + [c, d] = [a + c, b + d]; [a, b] − [c, d] = [a − c, b − d]; [a, b] × [c, d] = [min(a × c, a × d, b × c, b × d), max(a × c, a × d, b × c, b × d)]; [a, b] [c, d] = [min(a c, a d, b c, b d), max(a c, a d, b c, b d)]. Definition 4. α-Cuts of a Fuzzy Function α-Cuts of Y : For a continuous function f, the α-Cuts of Y , namely Y ½a ¼ ½Y 1 ½a; Y 2 ½a, can be expressed as [16] ( Y ½a ¼ minfYðθ; xÞjθ 2 θ½ag 1 :

ð4Þ

Y 2 ½a ¼ maxfYðθ; xÞjθ 2 θ½ag It is infeasible to calculate the membership function Y ðyÞ using formulas (2) and (3) because this requires maximization and minimization of the original function. This paper adopts α-cuts and interval arithmetic to solve this problem. Y ½a ¼ f ðx; θ½aÞ

ð5Þ

The membership function Y ðyÞ of a fuzzy function can be obtained by interval arithmetic because intervals θ½a are easy to calculate. However, when f is very complex, interval

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

4 / 24

Advertising CTR prediction based on FDNN

arithmetic will be NP-hard; therefore, this paper introduces a defuzzification approach to handle this problem. Defuzzification of the fuzzy function. The centroid approximate solution [19] is employed to defuzzify the fuzzy function Y ¼ f ðx; θÞ in this paper. The centroid of fuzzy function f ðx; θÞ can be denoted by fc ðx; θÞ [19]: R θf ðx; θÞdθ fc ðx; θÞ ¼ R ; θ 2 θ: ð6Þ f ðx; θÞdθ However, directly calculating formula (6) is difficult because it involves integrals. The centroid can be approximated through many α-cuts of the fuzzy function in discrete form. θ is a vector of fuzzy numbers and an α-cut of θ can be denoted as θ½a ¼ ½θL ; θR , where θL and θR are lower and upper bounds, respectively, of the interval with respect to α. When x is nonnegative, f ðx; θÞ is a monotonically decreasing function with respect to parameters θ. Thus, according to the interval arithmetic principle, definition 3 and formula (5), the α-cut of f ðx; θÞ is given by f ðx; θÞ½a ¼ f ðx; θ½aÞ ¼ ½f ðx; θL Þ; f ðx; θR Þ: The approximate centroid of fuzzy function f ðx; θÞ can be obtained by [20] PM a ½f ðx; θ Þ; f ðx; θiR Þ fc ðx; θÞ i¼1 i PMiL ; 2 i¼1 ai

ð7Þ

ð8Þ

where α = (α1, . . ., αN), α 2 [0, 1]N and θ½ai ¼ ½θiL ; θiR . Although all the α-cutsare bounded intervals, this paper takes only the special case where αi = 1 into consideration. Let θ = [θL, θR]. Formula (8) can be expressed as [20] 1 fc ðx; θÞ ½f ðx; θL Þ þ f ðx; θR Þ: 2

ð9Þ

Therefore, fuzzy function Y can be expressed as 1 Y ¼ f ðx; θÞ ¼ fc ðx; θÞ ½f ðx; θL Þ þ f ðx; θR Þ: 2

ð10Þ

Advertising CTR prediction based on the fuzzy deep neural network Basic fuzzy components and corresponding learning algorithms. Inspired by the idea of above fuzzy theory and methods [16–20], we first provide a detailed introduction and describe the corresponding learning algorithms of several basic components that have been modified using fuzzy technology. A. FRBM and its Learning Algorithm Fuzzy restricted Boltzmann machine (FRBM) is a symmetric neural network with binary nodes that is based on an energy model. It contains a set of visual binary nodes v 2 {0, 1}D and another set of hidden binary nodes h 2 {0, 1}F. There are no connections between different nodes of the same hidden layer or between different nodes of the same visual layer. Fuzzy parameters are employed to govern the FRBM model. The structure of FRBM is described in S2 Fig. The fuzzy energy function of FRBM can be expressed as [20] Eðx; h; θÞ ¼

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

T

b x

cTh

hT W x:

ð11Þ

5 / 24

Advertising CTR prediction based on FDNN

In this formula, θ ¼ fb; c; W g, b and c are the offsets, and W is the connection weight between the ith visible node and the jth hidden node. P ~ According to the free energy function Fðx; θÞ ¼ log ~h e Eðx;h;θÞ , the fuzzy free energy function F is expressed as X

Fðx; θÞ ¼

log

e

Eðx;~ h;θ Þ

:

ð12Þ

~ h

If Fðx; θÞ is used directly to define the probability, it is difficult to calculate the fuzzy probability, fuzzy maximum likelihood and fuzzy objective function in the optimization. Therefore, Fðx; θÞ needs to be defuzzifized to transform the optimization into regular maximum-likelihood optimization. This paper adopts a centroid method to defuzzify Fðx; θÞ. The centroid of Fðx; θÞ is expressed as Fc ðx; θÞ, and the probability can be defined as Pc ðx; θÞ ¼

e

X

Fc ðx;θ Þ

Z

e

;Z ¼

Fc ð~ x;θ Þ

:

ð13Þ

~ x

This paper selects the negative log-likelihood as the objective function of FRBM: X Lðθ; DÞ ¼

logPc ðx; θÞ:

ð14Þ

x2D

In the formula, D denotes the training dataset. The learning algorithm of FRBM finds the parameters θ that minimize the objective function Lðθ; DÞ (minLθ ðθ; DÞ). The paper adopts the centroid approximation method to defuzzify Fðx; θÞ: 1 Fðx; θÞ ¼ Fc ðx; θÞ ½Fðx; θL Þ þ Fðx; θR Þ: 2

ð15Þ

The gradients of Lðθ; DÞ with respect to θL can be expressed as @logPc ðx; θÞ @Fc ðx; θL Þ ¼ @θL @θL

Ep

@Fc ðx; θL Þ : @θL

ð16Þ

In the formula, Ep() is the expectation over the target probability distribution P. Correspondingly, @logPc ðx; θÞ @Fc ðx; θR Þ ¼ @θR @θR

Ep

@Fc ðx; θR Þ : @θR

ð17Þ

Because it is difficult to calculate the expectation Ep @[email protected]θðx;θÞ , an approximation approach is employed. After defuzzifying the objective function, Gibbs sampling [21] is adopted to sample from these conditional distributions. For FRBM, the fuzzy conditional probabilities can be expressed as Pðhj ¼ 1jxÞ ¼ sðc j þ W j xÞ and Pðxi ¼ 1jhÞ ¼ sðb i þ W Ti hÞ. The α-cuts of the fuzzy conditional probabilities are described as Pðhj ¼ 1jxÞ½a ¼ ½PL ðhj ¼ 1jxÞ; PR ðhj ¼ 1jxÞ and Pðxi ¼ 1jhÞ½a ¼ ½PL ðxi ¼ 1jhÞ; PR ðxi ¼ 1jhÞ.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

6 / 24

Advertising CTR prediction based on FDNN

In these formulas, PL(hj|x), PR(hj|x), PL(xi|h) and PR(xi|h) are the conditional probabilities with respect to the lower bounds and upper bounds of the parameters. Consequently, PL ðhj jxÞ ¼ Pðhj jx; θL Þ ¼ sðcLj þ WjL xÞ; PR ðhj jxÞ ¼ Pðhj jx; θR Þ ¼ sðcRj þ WjR xÞ;

ð18Þ

and PL ðxi jhÞ ¼ Pðxi jh; θL Þ ¼ sðbLi þ WiL hÞ;

ð19Þ

PR ðxi jhÞ ¼ Pðxi jh; θR Þ ¼ sðbRi þ WiR hÞ: In these formulas, WijL and WijR are the lower bound and upper bound of the connection weight, bLi and bRi are the lower bound and upper bound of the visible bias, and cLj and cRj are the lower bound and upper bound of the hidden bias [20]. According to (15), (16), (17) and the expectation estimation method of [20, 22], the gradients of Lðθ; DÞ with respect to the fuzzy parameters of FRBM can be obtained as follows: @logPc ðxÞ ¼ Ep ½PL ðhj jxÞ xiL @WijL @logPc ðxÞ ¼ Ep ½PL ðhj jxÞ @cLj @logPc ðxÞ ¼ Ep ½PL ðxi jhÞ @bLi @logPc ðxÞ ¼ Ep ½PR ðhj jxÞ xiR @WijR @logPc ðxÞ ¼ Ep ½PR ðhj jxÞ @cRj @logPc ðxÞ ¼ Ep ½PR ðxi jhÞ @bRi

PL ðhj jxÞ xiL ;

PL ðhj jxÞ;

xiL ;

PR ðhj jxÞ xiR ;

PR ðhj jxÞ;

xiR ;

where Pc(x) is the centroid probability. The CD1 algorithm [22] is employed to obtain the updating rules for parameters (θL and θR) to approximate the expectation Ep @[email protected]θðx;θÞ [20, 22]. The above description is summarized as Algorithm 1. Algorithm 1 The learning algorithm of FRBM Input: x(0) is an input sample of the training set; ε is the learning rate; WL and WR are the lower and upper bounds of connection weight matrices of the visible and hidden layers; bL and bR are the lower and upper bounds of bias vectors of the visible nodes; cL and cR are the lower and upper bounds of bias vectors of the hidden nodes. Output: The latest parameters of FRBM: WL, WR, bL, bR, cL, cR.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

7 / 24

Advertising CTR prediction based on FDNN

BEGIN 1. For all hidden nodes j do obtain PL ðhLð0Þ ¼ 1jxð0Þ Þ and PR ðhRð0Þ ¼ 1jxð0Þ Þ by means of formula (18); j j sample hLð0Þ 2 f0; 1g from PL ðhLð0Þ jxð0Þ Þ; j j sample hRð0Þ 2 f0; 1g from PR ðhRð0Þ jxð0Þ Þ; j j end for 2. For all visible nodes i do gain PL ðxiLð1Þ ¼ 1jhLð0Þ Þ and PR ðxiRð1Þ ¼ 1jhRð0Þ Þ by means of formula (19); sample xiLð1Þ 2 f0; 1g from PL ðxiLð1Þ jhLð0Þ Þ; sample xiRð1Þ 2 f0; 1g from PR ðxiRð1Þ jhRð0Þ Þ; end for 3. For all hidden nodes j do obtain PL ðhLð1Þ ¼ 1jxLð1Þ Þ and PR ðhRð1Þ ¼ 1jxRð1Þ Þ by means of formula (18); j j sample hLð1Þ 2 f0; 1g from PL ðhLð1Þ jxLð1Þ Þ; j j sample hRð1Þ 2 f0; 1g from PR ðhRð1Þ jxRð1Þ Þ; j j end for 4. The parameters can be updated according to the following rules: WL = WL + ε(x(0) PL(hL(0) = 1|x(0)) − xL(1) PL(hL(1) = 1|xL(1))); bL = bL + ε(x(0) − xL(1)) cL = cL + ε(PL(hL(0) = 1|x(0) − PL(hL(1) = 1|xL(1))); WR = WR + ε(x(0) PR(hR(0) = 1|x(0)) − xR(1) PR(hR(1) = 1|xR(1))); bR = bR + ε(x(0) − xR(1)); cR = cR + ε(PR(hR(0) = 1|x(0) − PR(hR(1) = 1|xR(1))); 5. return WL, WR, bL, bR, cL, cR. END

B. FGBRBM and Its Learning Algorithm The nodes of the visible and hidden layers in the fuzzy Gaussian-Bernoulli restricted Boltzmann machine (FGBRBM) model correspond to the Gaussian-Bernoulli nodes. In other words, the nodes of visible layer v 2 RD are Gaussian nodes and the nodes of the first hidden layer h 2 {0, 1}F are Bernoulli nodes (binary nodes) [23]. Their structures are described in S3 Fig. The fuzzy energy function of FGBRBM can be defined as Eðv; h; θÞ ¼

1 v σ

T

T

1 T 1 v 2 σ

Wh

1 b σ

2

T

T

b h;

ð20Þ

where σi denotes the standard deviation of vi in the ith dimension of the Gaussian visual node. Correspondingly, as for FGBRBM, the fuzzy conditional probability distribution for obtaining nodes of the hidden layer through nodes of the visual layer can be described as: Pðhj ¼ 1jvÞ ¼ g c j þ W i

! Xv i ¼ g cj þ W ; si ij i

ð21Þ

where g() denotes the sigmoid function. The formula for constructing the nodes of the visual layer through nodes of the hidden layer can be expressed as F X

vi ¼ Nðb i þ si

W ij hj ; s2i Þ:

ð22Þ

j¼1

In this formula, N(μ, σ2) is a Gaussian probability density function, where μ is the mean value and σ2 denotes the variance.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

8 / 24

Advertising CTR prediction based on FDNN

The α-cuts of fuzzy conditional probabilities can be expressed as Pðhj ¼ 1jvÞ½a ¼ PF ½PL ðhj ¼ 1jvÞ; PR ðhj ¼ 1jvÞ, v i ½a ¼ Nðb i þ si j¼1 W ij hj ; s2i Þ½a ¼ ½viL ; viR , where PL(hj|v) and PR(hj|v) are the conditional probabilities with respect to the lower and upper bounds of the parameters governing the FGBRBM, and viL and viR are the probabilities of the normal distribution. Consequently, PL ðhj jvÞ ¼ Pðhj jv; θL Þ ¼ sðcLj þ WjL vÞ; PR ðhj jvÞ ¼ Pðhj jv; θR Þ ¼ sðcRi þ WjR vÞ;

ð23Þ

and viL ¼ N bLi þ si

F X

! WijL hj ; s2i ;

j¼1

viR ¼ N bRi þ si

F X

!

ð24Þ

WijR hj ; s2i ;

j¼1

where WijL and WijR are the lower and upper bounds of the connection weight, respectively; bLi and bRi are the lower and upper bounds of visible node bias, respectively; and cLj and cRj are the lower and upper bounds of the hidden node bias, respectively. The CD-1 algorithm can also be used for learning parameters of the fuzzy Gaussian-Bernoulli RBM model. In the process of training, the input data of the whole training set should first be normalized so that each dimension of the input data in the FGBRBM model obeys the normal distribution, i.e., the mean value is 0 and the variance is 1. When the formula Pðvi ¼ 1jh; θÞ is calculated, σ = 1. In the process of reconstructing the nodes of the visual layer, this paper does not adopt binary data instead of the probabilities of the normal distribution. The above description is summarized as Algorithm 2. Algorithm 2 The learning algorithm of FGBRBM Input: x(0) is a sample of the training dataset; ε is the learning rate; WL and WR are the lower and upper bounds of connection weight matrices of the visible and hidden layers; bL and bR are the lower and upper bounds of bias vectors of the visible nodes; cL and cR are the lower and upper bounds of bias vectors of the hidden nodes. Output: The latest parameters of FGBRBM: WL, WR, bL, bR, cL, cR. BEGIN 1. For all hidden nodes j do obtain PL ðhLð0Þ ¼ 1jvð0Þ Þ and PR ðhRð0Þ ¼ 1jvð0Þ Þ by means of formula (23); j j sample hLð0Þ 2 f0; 1g from PL ðhLð0Þ jvð0Þ Þ; j j sample hRð0Þ 2 f0; 1g from PR ðhRð0Þ jvð0Þ Þ; j j end for 2. For all visible nodes i do obtain viL and viR by means of formula (24); PF viLð1Þ ¼ NðbLi þ j¼1 WijL hLð0Þ ; 1Þ; j PF Rð1Þ R R Rð0Þ vi ¼ Nðbi þ j¼1 Wij hj ; 1Þ; end for 3. For all hidden nodes j do obtain PL ðhLð1Þ ¼ 1jvLð1Þ Þ and PR ðhRð1Þ ¼ 1jvRð1Þ Þ by means of formula (23); j j

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

9 / 24

Advertising CTR prediction based on FDNN

sample hLð1Þ 2 f0; 1g from PL ðhLð1Þ jvLð1Þ Þ; j j sample hRð1Þ 2 f0; 1g from PR ðhRð1Þ jvRð1Þ Þ; j j end for 4. The parameters can be updated according to the following rules: WL = WL + ε(v(0) PL(hL(0) = 1|v(0)) − vL(1) PL(hL(1) = 1|vL(1))); bL = bL + ε(v(0) − vL(1)) cL = cL + ε(PL(hL(0) = 1|v(0) − PL(hL(1) = 1|vL(1))); WR = WR + ε(v(0) PR(hR(0) = 1|v(0)) − vR(1) PR(hR(1) = 1|vR(1))); bR = bR + ε(v(0) − vR(1)) cR = cR + ε(PR(hR(0) = 1|v(0) − PR(hR(1) = 1|vR(1))); 5. return WL, WR, bL, bR, cL, cR. END

C. FLR and Its Learning Algorithm Used for Modeling the Click Through Rate The logistic regression (LR) model [24] is a classic model in the field of advertising click-rate prediction. The LR model is described in S4 Fig. The activation function of the node in the output layer is a sigmoid function [25]. The output value of the node (value of the activation function) hθ(x0 ) is the probability that a user will click on the advertisement; it can also be described as P(y = 1|x0 ; θ0 ). When a sample (x0d ; yd ) in the training set D ¼ ðx01 ; y1 Þ; ðx02 ; y2 Þ; :::; ðx0N ; yN Þ is given, the probability value of advertising click rate prediction can be obtained in terms of logistic regression and can be described by formula (25)

pðclickja; u; cÞ ¼ pðyd ¼ 1jx0d ; θ0 Þ ¼

Correspondingly, pðy ¼ 0jx0 ; θ0 Þ ¼ 1

1 1þe

θ0T x0 d

1 1þe

θ0T x0

:

ð25Þ

d

.

0T

In this formula, θ denotes the vector of parameters of the standardized logistic regression model, a refers to the features of the advertisement, u denotes the features of users, and c represents the features of the context environment. These three features compose a vector, which 0 can be denoted as xd . This vector is fed into the logistic regression model for CTR prediction. 0 For a single sample (xd ; yd ), the probability of correctly predicting the advertising clicky

through rate can be described as Pðyd jx0d ; θ0 Þ ¼ hθ0 ðx0d Þ d ð1 the likelihood function as the objective function:

Jðx0d ; θ0 Þ ¼

hθ0 ðx0d ÞÞ

ð1 yd Þ

. This paper selects

1 logðPðYD jX 0D ; θ0 ÞÞ N

¼

N Y 1 y log ðhθ0 ðx0d Þ d ð1 N d¼1

¼

N 1X ðy loghθ0 ðx0d Þ þ ð1 N d¼1 d

! hθ0 ðx0d ÞÞ

1 yd

yd Þlogð1

Þ

ð26Þ

hθ0 ðx0d ÞÞÞ:

In formula (26), the click label yd denotes the expected output and hθ0 ðx0d Þ is the real output value of the neuron node in the output layer (output of the activation function); hy0 ðx0d Þ can P be described as hθ0 ðx0d Þ ¼ gðzÞ, where z ¼ i ðwi xi þ bÞ ¼ θ0T x0d and g(x) denotes the sigmoid function.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

10 / 24

Advertising CTR prediction based on FDNN

As for the fuzzy logistic regression model, the objective function can be described as J ðx0d ; θ0 Þ

¼

1 logðPðYD jX 0D ; θ0 ÞÞ N

¼

N Y 1 y log ðh θ0 ðx0d Þ d ð1 N d¼1

¼

N 1X ðy logh θ0 ðx0d Þ þ ð1 N d¼1 d

! 0 d

h θ0 ðx ÞÞ

1 yd

yd Þlogð1

Þ

ð27Þ

h θ0 ðx0d ÞÞÞ;

T P where h θ0 ðx0d Þ ¼ g ðz Þ, z ¼ i ðwi xi þ bÞ ¼ θ0 x0d and gðxÞ ¼ 1þe1 x . According to formulas (5) and (7), the gradient of the objective function of FLR with

respect to fuzzy parameter θ0 can be expressed as @J ðx0d ; θ0 Þ @θ0 L

¼

@Jc ðx0d ; θ0 Þ @θ0 L 0 N 1 X 1 @hθ0L ðxd Þ ðyd N d¼1 hθ0 ðx0d Þ @θ0L

¼

ð1

yd Þ

L

N 1 X ¼ ðh 0 ðx0 Þ N d¼1 θL d N X

¼C

ðhθ0 ðx0d Þ L

d¼1

@J ðx0d ; θ0 Þ @θ0 R

¼

1

@hθ0 ðx0d Þ 1 L Þ hθ0 ðx0d Þ @θ0L L

ð28Þ

yd Þx0d

yd Þx0d ;

@Jc ðx0d ; θ0 Þ @θ0 R 0 N 1 X 1 @hθ0R ðxd Þ ðyd N d¼1 hθ0 ðx0d Þ @θ0R

¼

ð1

R

N 1 X ¼ ðh 0 ðx0 Þ N d¼1 θR d N X

¼C d¼1

ðhθ0 ðx0d Þ R

yd Þ

1

@hθ0 ðx0d Þ 1 R Þ hθ0 ðx0d Þ @θ0R R

ð29Þ

yd Þx0d

yd Þx0d :

The above description is summarized as Algorithm 3. Algorithm 3 The learning algorithm of FLR Input: ðx0d ; yd Þ is a sample of training sets; ε is the learning rate; θ0L ¼ fW L ; cL g and θ0R ¼ fW R ; cR g, where WL and WR are the lower and upper bounds of connection weight matrices of FLR, and cL and cR are the lower and upper bounds of bias vectors of the output node of FLR; Output: The latest parameters of FLR: θ0L ¼ fW L ; cL g, θ0R ¼ fW R ; cR g. BEGIN 1. obtain

@Jc ðθ0 Þ @θ0L

by means of formula (28) based on the input data x0d ;

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

11 / 24

Advertising CTR prediction based on FDNN

2. obtain

@Jc ðθ0 Þ @θ0R

by means of formula (29) based on the input data x0d ;

3. The parameters can be updated according to the following rules: 0

c ðθ Þ θ0L ¼ θ0L þ ε @[email protected]θ 0 ; L

0

c ðθ Þ θ0R ¼ θ0R þ ε @[email protected]θ 0 ; R

4. return θ0L ; θ0R . END

Advertising CTR prediction model FDNN. A. Construction of Advertising CTR Prediction Model FDNN This section describes the proposed fuzzy deep neural network (FDNN) model for advertising CTR prediction and its learning algorithm in detail. The architecture of the method and its process of predicting CTR are illustrated in S5 Fig. First, the fuzzy deep belief network (FDBN) constituted by FRBM and GBRBM is used to automatically learn the discriminative features of raw input data from an advertising dataset, and these features are able to capture internal explanatory information that is hidden in the raw data, amplify the important information for discrimination and weaken irrelevant information [26]. Next, FLR is adopted to model the CTR prediction and map the predicted CTR value to the range between 0 and 1. The raw input data of advertising CTR prediction are the fields that include the corresponding advertisement features, users’ features, and context features, which are extracted from the advertisement click log. These fields contain not only numerical data but also multi-category data; therefore, the raw input data of advertising CTR prediction are composed of vectors with different data types. Because FRBM only models input with binary variance (0, 1), this paper adopts fuzzy Gaussian-Bernoulli RBM (FGBRBM) to address the input data of advertising CTR prediction. Then, FRBM can be used to construct FDBN through layer-by-layer stacking; the detailed steps can be found in [22]. In real applications, advertising click rate prediction refers to binary classification or regression; therefore, this paper uses fuzzy logistic regression to model user behaviors in clicking advertisements. After the input data vector x = (a, u, c) = (x1, x2, . . ., xn) is extracted by FDBN, 0 the vector x , which is composed of the binary values, in the last hidden layer of FDBN is fed into fuzzy the logistic regression model for CTR prediction. The advantage of the proposed method is that fuzzy set techniques are introduced, and the uncertainties in the relationships between nodes located in adjacent layers is taken into consideration. Nodes in adjacent layers of FDNN often interact in uncertain ways. Because the parameters that represent the relationships between nodes in adjacent layers are fuzzy numbers and the learning process of fuzzy parameters is extended to a relatively wider space, this advantage will be reflected in the fitness of the joint probability distribution. Combined with the merits of deep learning, the proposed method demonstrates superior performance in coping with data with noises. B. Construction and Training of the Advertising CTR Prediction Model FDNN The process of constructing FDNN is illustrated in S6 Fig, and the detailed steps of constructing the FDNN are as follows: Step 1: The training of FRBM layer by layer First, the vectors that are constituted by related data of textual advertising click logs are regarded as the input of FGBRBM, and the CD1 algorithm is used to train FGBRBM; in this way, the parameter θ1 of FGBRBM is obtained. Based on the input vectors and the parameter θ1, the hidden nodes h1 of FGBRBM are obtained.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

12 / 24

Advertising CTR prediction based on FDNN

Second, the parameter θ2 of the current FRBM can be obtained when h1 is regarded as the input data for the CD1 algorithm. Based on h1 and θ2, the hidden nodes h2 of the current FRBM can be obtained. Third, the previous process is repeated until reaching the fixed layer. Step 2: Constructing FDBN by means of FRBMs According to the training procedure of FRBM in step 1, the weight values of FRBMs are connected from bottom to top to construct FDBN. In this FDBN, the connection between the top two layers is undirected and the other connection is directed from top to bottom [22]. Step 3: Constructing the FDNN that is used for advertising CTR prediction by means of FDBN After an FLR model is added onto the top of the FDBN, the probability that the user clicks this advertisement is equal to the output value of the nodes of the input layer. At this point, the network becomes the initially trained fuzzy deep neural network (FDNN). Step 4: Fine-tuning the parameters of FDNN After FDNN is initially trained layer by layer, a fine-tuning process needs to be conducted. In contrast to the unsupervised learning approach in the initial training, the fine-tuning is implemented using supervised learning. Based on the weight values of the network that were obtained from the initial training, error backpropagation [27, 29] and stochastic gradient descent are used to update the weight values and bias of the neural network. After fine-tuning, the process of training FDNN is complete. The output of FDNN is in the form of a probability: pðclickja; u; cÞ ¼ pðyd ¼ 1jx0d ; θ0 Þ. The process of fine-tuning the parameters of FDNN contains two phases. 1) Forward calculation of the input signal: When labeled data (xd, yd) of click logs are given, that is, the combined feature vector x = (a, u, c) = (x1, x2, . . ., xn) is given, the jth node in the lth layer (l = 2, ‥, Ln) of the FDNN deep neural network can be defined as follows: Ln denotes the total number of layers in the FDNN network; sl denotes the number of nodes in the lth layer of the FDNN network; wlji denotes the connection weight between the jth node in the lth layer and the ith node in Psl 1 l l 1 the l − 1th layer; blj denotes the bias of the jth node in the lth layer; netjl ¼ blj þ i¼1 wji oi denotes the weighted-sum input of the jth node in the lth layer; olj denotes the output value of the activation function of the jth node in the lth layer; The selected activation function is sigmoid function. In the forward calculation, the input signal is transmitted from the input layer to the output layer. The input value of the jth node in the second layer of FDNN (l = 2) can be described as Ps1 2 netj2 ¼ b2j þ i¼1 wji xi . When the input of node j is transferred by the activation function, its output value can be described as o2j ¼ gðnetj2 Þ. Therefore, the input value of the jth node in the Psl 1 l lth layer of FDNN (l = 3, . . ., Ln) can be described as netjl ¼ blj þ i¼1 wji oi and its corresponding output value can be described as olj ¼ gðnetjl Þ. In particular, the output value of the jth node in the th layer of FDNN can be described as h y ðxd Þ ¼ Oj ¼ O1 ¼ OL1n :

ð30Þ

2) Backpropagation of the error signal: This paper regards the cross-entropy as the objective function [25]. It can be described as formula (31). The local gradient will be refined using backpropagation. The error signal between the output value and labeled signal will be transmitted from the output layer to the

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

13 / 24

Advertising CTR prediction based on FDNN

input layer in the process. J ðxd ; yÞ ¼ yd logh y ðxd Þ þ ð1

yd Þlogð1

h y ðxd ÞÞ ð31Þ

L

¼ yd loggðnetj n Þ þ ð1

L

yd Þlogð1

gðnetj n ÞÞ;

In this formula, yd is the label value of the sample, which can be represented as yd 2 [0, 1]. The gradient descent method will be adopted to learn the model. This paper will first define the residual error dlj [28] of the jth node in the lth layerlayer as the partialand of objective function J with respect to the weighted sum netjl of node j. The residual error dLj n of the j th neuron node of the Ln th layer can be calculated by formula (32) dLj n

@J

¼

Ln j

@net

¼ yd

1 Ln j

gðnet Þ

g 0 ðnetjLn Þ þ ð1

yd Þ

1 Ln j

gðnet Þ

1

ð32Þ

gðnetjLn Þ

¼ yd

g 0 ðnetjLn Þ

L

¼ yd O1n : The residual error dlj of the jth neuron node of the lth layer (l = 2, . . ., Ln − 1) can be calculated by formula (33) according to the chain rule [29]. dlj

¼

@J @J @olj ¼ @netjl @olj @netjl slþ1 X

¼ i¼1

slþ1 X

¼

! @netilþ1 @netilþ1 @olj @J

! lþ1 i

d

lþ1 ji

w

¼

@olj @netjl

@gðnetjl Þ

ð33Þ

@netjl

i¼1

slþ1 X

! lþ1 i

d

lþ1 ji

gðnetjl Þð1

w

gðnetjl ÞÞ

i¼1

slþ1 X

¼

! dlþ1 wlþ1 olj ð1 ji i

olj Þ;

i¼1

where wlþ1 denotes the connection weight between the jth node of the lth layer and the ith ji node of l + 1th layer. According to formulas (2) and (10), the gradient of cost function J ðW ; bÞ with respect to parameters wlji and blj can be calculated by formulas (34), (35), (36) and (37). @J @J @netjl ¼ ¼ dlj L oli @wljiL @netjl @wljiL

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

1

L

;

ð34Þ

14 / 24

Advertising CTR prediction based on FDNN

@J ¼ dlj L ; @bljL @J @J @netjl ¼ ¼ dlj R oli @wljiR @netjl @wljiR

ð35Þ

1

R

;

ð36Þ

@J ¼ dlj R : @bljR

ð37Þ

The above process is described in Algorithm 4. Algorithm 4 The algorithm for fine-tuning the parameters of FDNN using BP Input: (xd, yd) is the labeled training data of textual advertising click logs; ε is the learning rate; wljiL is the lower bound of the connection weight between the jth node in the lth layer and the ith node in the (l − 1)th layer of FDNN; bljL is the lower bound of the bias of the jth node in the lth layer of FDNN; wljiR is the upper bound of the connection weight between the jth node in the lth layer and the ith node in the (l − 1)th layer of FDNN; bljR is the upper bound of the bias of the jth node in the lth layer of FDNN; Output: The latest parameters: wljiL , wljiR , bljL , bljR BEGIN 1. In the forward-propagation calculation, when the input data vector xd is fed into the input layer, the output value of FDNN can be obtained using formula (30); 2. In the calculation of error backpropagation, obtain @[email protected] by means of formula (34); ji L

obtain

@J @bljL

obtain

@J @wlji R

obtain

@J @bljR

by means of formula (35); by means of formula (36); by means of formula (37);

3. The parameters can be updated according to the following rules: wljiL ¼ wljiL þ ε @[email protected] ; ji L

bljL ¼ bljL þ ε @[email protected] ; jL

wljiR ¼ wljiR þ ε @[email protected] ; ji R

bljR ¼ bljR þ ε @[email protected] ; jR

4. return wljiL , bljL , wljiR , bljR . END

Experiment datasets and environment Description of dataset. This paper uses the Criteo Display Advertising Challenge dataset, which is provided by the Criteo company on the famous machine learning website Kaggle for advertising CTR competition [30]. Every advertisement log record contains the set of features of the displayed advertisement, which is composed of 13 numerical features and 26 categorical

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

15 / 24

Advertising CTR prediction based on FDNN

features. The detailed formats of these features are given in S1 Table. The class Label is used to indicate whether this advertisement is clicked: if the advertisement is clicked, the value of Label is 1; otherwise, the value of Label is 0. Categorical features are desensitized using Harsh mapping; therefore, real meaning can not be obtained from them. To adequately evaluate the performance of this model and prevent the interference of many factors such as data timeliness, we randomly disrupt the order of the raw samples in the experiment. Meanwhile, to avoid the over-fitting, the 5-fold cross validation method [31] is adopted in the training process of all experiments. Experiment environment. Six high-performance workstations are used for the experiments. The hardware and software configurations of the experimental environment are shown as S2 Table.

Results and analysis Baseline models This paper selects the logistic regression (LR), Factorization Machines (FMs) [32], GBDT+FM [8], and DBNLR [11] models as the baseline models for performance comparison.

Performance evaluation metric of models AUC. This paper uses the area under the receiver operating characteristic curve (AUC) [33, 34] as the main performance evaluation metric for advertising CTR prediction. The confusion matrix is shown as S3 Table [35, 36]. Formulas TPR = TP/(TP + FN) and FPR = FP/(FP + TN) are used to calculate the corresponding TPR and FPR values of the confusion matrix to obtain a co-ordinate pair (dot pair). The dot pairs make up the receiver operating characteristic (ROC) curve [37]. The AUC is the total area under the ROC. The larger the AUC is, the more accurate the advertising CTR prediction is, and the better the effect achieved by the advertising placement is. Log-loss. This paper selects the logarithmic loss function ‘log-loss’ [38, 39] as another auxiliary performance evaluation metric for advertising CTR prediction. logloss ¼

L 1X y logy i þ ð1 L i¼1 i

yi Þlogð1

y i Þ;

ð38Þ

where L is the number of samples, yi is the true label (0 or 1), and y i is the predicted probability. The value of log-loss reflects the similarity between the CTR predicted by model and real CTR. The smaller the value of log-loss is, the more accurate the advertising CTR prediction is.

Experiments design and results analysis Configuration optimization and performance analysis of FDNN and DBNLR. A. Feature Pre-processing FDNN is similar with DBNLR in the architecture. Their difference lies in the components of model. Therefore, this paper selects the DBNLR as a baseline model. In the experiment on the FDNN and DBNLR models, the numerical features of the raw dataset are directly adopted, and these categorical features are expanded to binary vectors using one-hot representation. Although the method of one-hot representation is straightforward and effective, it makes the feature vectors sparse and high-dimensional. In the deep learning models, when the dimensionality of the features is large, the scale of the parameters of the model becomes large, which leads to a sharp increase in training time. Moreover, the hardware will be a

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

16 / 24

Advertising CTR prediction based on FDNN

bottleneck in the training process. Therefore, this paper adopts a feature hashing strategy (signed hashing trick) [40] to reduce the dimensionality of these sparse binary feature vectors. Finally, the joint vector that is composed of numerical features and categorical features and has been processed by the feature hashing strategy is regarded as the input of the two deep architecture models. Input vectors can be obtained when the data of the Display Advertising Challenge dataset are preprocessed using the feature preprocessing methods described above. In this experiment, after preprocessing, the dimensionality of the input vector is 1746088. Therefore, the number of nodes in the visible layer (input layer) is 1746088. B. The Parameter Settings for Training FDNN and DBNLR These two models are developed based on the Theano framework [41] by means of Python. In all experiments, the maximum number of learning epochs in the training process is set to a sufficiently large value (800) and the Early-Stopping strategy is adopted to automatically cut epochs for obtaining the best validation score. Furthermore, this strategy can also avoid overfitting. The steps of the Early-Stopping strategy are as follows: 1. Split the dataset into a training set and a validation set; 2. When the training of every epoch has been finished, calculate the validation error (loss) in the validation dataset; 3. When the validation error of the validation dataset no longer decreases, record the number of epochs and the current validation error value. If the validation error of the validation dataset does not increase after continuation, the training process is terminated, the best validation score is obtained, and optimization is complete. Meanwhile, the optimal parameter estimation values are obtained. In the initial training process of FDBN and DBNLR, the gradient descent method with minibatch is used for training on these samples and each minibatch includes 2000 training samples. For FGBRBM (or GBRBM), the learning rate is 0.01, while for FRBM (or RBM) with the binary variable, the learning rate is 0.1. The decay factor of the weight value is 0.0002. In the fine-tuning process of FDNN and DBNLR, the gradient descent method with minibatch is used for training on these samples and each minibatch includes 4000 training samples. The decay factor of the weight value is 0 and the learning rate is set as 0.005. Dropout represents the probability that a node is retained in the neural network [42, 43]. A reasonable sparse dropout strategy can strengthen the robustness of a neural network. Dropout is utilized to reduce over-fitting and the complexity of these two deep learning models. The paper sets the dropout rate to 0.5, 0.6, 0.7, 0.8, and 0.9 to implement the testing experiments. Under the condition of having the same configuration structure, the dropout rates are respectively 0.5 and 0.6 when FDNN and DBNLR use their own optimal settings. C. Optimization of the number of hidden layers and hidden nodes Currently, there is no method for quickly setting the number of hidden layers and the number of hidden nodes in each layer. Therefore, many experiments and much experience are utilized to search for the optimal structure. According to past experience, when the number of nodes of the visible layer is large, the nodes of the hidden layer need to perform “dimensionality reduction- dimensionality addition—dimensionality reduction”. However, when the number of nodes of the visible layer is small, the nodes of the hidden layer only need to perform “dimensionality addition—dimensionality reduction”. In general, the number of hidden nodes will be increased or decreased with multiple. By comparing the performances in many experiments with different configurations (different numbers of nodes of each hidden layer and numbers of hidden layers), the configuration with relatively optimal performance can be obtained.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

17 / 24

Advertising CTR prediction based on FDNN

In S4 Table, information on different configurations in many experiments is given. In S7 Fig, the corresponding AUC values of DBNLR and FDNN with different configurations are given. In S8 Fig, the corresponding log-loss values of DBNLR and FDNN with different configurations are given. As presented in S7 and S8 Figs, in the experiments for training FDNN, in the beginning, increasing the number of hidden layers can improve the performance of the proposed method; however, when the number of hidden layers is more than 3, the performance is degraded. This is because FDNN does not adequately capture feature interactions when the number of hidden layers is small. However, when the number of hidden layers and the number of nodes in the hidden layers exceed a certain scale, cross-features disturb and degrade the accuracy of CTR prediction. The over-fitting phenomenon is observed and the algorithm spends more time learning the parameters. After many experiments have been completed, a configuration with relatively optimal performance is obtained: Conf11, the number of hidden layers is 3; the number of nodes of the first hidden layer is 170; the number of nodes of the second hidden layer is 1700; and the number of nodes of the third hidden layer is 17. This paper chooses this configuration as the final structure of FDNN. Corresponding AUC and log-loss on the test dataset can be seen in S9 and S10 Figs. In the experiments for training DBNLR, relatively optimal performance is obtained when the configuration structure is as follows: Conf7, the number of hidden layers is 3; the number of nodes of the first hidden layer is 150; the number of nodes of the second hidden layer is 1500; and the number of nodes of the third hidden layer is 15. This paper chooses this configuration as the final structure of DBNLR. Corresponding AUC and log-loss on the test dataset can be seen in S9 and S10 Figs. We observe that the curve of the AUC value of the DBNLR model and that of the FDNN model are indented. However, when they have the same configuration, the overall level of performance of DBNLR is worse than that of FDNN; this is true for the log-loss index. It is because fuzzy set techniques are introduced in the proposed FDNN model. The components of FDNN model are fuzzy RBM, fuzzy GBRBM and fuzzy LR. With the data of many records in the Display Advertising Challenge dataset which contains outliers and missing values implying many uncertainties, the uncertainties can be taken into consideration and efficiently handled by fuzzy set techniques; therefore, the performance of the FDNN model for CTR prediction is better than that of DBNLR in the experiment. Configuration optimization of other baseline models. A. LR Solution CTR prediction in a real advertising system involves a large number of samples, while logistic regression (LR) model can appropriately address a large number of features and can be trained rapidly, so the LR model is selected as a baseline model in this paper. This paper selects LIBLINEAR [44] to implement L2-regularized logistic regression, and the stochastic gradient method (SG) is adopted to optimize the parameters of LR. In the LR solution, the numerical features are used directly and the categorical features are first represented with one-hot encoding and then used to establish the feature space with MurmurHash 3. After many experiments, this paper adopts the following parameters: η = 0.2, λ = 0. B. FM Solution For advertising CTR prediction, Factorization Machines (FMs) combine the advantages of Support Vector Machines with factorization models, and can characterize the basic feature interactions of a large number of advertising data, therefore, this paper selects the FM model as a baseline model [45]. For advertising CTR prediction, Factorization Machines (FMs) is also adopted, which is based on feature engineering and matrix design. This paper adopts LIBFM to implement factorization machines [46, 47] and the stochastic gradient method (SG) is

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

18 / 24

Advertising CTR prediction based on FDNN

adopted to optimize the parameters of FM. The numerical features are used directly and the categorical features are first represented with one-hot encoding. After many experiments, this paper adopts the following parameters: λ = 40, k = 30. C. GBDT+FM Solution Gradient Boost Decision Tree (GBDT) is a nonlinear model that is based on the concept of collective learning boosting. GBDT has strong extracting feature ability and nonlinear fitting ability, and it can flexibly handle various types of data in the advertising dataset including continuous value and discrete value, but no over-fitting occurs. Therefore, this paper selects GBDT+FM solution as a baseline model. Based on the work of Y Juan et al., this paper implements GBDT+FM solutions for performance comparison and analysis. First, in Preprocessing-A, all numerical features are used directly and all categorical features are represented with one-hot encoding. Then, these features are fed into decision trees in GBDT to generate GBDT features. This paper selects Xgboost to implement the GBDT [48]. In Preprocessing-A, the number of decision trees in GBDT is 30 and the depth is 7. Thirty features are generated for each impression. The following process is Preprocessing-B, which is used for generating input features for FM. In Preprocessing-B, numerical features (I1-I13) with values greater than 2 are transformed by v blog(v)2c; categorical features (C1-C26) that appear fewer than 10 times are transformed into a special value; and GBDT features are directly included. Therefore, each record contains 69 features: 13 numerical features, 26 categorical features and 30 GBDT features. Then, these three groups of features are hashed into 1 M dimensions using a hashing trick. Finally, the processed features are fed into the FM model for CTR prediction. The parameters of the FM model that achieve the relatively optimal results are “λ = 40, k = 100”. Experimental results and analysis. According to S9 and S10 Figs, the proposed FDNN method has the best performance and LR has the worst performance. In the process of training models, it can be found that logistic regression model can be trained rapidly, but its effect is in the general level. One reason is that when LR is used to predict CTR, it requires the experts with sophisticated experience to implement feature engineering through the human labor. The other reason is that the simple structure of LR restrains its representation ability. These two disadvantages cause that LR cannot learn complex feature interactions and is insensitive to outliers and missing values. Therefore, LR has the relatively bad performance in the experiment. According to S9 and S10 Figs, FM is superior in performance to LR. The reason is that FM extends the idea of LR because FM adds to second order fitting on the basis of first order fitting and it can automatically learn cross features of any two features from different dimensions. And cross features are expressed with embedding vector. Compared with LR, FM model has fewer manual intervene but higher efficiency. Therefore, FM has better representation ability than LR in the experiment. However, FM also cannot learn much more complex feature interactions because of its relatively simple structure. In this experiment, GBDT+FM solution outperform the FM model. The GBDT+FM solution obtains good prediction results, which are slightly inferior to those of the proposed method. The GBDT can learn useful high-order feature interactions because it has strong nonlinear fitting ability and the robustness for hyperparameter, so it can efficiently extracts discriminative features and cross-features of advertising dataset, and shows better performance than FM model in our CTR prediction experiment. Compared with the GBDT+FM solution, the proposed method achieves more than 0.09% improvement in terms of AUC (0.01% in terms of log-loss) on the dataset. The reasons are as follows. The first is that the quality of data exerts a great influence on effect of model. Because the dataset contains many noisy data, there are bottlenecks in the effects of GBDT+FM

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

19 / 24

Advertising CTR prediction based on FDNN

solution. Compared with this model, the proposed FDNN model, as a result of being endowed with fuzzy technology, can capture more-complex high-order feature interactions of the advertising dataset, which contains many outliers and missing values. The second is that GBDT+FM solution, belonging to a kind of stacking combined model without backproportion, is not always better than the proposed model with backproportion in effect improvement for CTR prediction from the theoretical perspective. Therefore, it demonstrates relatively good performance in terms of both data representation capability and prediction accuracy. In a real production environment for an advertising system, a small improvement in advertising CTR is likely to result in a significant increase in financial income for the internet company and an improved user experience.

Discussion This paper proposes a fuzzy deep neural network (FDNN) to address the problem of advertising CTR prediction. The network’s performance is compared with those of several baseline models in real advertisement datasets. Due to the fuzziness that is introduced into the proposed model, it can learn the most useful high-order feature interactions and demonstrates good performance in terms of both data representation capability and robustness in advertisement datasets with noise. Big data bring new perspectives to these fields and challenges in processing and discovering valuable information. The traditional artificial intelligence techniques can no longer effectively extract and organize the discriminative information from raw advertisement data directly [49]. This paper provides a basis for the further application of fuzzy deep neural networks in advertising CTR prediction. Future work will focus on exploring additional unsupervised feature learning and deep architecture models to improve the CTR prediction for internet advertising. In addition, in big data era, a real production environment for advertising CTR prediction requires that the advertising system return the prediction results in milliseconds or even microseconds. Determining how to implement low-lag and high-efficiency impromptu AdHoc analysis to predict and return results based on a big dataset is a great challenge for big data processing systems. Big data systems with stream computing, such as Spark Streaming, Storm, and Flink, can implement analysis and query in real time. However, because of the limited memory capacity and loss of raw historical data in system, Ad-Hoc analysis and query for a big dataset cannot be implemented. Therefore, exploring CTR prediction solutions based on streaming big data processing technology is another future direction of the work of this paper.

Supporting information S1 Fig. Triangular fuzzy number. (ZIP) S2 Fig. Structure of fuzzy restricted boltzmann machine. (ZIP) S3 Fig. Structure of fuzzy gaussian-bernoulli restricted boltzmann machine. (ZIP) S4 Fig. Modeling click-through rate prediction by means of LR. (ZIP) S5 Fig. Architecture of the FDNN model and its process of predicting CTR. (ZIP)

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

20 / 24

Advertising CTR prediction based on FDNN

S6 Fig. The process of constructing FDNN. (ZIP) S7 Fig. AUC values of FDNN and DBNLR with different numbers of layers and nodes. (ZIP) S8 Fig. Log-loss values of FDNN and DBNLR with different numbers of layers and nodes. (ZIP) S9 Fig. Comparison of AUC among all models on the test dataset. (ZIP) S10 Fig. Comparison of log-loss among all models on the test dataset. (ZIP) S1 Table. Data format of dataset. (ZIP) S2 Table. Experimental environment. (ZIP) S3 Table. Confusion matrix. (ZIP) S4 Table. Configure information of FDNN and DBNLR with different numbers of layers and nodes. (ZIP)

Acknowledgments This work is supported by the Natural Science Research Project of Education Department of Guizhou Province of China (No.KY2017350), the Project Funds of the Digital Engineering Research Center for Local National Culture of Universities in Guizhou (No.KY2013227) and the Scientific Research Project of Qiannan Normal University for Nationalities (No. qnsy2017006). The authors are thankful for the research work in all referenced manuscripts.

Author Contributions Formal analysis: Zilong Jiang, Shu Gao. Funding acquisition: Mingjiang Li. Investigation: Shu Gao, Mingjiang Li. Methodology: Zilong Jiang. Project administration: Mingjiang Li. Software: Zilong Jiang. Writing – original draft: Zilong Jiang. Writing – review & editing: Zilong Jiang.

References 1.

Sami Najafi-Asadolahi, Fridgeirsdottir Kristin. Cost-Per-Click Pricing for Display Advertising. Journal Manufacturing & Service Operations Management. 2014; 16(4):482–497. http://doi.org/10.1287/ msom.2014.0491

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

21 / 24

Advertising CTR prediction based on FDNN

2.

Z Pan, E Chen, Q Liu, T Xu, H Ma, H Lin. Sparse Factorization Machines for Click-through Rate Prediction. IEEE 16th International Conference on Data Mining (ICDM 2016), Barcelona, Spain, 2016; pp.400–409.

3.

Chapelle Olivier, Manavoglu Eren, Ro´mer Rosales. Simple and scalable response prediction for display advertising [J]. ACM Transactions on Intelligent Systems and Technology, 2014, 5(4):1–34. http://doi. org/10.1145/2532128

4.

H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, et al. Ad click prediction: a view from the trenches. Proceedings of the 19th ACM international conference on Knowledge discovery and data mining (ACM SIGKDD 2013), New York, USA, 2013; pp.1222– 1230.

5.

Anh-Phuong TA. Factorization Machines with Follow-The-Regularized-Leader for CTR prediction inDisplay Advertising. 2015 IEEE International Conference on Big Data (IEEE BigData 2015). Santa Clara, CA, USA, 2015; pp. 2889–2891.

6.

Afroze Ibrahim Baqapuri, Ilya Trofimov. Using Neural Networks for Click Prediction of Sponsored Search. arxiv preprint arXiv; 1412.6601, 2014.

7.

Kushal Dave, Vasudeva Varma. Predicting the Click-Through Rate for Rare/NewAds. Centre for Search and Information Extraction Lab International Institute of Information Technology Hyderabad, INDIA, April 2010, pp.1–18.

8.

Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, Chih-Jen Lin. Field-aware Factorization Machines for CTR Prediction. ACM Conference on Recommender Systems (ACM RecSys 2016), Boston, United States, 2016; pp. 43–50.

9.

Y Zhang, H Dai, C Xu, J Feng, T Wang, J Bian, et al. Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks. 28th AAAI Conference on Artificial Intelligence (AAAI 2014), Que´bec City, Que´bec, Canada, 2014; pp.1369–1375.

10.

Yu Shimin. Prediction of ads’click through rate based on recurrent neural network. Hangzhou: Zhejiang sci-tech university, 2016; pp.12–15.

11.

Jiang Z, Gao S and Dai W. Research on CTR Prediction for Contextual Advertising Based on Deep Architecture Model. Control Engineering and Applied Informatics. 2016; 18(1):11–19.

12.

Junxuan Chen, Baigui Sun, Hao Li, Hongtao Lu, Xian-Sheng Hua. Deep CTR Prediction in Display Advertising. Proceedings of the 24th ACM on Multimedia Conference (ACM MM 2016), Amsterdam, Netherlands, 2016; pp.811–820.

13.

Wang H, Xu Z, Pedrycz W. An overview on the roles of fuzzy set techniques in big data processing: Trends, challenges and opportunities. Knowledge-Based Systems, 2017, 118:15–30. http://doi.org/ 10.1016/j.knosys.2016.11.008

14.

Xu Zeshui. A Method for Priorities of Triangular Fuzzy Number Complementary Judgement Matrices. Fuzzy Systems & Mathematics, 2002; 16(1):47–50.

15.

Asghari-Larimi M. Upper and lower (alpha, beta)-intuitionistic fuzzy set. International Mathematical Forum, 2012(9–12); pp.531–538.

16.

Hamrawi Hussam, Coupland Simon, John Robert. Type-2 Fuzzy Alpha-Cuts. IEEE Transactions on Fuzzy Systems, 2017; 25(3):682–692. http://doi.org/10.1109/TFUZZ.2016.2574914

17.

Dutta Soma and ChakrabortyMihir K. Fuzzy relation and fuzzy function over fuzzy sets: A retrospective. Soft Computing, 2015; 19(1):99–112. http://doi.org/10.1007/s00500-014-1356-z

18.

Klimt Gustav, Buckley James J. Fuzzy Probabilities: New Approach and Applications, NewYork: Physica-Verlag, 2003; 20 (3):363–372.

19.

Runkler T.A., Glesner Manfred. DECADE—fast centroid approximation defuzzification for real time fuzzy control applications. DBLP, 1994:161–165.

20.

Chen C.L. Philip, Zhang Chun-Yang, Chen Long, and Gan Min. Fuzzy Restricted Boltzmann Machine for the Enhancement of Deep Learning. IEEE Transactions on Fuzzy Systems, 2015; 23(6):2163– 2173. http://doi.org/10.1109/TFUZZ.2015.2406889

21.

Gelfand Alan E. Journal of the American Statistical Association, 2000, vol. 95, no. 452, pp.1300–1304. http://doi.org/10.1080/01621459.2000.10474335

22.

Hinton Geoffrey E., Osindero Simon, Teh Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 2006; 18(7):1527–1554. http://doi.org/10.1162/neco.2006.18.7.1527 PMID: 16764513

23.

Hyun-Chul Kim, Jong-Hwan Lee. Evaluation of weight sparsity regularizion schemes of deep neural networks applied to functional neuroimaging data. The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, 2017; pp.6150– 6154.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

22 / 24

Advertising CTR prediction based on FDNN

24.

Osman Taher, Divigalpitiya Prasanna, Arima Takafumi. Driving factors of urban sprawl in Giza governorate of the Greater Cairo Metropolitan Region using a logistic regression model. International Journal of Urban Sciences, 2016; 58:1–20.

25.

Shrestha Ravi, Mohammed Shahed Khan, Hasan Md. Mehedi, Wahid Khabou. Automated Adaptive Brightness in Wireless Capsule Endoscopy Using Image Segmentation and Sigmoid Function. IEEE Transactions on Biomedical Circuits & Systems. 2016; 10(4):884. http://doi.org/10.1109/TBCAS. 2016.2546838

26.

Yann LeCun, Bengio Yoshua, and Hinton Geoffrey. Deep learning, Nature, 2015; 521(7553): 436–444. http://doi.org/10.1038/nature14539

27.

Horikawa SI, Furuhashi T, Uchikawa Y. On fuzzy modeling using fuzzy neural networks with the backpropagation algorithm. IEEE Transactions on Neural Networks, 1992; 3(5):801–806. http://doi.org/10. 1109/72.159069

28.

Mosca A, Magoulas GD. Advances in Computational Intelligence Systems. New York: Springer Publishing Company, 2017; pp.433–445.

29.

Dreyfus SE. Artificial neural networks, back propagation, and the Kelley-Bryson gradient procedure. Journal of Guidance Control & Dynamics, 2012; 13(5):926–928.

30.

Kaggle Display Advertising Challenge Dataset.[EB/OL]. http://labs.criteo.com/2014/02/kaggle-displayadvertising-challenge-dataset/

31.

Triba MN, Le ML, Amathieu R, Goossens C, Bouchemal N. PLS/OPLS models in metabolomics: the impact of permutation of dataset rows on the K-fold cross-validation quality parameters. Molecular Biosystems, 2015; 11(1):13. http://doi.org/10.1039/C4MB00414K

32.

Chen C, Hou C, Xiao J, Yuan X. Purchase Behavior Prediction in E-Commerce with Factorization Machines. IEICE Transactions on Information and Systems, 2016, E99.D(1), pp.270–274. http://doi. org/10.1587/transinf.2015EDL8116

33.

Akimova Tatiana, Levine Matthew H., Beier Ulf H., Hancock Wayne W. Standardization, Evaluation, and Area-Under-Curve Analysis of Human and Murine Treg Suppressive Function. Suppression and Regulation of Immune Responses. 2016; vol.1371, pp.43–78. http://doi.org/10.1007/978-1-49393139-2_4

34.

Alberto Jime´nez-Valverde. Insights into the area under the receiver operating characteristic curve (AUC) as a discrimination measure in species distribution modeling. Global Ecology and Biogeography. 2012; 21(4):498–507. http://doi.org/10.1111/j.1466-8238.2011.00683.x

35.

Deng X, Liu Q, Deng Y, Mahadevan S. An improved method to construct basic probability assignment based on the confusion matrix for classification problem, Information Sciences, 2016; 340: 250–261.

36.

Ohsaki M, Wang P, Matsuda K, Katagiri S, Watanabe H, Ralescu A. Confusion-Matrix-Based Kernel Logistic Regression for Imbalanced Data Classification. IEEE Transactions on Knowledge & Data Engineering, 2017; 29(9):1806–1819. http://doi.org/10.1109/TKDE.2017.2682249

37.

Talabani A. Jamal, Endreseth B.H., Lydersen S., Edna T.-H. Clinical diagnostic accuracy of acute colonic diverticulitis in patients admitted with acute abdominal pain, a receiver operating characteristic curve analysis. International Journal of Colorectal Disease. 2017; 32(1):41–47. http://doi.org/10.1007/ s00384-016-2644-0

38.

sklearn.metrics.log_loss.[EB/OL]. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_ loss.html.

39.

Logarithmic Loss http://www.Kaggle.com/wiki/LogarithmicLoss.

40.

sklearn.feature_extraction. FeatureHasher.[EB/OL]. http://scikit-learn.org/stable/modules/generated/ sklearn.feature_extraction.FeatureHasher.html.

41.

Theano.[EB/OL]. http://deeplearning.net/software/theano/.

42.

Gao W, Zhou Z. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences, 2016; 59(7):072104. http://doi.org/10.1007/s11432-015-5470-z

43.

Z Li, B Gong, T Yang. Improved Dropout for Shallow and Deep Learning. The 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 2016; pp. 1–8.

44.

liblinear.[EB/OL]. http://www.csie.ntu.edu.tw/cjlin/liblinear/.

45.

R.J. Oentaryo, E.P. Lim, J.W. Low, D. Lo, M. Finegold. Predicting response in mobile advertising with hierarchical importance-aware factorization machine. The Seventh ACM International Conference on Web Search & Data Mining (ACM WSDM 2014), New York, USA, 2014; pp. 123–132.

46.

Rendle Steffen. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology, 2012; 3(3): Article 57(1–22). http://doi.org/10.1145/2168752.2168771

47.

libfm.[EB/OL]. http://www.libfm.org/.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

23 / 24

Advertising CTR prediction based on FDNN

48.

Xgboost. Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more.[EB/OL]. http://github.com/dmlc/xgboost.

49.

Lei Y, Jia F, Lin J, Xing S, and Ding SX. An Intelligent Fault Diagnosis Method Using Unsupervised Feature Learning Towards Mechanical Big Data. IEEE Transactions on Industrial Electronics, 2016; 65 (5):3137–3147.

PLOS ONE | http://doi.org/10.1371/journal.pone.0190831 May 4, 2018

24 / 24

Loading...