### Author: Jitesh Shah, Sr. Engineering Manager.

### Company: Integrated Device Technologies

So you are running procurement operations within your organization, and every time a request for quote (RFQ) is sent to price a given product option, you get feedback from, say, five different vendors (by email or in an Excel sheet, of course). Provided your past experience with each of the vendors was positive with respect to product quality, reliability and on-time delivery, the order usually goes to the lowest bidder. But how do you know that you got the best price? What if the lowest bid was higher than it should be, given what you paid for a similar product six months back?

The sheer number of variables and processes involved in manufacturing a semiconductor chip can leave the entire process of vendor pricing in disarray, especially when done manually, as it would be downright impossible to accurately predict what a product should cost without a formal, data-based approach. And it’s a near certainty every time you go through that RFQ process that you will overpay, unless you can scan through thousands of quotes or correctly calculate the influence of the many cost drivers.

Harnessing the data you already collect can have an immediate impact on your bottom line. Besides product pricing, companies can actively use data to better forecast inventory turns, design better processes and optimize product yields. This write-up focuses on managing vendor pricing and establishing a framework around the best practices in collecting the “right” data to build a self-aware, self-evolving model that wakes up and updates itself every time new data trickles in.

The First Steps…

Understanding the major influencers of a product’s cost is the ideal first step, because it helps define the basic framework that will eventually relate these influencers to the cost of a product. Process features, vendors, manufacturing locations, RFQ date and the mix of raw materials could be some of the cost drivers, but there could be other associated factors only an “expert” would know, and that domain knowledge is often critical in building a robust and reliable model.

Alternatively, one can include everything in sight and the kitchen sink, but then you need to be able to weed out variables that simply correlate with the cost rather than logically cause cost to change (the classic correlation vs. causation problem). Then you have to worry about things like multicollinearity, where you have to decide whether to keep or remove variables on the right-hand side of the model equation that exhibit strong correlations with each other. Oftentimes it’s best to remove these variables, but keeping them can make more sense if each explains a component of cost that is not explained by the other, though one then has to live with a relatively higher standard error for the estimated parameters.

The biggest concern with these analyses, however, is usually that you fail to properly account for one or more critical variables, which biases the extracted parameter estimates. A careful analysis of the residuals should make that evident, and hopefully it can be corrected before conclusions are drawn from the derived model. There are many other intertwined steps before developing a fully functional model, the most important being deciding on the basic structure. A general framework relating cost to all these factors is shown below.
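One way to write that general framework, as a sketch only: cost C is some function f of n continuous attributes A and m dichotomous (0 or 1) factors D. The symbols f, n and m are assumptions introduced here for illustration.

```latex
C = f\left(A_1, A_2, \dots, A_n;\; D_1, D_2, \dots, D_m\right)
```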

Functional Form

With the dataset ready and the key variables identified, a critical next step is to identify how the variables on the right side of the equation mathematically relate to cost, a dependent variable on the left.

There are in general three basic forms of this mathematical relationship between the variables on the two sides: linear, power and exponential (shown below).
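A sketch of the three forms, using C for cost, A for the continuous attributes, D for the dichotomous variables, and alphas and betas for the parameters. The exact layout is an assumption, in particular the choice to place the dummy variables inside an exponential in the power form.

```latex
\begin{aligned}
\text{Linear:}\quad      & C = \alpha_0 + \alpha_1 A_1 + \alpha_2 A_2 + \dots + \beta_1 D_1 + \beta_2 D_2 + \dots \\
\text{Power:}\quad       & C = \alpha_0 \, A_1^{\alpha_1} A_2^{\alpha_2} \cdots \, e^{\beta_1 D_1 + \beta_2 D_2 + \dots} \\
\text{Exponential:}\quad & C = \alpha_0 \, e^{\alpha_1 A_1 + \alpha_2 A_2 + \dots + \beta_1 D_1 + \beta_2 D_2 + \dots}
\end{aligned}
```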

The choice between the three functional forms comes down to which model best fits the data. That goodness-of-fit criterion should apply not only to in-sample data points but also to out-of-sample data points.
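The out-of-sample comparison can be sketched as follows. This is a minimal illustration, not a prescribed method: the data are synthetic, the helper names (`design`, `fit`, `predict`, `out_of_sample_rmse`) are hypothetical, and the log-based forms are converted back to cost units with a plain `exp`, which ignores retransformation bias.

```python
import numpy as np

def design(A, D, form):
    """Build the regression matrix for a given functional form."""
    ones = np.ones((A.shape[0], 1))
    # The power form regresses on log-attributes; the others use raw attributes.
    attrs = np.log(A) if form == "power" else A
    return np.hstack([ones, attrs, D])

def fit(A, D, C, form):
    """Ordinary least squares; power and exponential forms are fitted on log(cost)."""
    y = C if form == "linear" else np.log(C)
    coef, *_ = np.linalg.lstsq(design(A, D, form), y, rcond=None)
    return coef

def predict(A, D, coef, form):
    """Predict cost in original units so all forms are compared on the same scale."""
    y_hat = design(A, D, form) @ coef
    return y_hat if form == "linear" else np.exp(y_hat)

def out_of_sample_rmse(A, D, C, train, test):
    """Fit each form on the training rows and score it on the held-out rows."""
    scores = {}
    for form in ("linear", "power", "exponential"):
        coef = fit(A[train], D[train], C[train], form)
        err = C[test] - predict(A[test], D[test], coef, form)
        scores[form] = float(np.sqrt(np.mean(err ** 2)))
    return scores

# Synthetic quotes (made up for illustration): cost follows a power law.
rng = np.random.default_rng(0)
n = 400
A = rng.uniform(1.0, 50.0, size=(n, 2))             # two continuous attributes
D = rng.integers(0, 2, size=(n, 1)).astype(float)   # one dummy, e.g. a vendor flag
C = 2.0 * A[:, 0] ** 0.7 * A[:, 1] ** 0.3 * np.exp(0.1 * D[:, 0])
C *= np.exp(rng.normal(0.0, 0.02, size=n))          # small multiplicative noise

idx = rng.permutation(n)
train, test = idx[:300], idx[300:]
scores = out_of_sample_rmse(A, D, C, train, test)
best = min(scores, key=scores.get)
print(scores, best)
```

Scoring all three forms in cost units, rather than comparing R-squared values across different dependent variables, keeps the comparison like for like.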

One additional step is required before the alphas and betas can be interpreted, and that aspect of the relationship is discussed later. For now: C is the predicted product cost; the Ax’s are continuous variables describing the attributes and the levels of those attributes; and the Dx’s are dichotomous variables such as vendor, manufacturing location, the year the cost was paid and other variables that are digital (0 or 1) in nature.

But why does the variable year have to be dichotomous, with a new variable defined for each year? We know time is an important variable because costs paid five years back would have changed due to normal inflationary pressures and technological development. The time variable accounts for these changes and is a critical component of any future cost estimate. So why not use a single variable for time instead of multiple dichotomous variables? The question is valid, and the answer depends on how statistically different the two approaches are. A test needs to be designed to check whether the year-to-year changes in cost are statistically the same; if they are, one can scale time up and down using a single coefficient. If they are not, use a separate variable for each year.
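One common way to set up such a test is an F-test on nested models: the single-trend model is a restricted version of the per-year-dummy model. The sketch below is an assumption about how the test could be designed, not the author's stated procedure. The function names and the synthetic data are hypothetical, and at least three distinct years are needed.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def trend_vs_year_dummies(year, y, X_other):
    """F statistic for H0: a single linear time trend fits as well as per-year dummies.

    X_other holds the remaining regressors, including the intercept column.
    """
    years = np.unique(year)
    t = (year - years[0]).astype(float).reshape(-1, 1)
    # One dummy per year except the first, which serves as the baseline.
    dummies = (year[:, None] == years[None, 1:]).astype(float)
    X_restricted = np.hstack([X_other, t])         # single time coefficient
    X_full = np.hstack([X_other, dummies])         # one coefficient per year
    q = X_full.shape[1] - X_restricted.shape[1]    # number of restrictions
    dof = len(y) - X_full.shape[1]                 # residual degrees of freedom
    rss_r, rss_f = rss(X_restricted, y), rss(X_full, y)
    F = ((rss_r - rss_f) / q) / (rss_f / dof)
    return F, q, dof

# Synthetic data (made up): year effects that clearly do not follow a straight line.
rng = np.random.default_rng(1)
year = rng.integers(2015, 2020, size=300)
A1 = rng.uniform(1, 10, size=300)
X_other = np.column_stack([np.ones(300), A1])
uneven = np.array([0.0, 5.0, 1.0, 6.0, 0.0])[year - 2015]
y = 10 + 2 * A1 + uneven + rng.normal(0, 0.1, size=300)

F, q, dof = trend_vs_year_dummies(year, y, X_other)
# Compare F against the F(q, dof) distribution, e.g. scipy.stats.f.sf(F, q, dof)
# if SciPy is available; a large F favours keeping one dummy per year.
print(F, q, dof)
```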

Interpreting The Continuous Variables…

If the relationship ends up being linear in nature then the interpretation of the estimated parameters is pretty straightforward. Alpha_1 below represents the change in product cost with one unit change in the value or the level of attribute A1, all other factors remaining constant.

In other words, alpha_1 is the change in product cost due only to a one unit change in the value of attribute A1. The same thing goes for alpha_2, alpha_3 and so on.

That was for the linear case, but what about interpreting the continuous variables in the power or exponential case? A fundamental assumption of an OLS regression is that the dependent variable on the left is linear in the parameters being estimated. That is not the case with the power or the exponential model. But a quick math trick and you are back to a linear setup, as shown for the power model below.
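A sketch of that trick, assuming the power model has the form written on the first line (the exact form of the original equation is an assumption): taking natural logs of both sides yields an equation that is linear in the alphas and betas.

```latex
\begin{aligned}
C     &= \alpha_0 \, A_1^{\alpha_1} A_2^{\alpha_2} \cdots \, e^{\beta_1 D_1 + \beta_2 D_2 + \dots} \\
\ln C &= \ln \alpha_0 + \alpha_1 \ln A_1 + \alpha_2 \ln A_2 + \dots + \beta_1 D_1 + \beta_2 D_2 + \dots
\end{aligned}
```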

Since differences in logs approximate percent changes rather than unit changes, alpha_1 in the log-log case represents the percent change in product cost due solely to a one percent change in the value of attribute A1.

As before, the exponential case also requires a transformation, after which the log of the dependent variable on the left is linear in the parameters, as shown below.
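A sketch of the transformation, assuming the exponential model has the form on the first line (an assumption, since only the general shape is described in the text):

```latex
\begin{aligned}
C     &= \alpha_0 \, e^{\alpha_1 A_1 + \alpha_2 A_2 + \dots + \beta_1 D_1 + \beta_2 D_2 + \dots} \\
\ln C &= \ln \alpha_0 + \alpha_1 A_1 + \alpha_2 A_2 + \dots + \beta_1 D_1 + \beta_2 D_2 + \dots
\end{aligned}
```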

In this case, 100 × alpha_1 approximates the percent change in product cost due solely to a one unit change in the level of attribute A1. The approximation holds for small values of alpha_1; the exact figure is 100 × (e^alpha_1 − 1).

Interpreting The Dichotomous Variables…

These variables represent digital events instead of the more analog-type continuous variables. What we mean by that is that if a dummy variable is used representing a vendor, for example, it will take on a value of 1 if that vendor is used or will take on a value of 0 if that vendor is not used. So it holds a value of 1 or a 0 with nothing in between.

Now on to interpreting the parameters. Assume that we are trying to find the manufacturing cost difference between three vendors for exactly the same product; that is, the attribute variables A1 and A2 in the example below are held constant, and the estimates beta_1, beta_2 and beta_3 represent the change in cost for vendors 1, 2 and 3 respectively.

The dummy variable D1, however, is redundant. Each product is built by exactly one vendor, so D1 = 1 − D2 − D3, and keeping all three dummies alongside the intercept would leave the parameters impossible to estimate separately. Only two dummies are required to show the cost differences between the three vendors: if D2 = 1 and D3 = 0, the product was built by vendor 2; if D2 = 0 and D3 = 1, it was built by vendor 3; and if D2 = 0 and D3 = 0, it was built by vendor 1. The equation then reduces to the one shown below.

And the cost relationship between the three vendors with the same attributes…

So what is the expected cost difference for the same product built by vendor 2 instead of vendor 1? That’s beta_2. And between vendor 3 and vendor 1? That’s beta_3.
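A small numerical sketch of this baseline encoding, assuming the linear form C = alpha_0 + alpha_1·A1 + alpha_2·A2 + beta_2·D2 + beta_3·D3. All figures are made up, and the costs are generated without noise so the recovered estimates can be compared directly with the known parameters.

```python
import numpy as np

# Hypothetical quotes: two attributes, three vendors; numbers are made up.
A1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
A2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 9.0])
vendor = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])

# Vendor 1 is the baseline, so only D2 and D3 enter the model.
D2 = (vendor == 2).astype(float)
D3 = (vendor == 3).astype(float)

# Noise-free costs from known parameters so the fit can be checked.
alpha0, alpha1, alpha2, beta2, beta3 = 10.0, 1.5, 0.5, 2.0, -1.0
cost = alpha0 + alpha1 * A1 + alpha2 * A2 + beta2 * D2 + beta3 * D3

# Columns: intercept, A1, A2, D2, D3.
X = np.column_stack([np.ones_like(A1), A1, A2, D2, D3])
coef, *_ = np.linalg.lstsq(X, cost, rcond=None)

est_beta2, est_beta3 = coef[3], coef[4]
print("vendor 2 vs vendor 1:", est_beta2)   # cost difference at identical attributes
print("vendor 3 vs vendor 1:", est_beta3)
```

With real quotes the estimates carry sampling error, so each beta should be read together with its standard error.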

A similar process can be used to interpret other dummy variables. So that was for a linear model but if the model form happens to be of a log-log type, the attributes and the dummy variables will be represented differently as shown below.

The cost models for the three vendors for a product with the same exact attributes in the log-log case would be as shown below.

The cost difference between vendors 1 and 2 can then be calculated using the transformations below. Also shown is the cost difference between vendors 2 and 3; it takes a slightly different form because neither vendor is the baseline, but the math is just as straightforward.
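A sketch of those steps, assuming a log-log model with two attributes, vendor 1 as the baseline and dummies D2 and D3 (the subscripts on C, which label the vendor, are introduced here for illustration):

```latex
\begin{aligned}
\ln C_1 &= \ln \alpha_0 + \alpha_1 \ln A_1 + \alpha_2 \ln A_2 \\
\ln C_2 &= \ln \alpha_0 + \alpha_1 \ln A_1 + \alpha_2 \ln A_2 + \beta_2 \\
\ln C_3 &= \ln \alpha_0 + \alpha_1 \ln A_1 + \alpha_2 \ln A_2 + \beta_3 \\[4pt]
\ln \frac{C_2}{C_1} &= \beta_2
  \quad\Longrightarrow\quad \frac{C_2 - C_1}{C_1} = e^{\beta_2} - 1 \\
\ln \frac{C_3}{C_2} &= \beta_3 - \beta_2
  \quad\Longrightarrow\quad \frac{C_3 - C_2}{C_2} = e^{\beta_3 - \beta_2} - 1
\end{aligned}
```

Multiplying either right-hand side by 100 gives the difference in percent; for small betas, e^beta − 1 is close to beta itself.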

The interpretation in the log-linear case for the dummy variables is precisely the same as the log-log case, and, as one can notice, the cost difference between vendors is expressed in percent form.

One last thing: To build some kind of automation in the system, the parameter estimates and the choice between the model forms can be left open to self-update and evolve as data from new RFQs trickle in. It would be difficult to completely automate this process because, with time, products and technology change, and the factors that contribute to the final cost will require reconsideration.
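A minimal sketch of the self-updating idea, under stated assumptions: the class name and its methods are hypothetical, the model form is fixed to log-log for brevity (re-selecting the form on each update is left out), and every refit simply reruns least squares on all quotes seen so far.

```python
import numpy as np

class SelfUpdatingCostModel:
    """Keeps every quote seen so far and refits a log-log model when new ones arrive."""

    def __init__(self):
        self.A = None      # continuous attributes, one row per quote
        self.D = None      # dummy variables, one row per quote
        self.C = None      # quoted costs
        self.coef = None   # latest estimates: [ln(alpha_0), alphas..., betas...]

    def add_quotes(self, A, D, C):
        """Append a batch of new RFQ responses and refit immediately."""
        A = np.atleast_2d(np.asarray(A, dtype=float))
        D = np.atleast_2d(np.asarray(D, dtype=float))
        C = np.atleast_1d(np.asarray(C, dtype=float))
        if self.A is None:
            self.A, self.D, self.C = A, D, C
        else:
            self.A = np.vstack([self.A, A])
            self.D = np.vstack([self.D, D])
            self.C = np.concatenate([self.C, C])
        self.refit()

    def refit(self):
        """Re-estimate the parameters on all stored quotes."""
        X = np.hstack([np.ones((len(self.C), 1)), np.log(self.A), self.D])
        self.coef, *_ = np.linalg.lstsq(X, np.log(self.C), rcond=None)

    def predict(self, A, D):
        """Predicted cost, in cost units, for new attribute and dummy rows."""
        A = np.atleast_2d(np.asarray(A, dtype=float))
        D = np.atleast_2d(np.asarray(D, dtype=float))
        X = np.hstack([np.ones((A.shape[0], 1)), np.log(A), D])
        return np.exp(X @ self.coef)

# Made-up, noise-free quotes arriving in two batches.
rng = np.random.default_rng(2)
A = rng.uniform(1, 50, size=(40, 2))
D = rng.integers(0, 2, size=(40, 1)).astype(float)
C = 3.0 * A[:, 0] ** 0.5 * A[:, 1] ** 0.2 * np.exp(0.1 * D[:, 0])

model = SelfUpdatingCostModel()
model.add_quotes(A[:20], D[:20], C[:20])   # first RFQ round
model.add_quotes(A[20:], D[20:], C[20:])   # estimates update as new data arrives
print(model.coef)
```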

So there you have it: a framework and a tool to help you understand the influence and contribution of each variable that is part of the final product, and to enable you to compare costs not only between the past and the present, but also between vendors and across attributes. This framework, implemented correctly, will have an immediate and direct impact on your bottom line. We guarantee it.