In this course, and in your careers, you will be faced with questions and with data. Your data workflow is how you apply the data to those questions.
We are going to review three workflows: the "Data Scientist's", the "Statistician's", and the "Managerial Accountant's".
Second Step: Statistical Modeling. Fit a model to the data.
Third Step: Use the graph to understand the model. Are there problems that we missed before? Can we make meaningful inferences? Can we use this model as a planning tool?
Fourth Step: Improve the model and gather more data
This is the cost object. You may need more than one, and you need at least one per cost center.
Cost center: a part of the company where costs are tracked independently. From the perspective of the cost system a cost center is a "firm-within-the-firm".
Consider a merger between the packaging and transportation divisions of the company. The easiest way to begin costing in the combined firm is to treat them as separate! This is the basic idea behind a cost center. As synergies arise, resources will be shared and we may begin to combine parts of the two companies.
These are the independent variables.
In stage 2 we focused on what drives cost. Here we are gathering information about the direction, magnitude, and variability of these relationships.
If some are mixtures of fixed and variable costs, separate them where possible.
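One simple way to separate a mixed cost when data are limited is the high-low method. Here is a minimal sketch, using hypothetical monthly figures:

```python
# High-low method: estimate the variable rate and fixed component of a
# mixed cost from the highest- and lowest-activity months (hypothetical data)
activity = [100, 140, 90, 120, 160, 110]    # e.g., machine hours per month
cost = [820, 940, 790, 880, 1000, 850]      # observed mixed cost per month

hi, lo = activity.index(max(activity)), activity.index(min(activity))
variable_rate = (cost[hi] - cost[lo]) / (activity[hi] - activity[lo])
fixed_cost = cost[hi] - variable_rate * activity[hi]
print(variable_rate, fixed_cost)            # 3.0 per hour variable, 520.0 fixed
```

Regression, which we use in the later steps, exploits all of the observations rather than just two, but the high-low split is a quick first cut.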
The first steps are management accounting, but this step is pure data science and statistics. That doesn't mean it's overly complicated: this step lets us make sure that our costing system is functioning the way that we think it should.
This can be as simple as dumping all the data you have into a spreadsheet or data frame and starting to plot things. This is what we did in the assignment!
# quick plotting examples here
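As a minimal sketch of this kind of first look (the file name and column names here are placeholders, not part of the assignment data):

```python
import pandas as pd
from plotnine import ggplot, aes, geom_point

# Dump everything into a data frame and start looking at it
df = pd.read_csv("monthly_costs.csv")   # hypothetical file
print(df.describe())                    # ranges, means, obvious anomalies
print(df.head())

# Scatter a candidate cost against a candidate driver
(ggplot(df, aes(x='Machine Hours', y='Total Cost')) + geom_point())
```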
This allows us to raise several questions:
If the allocation bases or cost-behavior assumptions are in dispute, explore how the alternatives affect rates. This often requires generating standard costs to explore the impact on product prices. This process leads to the next step.
# quick plotting examples here
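For example, a dispute over the allocation base can be made concrete by computing the rates each base implies. A sketch with hypothetical figures:

```python
# Compare the overhead rates implied by two candidate allocation bases
# (all figures hypothetical)
overhead_pool = 120_000        # total cost to be allocated
machine_hours = 8_000          # candidate base 1
direct_labor_hours = 15_000    # candidate base 2

rate_mh = overhead_pool / machine_hours         # 15.0 per machine hour
rate_dlh = overhead_pool / direct_labor_hours   # 8.0 per direct labor hour

# A product using 2 machine hours and 5 labor hours is costed very
# differently under the two bases
print(2 * rate_mh, 5 * rate_dlh)                # 30.0 vs. 40.0
```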
The parameters of the cost function will be used to generate budgets (these are called "standard costs").
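Concretely, once the intercept (fixed cost) and slope (variable rate) are estimated, a budget for a planned activity level is just the fitted value. A minimal sketch with hypothetical parameters:

```python
# Flexible budget / standard cost from an estimated cost function
# (parameter values hypothetical)
fixed_cost = 1_200        # estimated intercept, a
variable_rate = 14.50     # estimated slope, b (cost per machine hour)

planned_hours = 60
budgeted_cost = fixed_cost + variable_rate * planned_hours
print(budgeted_cost)      # 2070.0
```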
Before settling on a final cost system, test the standard costs that arise from merging cost centers (e.g., departments, plants, etc.). This often results in a much simplified design. When establishing or reviewing a cost system, the highest level of disaggregation may be used to test the accuracy of these aggregate designs.
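One way to run this test is to compare the rate from a merged pool against the separate centers' rates and ask whether reported product costs change materially. A sketch with hypothetical figures:

```python
# Does merging two cost centers materially change the allocation rate?
# (all figures hypothetical)
dept_a = {"cost": 90_000, "hours": 6_000}   # separate rate: 15.0
dept_b = {"cost": 30_000, "hours": 3_000}   # separate rate: 10.0

merged_rate = (dept_a["cost"] + dept_b["cost"]) / (dept_a["hours"] + dept_b["hours"])
print(merged_rate)   # about 13.33, vs. 15.0 and 10.0 separately

# If products consume the two departments in similar proportions, the
# merged rate may be accurate enough; if not, keep the centers separate.
```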
Stages 4 & 5 developed various alternative approaches for the firm. Here we weigh the costs and benefits of the systems. Remember, complexity may be fun to develop and explore, but it is not as fun to maintain and use!
Trevor Kennedy, the cost analyst at United Packaging's can manufacturing plant, is seeking to develop a cost function (or functions) that relates engineering support (ES) costs to machine hours. These costs have two components: (1) labor (which is paid monthly) and (2) materials and parts (which are purchased from an outside vendor every three months). He collects the following monthly data from the accounting records kept at the factory:
Month | Labor | Materials | Machine Hours |
---|---|---|---|
March | \$347 | \$847 | 30 |
April | 521 | 0 | 63 |
May | 398 | 0 | 49 |
June | 355 | 961 | 38 |
July | 473 | 0 | 57 |
August | 617 | 0 | 73 |
September | 245 | 821 | 19 |
October | 487 | 0 | 53 |
November | 431 | 0 | 42 |
 | Dependent Variable (y-axis) | Cost Driver (x-axis) |
---|---|---|
(i) | Labor costs | Machine Hours |
(ii) | Materials costs | Machine Hours |
(iii) | Total costs | Machine Hours |
You can think of these plots as plotting cost as a function of machine hours. So the data are telling us something about three cost functions.
import pandas as pd
from plotnine import *

# Load the monthly ES cost data and compute total cost
p2df = pd.read_csv("data2p.csv")
p2df['Total Cost'] = p2df['Labor'] + p2df['Materials']
p2df.columns
Index(['Month', 'Labor', 'Materials', 'Machine Hours', 'Total Cost'], dtype='object')
(ggplot(
p2df, # this is the dataset
aes(x='Machine Hours', y='Labor')) + # aes allows us to specify the vars
geom_point(colour="red") # + stat_smooth(method='ols')
)
(ggplot(
p2df, aes(x='Machine Hours', y='Materials')) +
geom_point(colour="red") # + stat_smooth(method='ols')
)
(ggplot(
p2df, aes(x='Machine Hours', y='Total Cost')) +
geom_point(colour="red") # + stat_smooth(method='ols')
)
# plotting example in Excel here
from scipy import stats

# Regress cost on machine hours: Total Cost = alpha + beta * Machine Hours.
# Note the argument order: linregress takes (x, y), so machine hours go first,
# and the third return value is the correlation r, not r-squared.
beta, alpha, r_value, p_value, std_err = stats.linregress(p2df['Machine Hours'], p2df['Total Cost'])
(round(beta, 3), round(alpha, 2), round(r_value, 3), round(p_value, 3), round(std_err, 3))
(-14.235, 1393.19, -0.658, 0.054, 6.151)
First, and most importantly, the materials data has a problem! Whoever reported the data entered '0' when they should have reported nothing: materials are purchased only every third month, so the zeros in the intervening months are not measurements of monthly materials consumption.
It is worth thinking about what resources we sacrifice, and when, and how this relates to the cost system that we are building here. For example, we know that we sacrifice cash every three months in exchange for materials, but we don't know what we do with those materials. Do we consume them immediately? If we consume them immediately, then when do they generate revenue? Depending on how this plays out, we would want to gather more information to develop more informative data.
One potential solution would be to use a different driver for materials. For example, the number of cans used seems like a natural alternative.
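If that data were collected, testing the alternative driver would follow the same pattern as above. A sketch, where the 'Cans Used' column is hypothetical and the months are aggregated to quarters so that the every-three-months purchases line up with the activity they support:

```python
from scipy import stats

# Hypothetical: suppose a 'Cans Used' column were added to p2df.
# Sum to quarters so quarterly materials purchases match quarterly use.
quarterly = p2df.groupby(p2df.index // 3)[['Materials', 'Cans Used']].sum()
slope, intercept, r_value, p_value, std_err = stats.linregress(
    quarterly['Cans Used'], quarterly['Materials'])
print(slope, r_value)   # compare against the machine-hours fit above
```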
Please keep in mind that this is an open-ended question and I have only provided one potential answer.
Robin Green, financial analyst at Central Railroad, is examining the behavior of monthly transportation costs for budgeting purposes. Transportation costs at Central Railroad are the sum of two types of costs: (a) operating costs (labor, fuel, and so on), and (b) maintenance costs (overhaul of engines and track, and so on). Green collects monthly data on (a), (b), and track miles hauled for that month. Track miles hauled are the miles clocked by the engine that pulls the rail carriages. Monthly observations for the most recent year are:
Month | Operating Costs | Maintenance Costs | Miles |
---|---|---|---|
January | \$471 | \$437 | 3,420 |
February | 504 | 338 | 5,310 |
March | 609 | 343 | 5,410 |
April | 690 | 347 | 8,440 |
May | 742 | 294 | 9,320 |
June | 774 | 211 | 8,910 |
July | 784 | 176 | 8,870 |
August | 986 | 210 | 10,980 |
September | 895 | 282 | 4,980 |
October | 651 | 394 | 5,220 |
November | 481 | 381 | 4,480 |
December | 386 | 514 | 2,980 |
Central Railroad earns its greatest revenues carrying agricultural commodities such as wheat and barley.
(i) Operating costs = a + b (Track miles hauled)
(ii) Maintenance costs = a + b (Track miles hauled)
(iii) Total transportation costs = a + b (Track miles hauled)
Comment on the patterns in the three plots.
# plotting examples
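A minimal sketch of the three plots, assuming the monthly table above has been saved to a CSV (the file name and exact column names are assumptions):

```python
import pandas as pd
from plotnine import ggplot, aes, geom_point

# Load the railroad data and build total transportation cost
rrdf = pd.read_csv("central_railroad.csv")   # hypothetical file name
rrdf['Total Costs'] = rrdf['Operating Costs'] + rrdf['Maintenance Costs']

# One scatter per cost function, each against track miles hauled
(ggplot(rrdf, aes(x='Miles', y='Operating Costs')) + geom_point(colour="red"))
(ggplot(rrdf, aes(x='Miles', y='Maintenance Costs')) + geom_point(colour="red"))
(ggplot(rrdf, aes(x='Miles', y='Total Costs')) + geom_point(colour="red"))
```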
Miles seems to yield a reasonable model of operating costs. Most of the data are close to the line, and the distribution seems stable over the observations that we have. The September observation is interesting; we should find out what caused it. Do we think this was a one-time event? Or is there potential that this is an underlying pattern?
Maintenance costs are decreasing as miles increase. This suggests that there is something missing from our model; for example, maintenance may be deliberately scheduled for low-traffic months, so this month's miles may drive maintenance in later months. Questions that you could consider:
A final note:
Each variable should enter the model only once. This creates a problem when two measures are highly correlated, or when two variables measure the same thing in slightly different ways.
You don't have to discuss them all, and they don't all apply. Think about which problems are most likely to be active in our setting and how the workflows we discussed can help us address these issues.
# plotting examples
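Beyond plotting, the three cost functions (i)-(iii) can be fit directly, reusing the rrdf frame loaded above:

```python
from scipy import stats

# Fit each cost function: cost = a + b * (track miles hauled)
for cost in ['Operating Costs', 'Maintenance Costs', 'Total Costs']:
    b, a, r, p, se = stats.linregress(rrdf['Miles'], rrdf[cost])
    print(f"{cost}: a = {a:.1f}, b = {b:.4f}, r = {r:.2f}")
```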