Here are 3 tips you should keep in mind when working with Redshift as your team's data-warehouse:
1) Primary key and Foreign Key Constraints are not enforced
Redshift uses primary and foreign keys only as planning hints for certain statistical computations; it does not enforce them. Your ETL logic or source tables should enforce these constraints instead. Redshift is essentially a distributed column store, and a big reason it can batch writes and run massive selects and joins so quickly is that it doesn't need to maintain uniqueness constraints from primary keys and indexes.
2) SQL Three-Valued Logic
This one isn't specific to Redshift, but is more of a tricky SQL gotcha. Let's say you have this sample query:
SELECT * from some_table WHERE name NOT IN ('a', 'b', 'c')
where name is nullable. Would you expect this to return rows where name is NULL? It will not, due to three-valued logic. NULL is a special value designating that data is absent. In the context of a boolean expression, if we have the expression:
NULL != 'a'
three-valued logic defines the result as unknown, since we don't actually know the value of the data. WHERE clauses only keep rows whose conditions evaluate to true.
The correct query would then be:
SELECT * from some_table WHERE name IS NULL OR name NOT IN ('a', 'b', 'c')
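To see the gotcha end to end, here's a minimal sketch against an in-memory SQLite database, which follows the same three-valued logic (the table and its rows are made up for illustration):

```python
import sqlite3

# toy table: two real names and one NULL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (name TEXT)")
conn.executemany("INSERT INTO some_table VALUES (?)", [("a",), ("d",), (None,)])

# the naive query silently drops the NULL row
rows = conn.execute(
    "SELECT name FROM some_table WHERE name NOT IN ('a','b','c')").fetchall()
print(rows)  # [('d',)]

# the corrected query keeps it
rows = conn.execute(
    "SELECT name FROM some_table WHERE name IS NULL "
    "OR name NOT IN ('a','b','c')").fetchall()
print(rows)  # [('d',), (None,)]
```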
3) ISO-8601 and Timestamp Precision
An easy mistake to make when storing timestamps is to assume that storing them as ISO-8601 strings will automatically make them convertible between different databases, languages, etc.
For example, the Go standard library supports nanosecond precision:
package main

import (
    "fmt"
    "time"
)

func main() {
    t := "2020-07-01T01:23:45.999999999Z"
    res, err := time.Parse(time.RFC3339Nano, t)
    if err == nil {
        fmt.Println(res)
    }
}
2020-07-01 01:23:45.999999999 +0000 UTC
Program exited.
If you stored this into Redshift and tried to cast it back into a DATE or TIMESTAMP with something like created_at::date, you'd get the following error:
org.postgresql.util.PSQLException: ERROR: date/time field value out of range
Looking at AWS's documentation, we see that Redshift only supports microsecond precision!
To prevent issues like this, teams should ensure that their ETL code understands the source table schemas and does the appropriate transformations. Specifically for DynamoDB sources, even though DynamoDB itself only supports microsecond resolution, you should defensively check the string's precision with an rtrim, or use to_date(created_at, 'YYYY-MM-DDTHH:MI:SS').
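As a defensive sketch (the helper name and approach are my own, not from any Redshift tooling), you could trim a nanosecond ISO-8601 string down to the microsecond precision Redshift accepts before loading:

```python
def truncate_to_micros(ts: str) -> str:
    # hypothetical helper: trim the fractional seconds of a UTC ISO-8601
    # string to at most 6 digits (microseconds)
    if "." in ts and ts.endswith("Z"):
        base, frac = ts[:-1].split(".")
        return "{}.{}Z".format(base, frac[:6])
    return ts

print(truncate_to_micros("2020-07-01T01:23:45.999999999Z"))
# 2020-07-01T01:23:45.999999Z
```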
Teams should also assess what precision their data actually needs. In most cases, it's extremely rare that your system will truly need to measure time at nanosecond precision. At those levels, you're at the mercy of system clock errors, clock drift, etc.
A lot of the cryptographic software we use relies on hashing and its uniqueness guarantees to provide core functionality: authentication, message integrity, message fingerprinting, data corruption detection, etc. Guaranteeing this uniqueness, however, is fundamentally impossible, by the Pigeonhole Principle, whenever your input data is larger than the maximum size of your key space.
At some point you're guaranteed to have two different sets of data that will map to the same key. So just how frequently will these types of collisions occur?
Given n randomly generated values, where each is a k-bit integer, what is the probability that any two of them are equal?
To calculate this probability, it might be simpler to first look at the probability of no collisions happening, since:
$$P(Any~Collision) = 1 - P(No~Collisions)$$
If we have a k-bit integer, that means it can take on $N = 2^k$ distinct values. The probability of a second number not colliding with the first would be:
$$\frac{N-1}{N}$$
since $N-1$ of the possible values remain unique. If we generate another value, the probability becomes:
$$\frac{N-1}{N} \times \frac{N-2}{N}$$
by the multiplicative rule of probability. Extending this to $n$ different values, we get:
$$\frac{N-1}{N} \times \frac{N-2}{N} \times \dots \times \frac{N-(n-1)}{N}$$
Rewriting each factor as $1 - \frac{j}{N}$ to simplify:
$$\left(1-\frac{1}{N}\right) \times \left(1-\frac{2}{N}\right) \times \dots \times \left(1-\frac{n-1}{N}\right)$$
Since $N \gg n$, we can apply the approximation $1-x \approx e^{-x}$:
$$e^{-1/N} \times e^{-2/N} \times \dots \times e^{-(n-1)/N}$$
Simplifying the products we get:
$$e^{-\frac{1}{N}\sum_{j=1}^{n-1}j}$$
And reducing the sum we get:
$$e^{\frac{-n(n-1)}{2N}}$$
So now that we have a usable approximation for the probability of no hash collision, the probability that a collision does happen becomes:
$$P(Any~Collision) = 1 - e^{\frac{-n(n-1)}{2N}}$$
With this we can get a general sense of just how often we can expect a collision given a certain value size.
Bits | Output Size | # of hashes for a 50% chance of collision
---|---|---
16 | ~6.5 × 10^4 | 300
32 | ~4.3 × 10^9 | 77,000
64 | ~1.8 × 10^19 | 5.1 × 10^9
128 | ~3.4 × 10^38 | 2.2 × 10^19
256 | ~1.2 × 10^77 | 4.0 × 10^38
512 | ~1.3 × 10^154 | 1.4 × 10^77
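The 50% column can be reproduced from the approximation above: setting $1 - e^{-n(n-1)/2N} = \frac{1}{2}$ and solving for large $n$ gives $n \approx \sqrt{2N\ln 2}$. A quick sketch:

```python
import math

def hashes_for_half_collision(bits):
    # solve 1 - exp(-n^2 / (2N)) = 1/2  =>  n = sqrt(2 * N * ln 2)
    N = 2 ** bits
    return math.sqrt(2 * N * math.log(2))

for bits in (16, 32, 64, 128, 256, 512):
    print(bits, "{:.2g}".format(hashes_for_half_collision(bits)))
```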
Another FiveThirtyEight Riddler. The President has bestowed upon you $1 billion with the mission of getting us to an alien artifact as fast as possible! Here's what's available to you:
Big Russian engines costing 400 million each. Buying one will reduce the trip time by 200 days. Buying two will save another 100 days.
NASA ion engines. There are only eight of these large-scale engines in the world. For each $150 million fully fueled xenon engine you buy, you can take 50 days off of the trip.
Light payloads sent ahead of time. For $50 million each, you lighten the main mission and reduce the arrival time by 25 days.
What's the best strategy?
This boils down to an integer programming problem: we have several integer-valued variables, several constraints (in our case, the $1 billion budget), and an objective function we want to maximize (in our case, the number of days saved). Let $x_0$ represent buying one Russian engine, $x_1$ buying two, $x_2$ the number of ion engines, and $x_3$ the number of light payloads. So we get two equations:
Budget Constraint:
$$400x_0+800x_1+150x_2+50x_3 \leq 1000$$
Days saved, to maximize:
$$ max(200x_0 + 300x_1 + 50x_2 + 25x_3) $$
And scipy has a solver!
from scipy.optimize import linprog

c = [-200, -300, -50, -25]   # linprog minimizes, so negate the days saved
A = [[400, 800, 150, 50]]    # A_ub must be two-dimensional
b = [1000]

x0_bounds = (0, 1)
x1_bounds = (0, 1)
x2_bounds = (0, 8)
x3_bounds = (0, None)

res = linprog(
    c,
    A_ub=A,
    b_ub=b,
    bounds=(x0_bounds, x1_bounds, x2_bounds, x3_bounds),
    options={"disp": True},
)
print(res)
Optimization terminated successfully.
Current function value: -500.000000
Iterations: 4
fun: -500.0
message: 'Optimization terminated successfully.'
nit: 4
slack: array([ 0., 0., 1., 8.])
status: 0
success: True
x: array([ 1., 0., 0., 12.])
So the solution is to buy one Russian rocket and send out 12 light payloads ahead of time, and we save 500 days!
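Since linprog solves the LP relaxation rather than enforcing integrality, it's worth double-checking with a brute-force enumeration over the integer choices (my own sanity check, not part of the original solution):

```python
# enumerate all feasible integer purchases within the $1B budget
candidates = [
    (200 * x0 + 300 * x1 + 50 * x2 + 25 * x3, (x0, x1, x2, x3))
    for x0 in range(2)    # buy one Russian engine
    for x1 in range(2)    # buy two Russian engines instead
    for x2 in range(9)    # up to eight ion engines exist
    for x3 in range(21)   # 20 light payloads already exhaust the budget
    if 400 * x0 + 800 * x1 + 150 * x2 + 50 * x3 <= 1000
]
best_days = max(days for days, _ in candidates)
print(best_days)  # 500
```

Interestingly, sending 20 light payloads alone also reaches 500 days for the same $1 billion, so the optimum isn't unique.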
Another weekly FiveThirtyEight riddler, and this one's a mind-bender.
Three very skilled logicians are sitting around a table — Barack, Pete and Susan. Barack says: “I’m thinking of two natural numbers between 1 and 9, inclusive. I’ve written the product of these two numbers on this paper that I’m giving to you, Pete. I’ve written the sum of the two numbers on this paper that I’m giving to you, Susan. Now Pete, looking at your paper, do you know which numbers I’m thinking of?”
Pete looks at his paper and says: “No, I don’t.”
Barack turns to Susan and asks: “Susan, do you know which numbers I’m thinking of?” Susan looks at her paper and says: “No.”
Barack turns back to Pete and asks: “How about now? Do you know?”
“No, I still don’t,” Pete says.
Barack keeps going back and forth, and when he asks Pete for the fifth time, Pete says: “Yes, now I know!”
First, what are the two numbers? Second, if Pete had said no the fifth time, would Susan have said yes or no at her fifth turn?
It appears that they both shouldn't have enough information to deduce what the two numbers are. How is it possible that only after five turns, they could exchange enough information to pinpoint the number pair?
When Pete first says 'I don't know', that means the product he's been given isn't unique in the space of all possible products of pairs from [1,9]. So everyone can eliminate all pairs that have unique products.
Susan, following the same logic, also doesn't see a unique sum, so everyone can eliminate all pairs that have unique sums.
Iterate this for 5 turns and we get the following logic:
from collections import Counter

lower_bound = 1
upper_bound = 9
remaining_pairs = set((a, b) for a in range(lower_bound, upper_bound + 1)
                             for b in range(a, upper_bound + 1))

for i in range(4):
    print("Pete - 'I don't know'")
    prod_counts = Counter(a * b for a, b in remaining_pairs)
    remaining_pairs = set((a, b) for a, b in remaining_pairs if prod_counts[a * b] > 1)
    print("Susan - 'I don't know'")
    sum_counts = Counter(a + b for a, b in remaining_pairs)
    remaining_pairs = set((a, b) for a, b in remaining_pairs if sum_counts[a + b] > 1)

prod_counts = Counter(a * b for a, b in remaining_pairs)
print("Pete - 'I know it's: {}'".format(
    set((a, b) for a, b in remaining_pairs if prod_counts[a * b] == 1)))
Running this we get:
Pete - 'I don't know'
Susan - 'I don't know'
Pete - 'I don't know'
Susan - 'I don't know'
Pete - 'I don't know'
Susan - 'I don't know'
Pete - 'I don't know'
Susan - 'I don't know'
Pete - 'I know it's: {(2, 8)}'
And in the case where Pete doesn't know in round 5, Susan also won't know what the pair is.
The solution for this week's FiveThirtyEight riddler is pretty nutty.
The riddle is essentially:
If you drew random numbers from a uniform distribution from [0,1], what is the expected number of draws
you would perform until their sum exceeds 1?
So let us define $S_{n}$ as the sum of the first $n$ numbers drawn. Then the probability of the first sum $S_{1}$ being at most some value $x$ (for $0 \leq x \leq 1$) is:
$$P(S_{1}\leq x) = x$$
Then the probability of the second sum $S_{2}$ being at most $x$ can be calculated by conditioning on the value $t$ of the second draw and integrating over its density (which is 1 on [0,1]):
$$P(S_{2}\leq x) =\int_0^x{P(S_{1}\leq x-t)}\ dt$$
$$ =\int_0^x (x-t)\ dt$$
$$ =\frac{x^2}{2}$$
The probability for the $3$rd is:
$$P(S_{3}\leq x) =\int_0^x{P(S_{2}\leq x-t)}\ dt$$
$$ =\int_0^x \frac{(x-t)^2}{2}\ dt$$
$$ =\frac{x^3}{6}$$
Extending to $n$:
$$P(S_{n}\leq x) =\frac{x^n}{n!}$$
Since we're concerned with the sum exceeding $x=1$:
$$P(S_{n}\leq 1) =\frac{1}{n!}$$
To find the probability that exactly $n$ draws are needed, we take the probability that the first $n-1$ draws still sum to at most 1, and subtract the probability that the first $n$ draws do:
$$P(N = n) = \frac{1}{(n-1)!} - \frac{1}{n!}$$
Then we can calculate the expectation over that distribution:
$$E[N] = \sum_{n=2}^\infty n \cdot \left(\frac{1}{(n-1)!} - \frac{1}{n!}\right)$$
$$ = \sum_{n=2}^\infty \left(\frac{n}{(n-1)!} - \frac{1}{(n-1)!}\right)$$
$$ = \sum_{n=2}^\infty \frac{1}{(n-2)!}$$
$$ = \sum_{n=0}^\infty \frac{1}{n!}$$
$$ = e$$
The expected number of draws you need to exceed 1 is $e$!
The convergence to $e$ does make some intuitive sense. You'll almost assuredly never bust on the first draw, so the minimum # of draws you'll have to make is at least two. Since we're drawing from a uniform distribution from [0,1], we have a mean of $\frac{1}{2}$. 2x the mean gets us to 1, and 3x the mean moves us to 1.5, so a value in between 2-3 fits the bill.
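A quick Monte Carlo simulation (my own check, not part of the riddle) backs this up:

```python
import random

def draws_to_exceed_one(rng):
    # draw uniform [0,1] values until the running sum exceeds 1
    total, n = 0.0, 0
    while total <= 1.0:
        total += rng.random()
        n += 1
    return n

rng = random.Random(42)
trials = 200_000
avg = sum(draws_to_exceed_one(rng) for _ in range(trials)) / trials
print(avg)  # close to e = 2.71828...
```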
Since the 2013-14 season, the NBA has installed special cameras in every NBA stadium allowing teams to track every single player and ball movement, and digitize it into data that their data scientists can crunch into ever more sophisticated metrics on team and player performance.
What's even more amazing is that a lot of this data is now publicly accessible via stats.nba.com.
So how do we get to the data? Screen scraping? Nope! If you pull up your browser's debug console and look at the network traffic, you'll find that stats.nba.com has very kindly exposed the JSON endpoints behind their Angular app. For example, all common info for Kobe Bryant for the 2013-14 season:
http://stats.nba.com/stats/commonplayerinfo?LeagueID=00&PlayerID=977&SeasonType=Regular+Season&Season=2013-14
Response:
{"resource":"commonplayerinfo","parameters":[{"PlayerID":977},{"LeagueID":"00"}],"resultSets":[{"name":"CommonPlayerInfo","headers":["PERSON_ID","FIRST_NAME","LAST_NAME","DISPLAY_FIRST_LAST","DISPLAY_LAST_COMMA_FIRST","DISPLAY_FI_LAST","BIRTHDATE","SCHOOL","COUNTRY","LAST_AFFILIATION","HEIGHT","WEIGHT","SEASON_EXP","JERSEY","POSITION","ROSTERSTATUS","TEAM_ID","TEAM_NAME","TEAM_ABBREVIATION","TEAM_CODE","TEAM_CITY","PLAYERCODE","FROM_YEAR","TO_YEAR","DLEAGUE_FLAG"],"rowSet":[[977,"Kobe","Bryant","Kobe Bryant","Bryant, Kobe","K. Bryant","1978-08-23T00:00:00","Lower Merion HS (PA)","USA","Lower Merion HS (PA)/USA","6-6","212",18,"24","Guard","Active",1610612747,"Lakers","LAL","lakers","Los Angeles","kobe_bryant","1996","2014","N"]]},{"name":"PlayerHeadlineStats","headers":["PLAYER_ID","PLAYER_NAME","TimeFrame","PTS","AST","REB","PIE"],"rowSet":[[977,"Kobe Bryant","2014-15",26.4,4.1,5.1,0.119]]}]}
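The response stores parallel header and rowSet arrays rather than keyed objects, so it helps to zip them together. A sketch, using an abridged copy of the payload above so it runs offline:

```python
import json

# abridged version of the commonplayerinfo response shown above;
# the same parsing works on the full payload
payload = json.loads(
    '{"resource":"commonplayerinfo","resultSets":[{"name":"CommonPlayerInfo",'
    '"headers":["PERSON_ID","DISPLAY_FIRST_LAST","TEAM_NAME"],'
    '"rowSet":[[977,"Kobe Bryant","Lakers"]]}]}'
)

result = payload["resultSets"][0]
# zip the headers onto each row to get normal dictionaries
rows = [dict(zip(result["headers"], row)) for row in result["rowSet"]]
print(rows[0]["DISPLAY_FIRST_LAST"])  # Kobe Bryant
```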
They even expose movement data for each "play" in a game.
http://stats.nba.com/stats/locations_getmoments/?eventid={}&gameid={}
Response:
"{"moments": [[1, 1415407240007, 719.12, 24.0, null, [[-1, -1, 48.30107, 33.09853, 10.3485], [1610612761, 200768, 76.01662, 25.29939, 0.0], [1610612761, 101161, 57.58085, 29.34887, 0.0], [1610612761, 201942, 50.654, 33.00764, 0.0], [1610612761, 203082, 51.8 (...)"
With this data we can do some cool things, like plot a density plot of the locations players tend to occupy. For example, let's look at this Wizards vs. Raptors game from Nov 7, 2014.
Demar Derozan
Kyle Lowry
As a point guard, his movement probably has a lot of variability, since he has to initiate the play.
Amir Johnson
A lot of movement down in the post and mid-range areas, as a power forward's should be.
Terence Ross
Ross loves his corner 3s
Lou Williams
Jonas Valanciunas
Centers have a pretty simple movement pattern: move from baseline to baseline
So here's something cool. This formula:
$$\frac{1}{2} < \lfloor mod(\lfloor\frac{y}{17}\rfloor2^{-17\lfloor x \rfloor -mod(\lfloor y\rfloor,17)},2)\rfloor$$
will literally plot itself if you look at the right region, the strip $k \le y < k + 17$, where k is this 543-digit integer:
960 939 379 918 958 884 971 672 962 127 852 754 715 004 339 660 129 306 651 505 519 271 702 802 395 266 424 689 642 842 174 350 718 121 267 153 782 770 623 355 993 237 280 874 144 307 891 325 963 941 337 723 487 857 735 749 823 926 629 715 517 173 716 995 165 232 890 538 221 612 403 238 855 866 184 013 235 585 136 048 828 693 337 902 491 454 229 288 667 081 096 184 496 091 705 183 454 067 827 731 551 705 405 381 627 380 967 602 565 625 016 981 482 083 418 783 163 849 115 590 225 610 003 652 351 370 343 874 461 848 378 737 238 198 224 849 863 465 033 159 410 054 974 700 593 138 339 226 497 249 461 751 545 728 366 702 369 745 461 014 655 997 933 798 537 483 143 786 841 806 593 422 227 898 388 722 980 000 748 404 719
So how exactly does this work? Let's try to decipher what's happening in the inequality.
Saying the floor of something is greater than a half just means the thing is at least one:
$$1 \leq mod(\lfloor\frac{y}{17}\rfloor2^{-17\lfloor x \rfloor -mod(\lfloor y\rfloor,17)},2)$$
If we let $y = 17q + r$ where $0 \leq r < 17$, the expression simplifies to:
$$1 \leq mod(\frac{q}{2^{17x+r}},2)$$
Asking whether a number modulo 2 is at least 1 amounts to asking whether its floor is odd, so the inequality we had before amounts to asking if $\lfloor \frac{q}{2^{17x+r}} \rfloor$ is odd.
Now if you notice, our expression is dividing $q$ by a power of two, which, if you're familiar with binary arithmetic, equates to asking if the $(17x+r)$-th bit of $q$ is a 1.
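That divide-and-mod trick is just bit extraction; a concrete check:

```python
# floor(q / 2**k) % 2 reads the k-th bit of q, same as shift-and-mask
q = 0b101101  # 45
for k in range(6):
    assert (q // 2 ** k) % 2 == (q >> k) & 1

print([(q // 2 ** k) % 2 for k in range(6)])  # [1, 0, 1, 1, 0, 1]
```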
So $q$ is essentially just the bits of the image we want to display, and the inequality is just a way of mapping the bits of $q$ to positions $(x,r)$ on the graph.
With an understanding of how the formula works, we can derive a more general formula for an arbitrary resolution, and find an N for any image of our choosing.
$$\frac{1}{2} < \lfloor mod(\lfloor\frac{y}{H}\rfloor2^{-H\lfloor x \rfloor -mod(\lfloor y\rfloor,H)},2)\rfloor$$
# Load a .png with PIL, binarize it, and encode the bitmap as an integer N
from PIL import Image
import numpy as np

col = Image.open("/Users/benjaminyu/Pictures/snoo.png")
gray = col.convert('L')
gray = gray.resize((128, int(float(gray.size[1]) * (128.0 / float(gray.size[0])))), Image.ANTIALIAS)

# Let numpy do the heavy lifting for converting pixels to pure black or white
bw = np.asarray(gray).copy()
# Pixel range is 0...255, 256/2 = 128
bw[bw < 128] = 1   # dark pixels become 1 (drawn)
bw[bw >= 128] = 0  # light pixels become 0 (background)

# Read the bits column by column, bottom to top, then scale by the height
N = int("".join(str(x) for x in np.ravel(np.flipud(bw.T))), 2) * bw.shape[0]
print(N)
So we can encode any arbitrary picture into a number N:
N = 2934623275416912475030491109196223507870408304003274787529399398373409470947725212282507513985364121401993299590057603536651058434430428000713934034262420539588839714855313659476041329252205216284053036373985592156078116097862529167048970032268372106689275943029232884582183802275608406984310060487284077383545964976626049402237758337424825667882109775758119069759711086159475402546442908115604247479467761707595602823277222590174148061721144661562877935837684987331648950202379210419761418529491052519903672658483467127112354245170241000511651371573682684652138489833128947164860555689296831469585401404266519623158869788893484136525119276348088537397595057764763706428854503607822157361565041165736407127934729927206490929693840877341534651602519824327919988642906846941938143188348416246303450729216256492285763699631779017092550576801609733463569141992832078904210560389835678854985222553314212302157089647040770218576400856148712692549612864827859634848567874230553830231008608480551526185331610884023950473395627404510338247990509604688305502857258206169437219817544147597583712195213777653405268756727571883678689585163327871720739436240460669418300306624465337083345340464995381274757531302248790539716766397711852327302990770383523606908739248026364599355756834351397558487059050684928260928838432222518793427292433951013483978017470108122865208180548821988646986231703540291159558903789235893979713420832755414423733836281551853729042926352142592526095601876141580214105543791390293602903481652256017836803468399539265356189557558771422077488754514918578855210716867823312306187344566890350234802433750468728689432768555311875399950666919244002684268574753407813634761601994474156237457500418442033482936308192038219163087024314080549797579988103924908630687122751619154127395573557908674276686814377965651263181751997208887462721343142724195461589133490301517242186511008672593881507803523441566952187960237515387625734553156736149727199481694354773270128522179866731509657925460090139
52887445827524442943089245294298175607649688544156751487902736169281784232146458445928017332414190850449115694359427677110372283407680820175268672608556007188587298422308768269692057557045644265547722503997672749519991123822482907239478354305087315028065059181943153327236750587715802072758432698088347718092887460539135081702475987130940408538798940532971595715032497907153558996669621346442308711898148099910509753529394266180651150359347351372411687758544075118551625375573779516278601781293157416159260676753571407421559882168659329353293296026443763555876597638957687750930534022175897306508919641583749844549740852515648988621281368424372432372500468582077001925542243990993072726756127316906787867479427301005612523411353098733037684626107902827038044477183209946225035943040827639822072007427455963798480476043607829879634303508795936264300544099764552357981046497393712668269454008357485814176750813477322794206867414803167013171793141868915236358650090053568005112741553399125140391954866108183489905209933709819439219738493518466302654626234406898680175636680852719496964904975513671236095477431496462540593467897271283959047057821181754642493863484928984785476897407548162960660107370044695147588707634544729591839409151268219483031252081899668768938550571656467225099330845189679864508779430145473617953193375257598018005309526202611243010955308816867415857121257390361990131463651643837680272117651698508845007863427254820872783839483619526900960316104500784522262598168370738891429761336977895455646496966611472989003981678800861198775114928408274655011276620831205220955981761196958832355640463324706996172980177431569572444006239674525067540193609777379521330533672264323374056798517656017430044583286564023838267537327427103760603985810668289469645027922109733657659388703398297703344505070138349287877835554846365754723093391700341211767422116403696137831748490351166642998599260072213524215367809747544544755910322364149422049596991959761535225871018512839022861054947498119608149
18162442191845917289577456448155454228962760541891624689512901853578533616104302590871619864860534192302937662305608124279828067437501375619415346726660114385612636432053879155975055010123178966589024354727129207870936161468128467542475616706730982566684110557234837236920845672642750935280438264216423415266437982013641099532659022477973322338815123768769057754918325415806804621106319624116049607765729872468417505757864007489309678681586550638701791564338502870702231908247877693757341728511999642694510357409792895660798824012732232613100670619154766328545841967078989105274791636636452951742863263635195258155721975297282873756008078526196550345976053538679242971401236228050703819442056505137164542901348058324434444511571843114199101568684092895546838079831102492427413759742403593698823599060257746297417388317390822425742438358912885834038632495910344783330384740169800474118316322342015579063267050106708431575728607899620760040948630295626862713243136360414757313356244391517846559472594523187400128321094943581738666058479310292707644521846192984354603417182977011754828145671280985090206236548637391972285399464499255778925946677831940191741401772240539064920278140084794650637007008201487517681396569864768159025971855821462847270706059780788945118239454883484452768273661566703037555074741154335824112610566792623687627510231624490864317592262665311740497581645773526322339102844337580427439810998508165447901916589431842527927073493251626139136955607349797468783020879273164585772894801399552698822122452032991846941251653427331378413496710201997342601114201865344720781969430048210113815580400426612941674557493408163560283068825910784443736815672417402906055661313050205330595571265607154998398847682960196244886704650529718627171865894176289540211854836422085279029599903882602578839191794520131108129923963849699655733330539505766619743268837127853633155443239130731094458360715723474800371169952007919943926021327541181354079740287785619162934689635007280981083278216440684986543
95825016124076034162558054052345352347491282610738480134818886509558024740458088584546601999237568158790446312078354699975979358000396979699037944957751572234892616739917843244132366528868317702223416594649750526197990843458577361927168473772498652415698180679926392669938648987073048836151249093833335193839379728878750036316072143758592955863478336095411405652084454602034006507413616883789500875620245129785602008478364101446537874787569975931670692042282785420068746656178574460092482966788307063453556499807124890274331421456245864188978294849283300156356709334733056597424787335569463024929057934816845299878638415355316969026583608608203385988139253070906177042047230251597354265667817279953753889216339598949104763900051731096838710295656982208543112900311584701675361794925082663270423598807715933774020079490898414570864642092474729075048448L
And now plotting it:
import matplotlib.pyplot as plt

H = bw.shape[0]
W = bw.shape[1]

def tupper(x, y):
    return 0.5 < ((y // H) // (2 ** (H * x + y % H))) % 2

plt.rc('patch', antialiased=False)
for x in range(W):
    for yy in range(H):
        y = N + yy
        if tupper(x, y):
            plt.bar(left=x, bottom=yy, height=1, width=1, linewidth=0, color='black')
plt.axis('scaled')
I'm not sure why I find this so surprising, but it's a somewhat trivial task to write a program that can spew out somewhat coherent sentences, as long as you have a large corpus to work off.
Markov Chains
Markov Chains are essentially state machines with probabilities assigned to each of their state transitions.
For example, using the Markov Chain shown above, if it were sunny today, there would be a 70% chance it would be rainy tomorrow, and a 30% chance it would still be sunny. If one were to continue to traverse the chain, they would get a sequence of weather patterns like:
'sunny', 'rainy', 'rainy', 'sunny', ...
If the probabilities were reasonable, the sequence of weather patterns should reflect your 7-day weather forecast!
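Traversing that chain is just repeated weighted sampling. A sketch of the two-state weather chain (the rainy-day probabilities below are made up, since only the sunny row is given above):

```python
import random

# transition table for the two-state chain; the "rainy" row is assumed
transitions = {
    "sunny": (["rainy", "sunny"], [0.7, 0.3]),
    "rainy": (["sunny", "rainy"], [0.5, 0.5]),
}

rng = random.Random(0)
state = "sunny"
forecast = []
for _ in range(7):
    states, weights = transitions[state]
    state = rng.choices(states, weights=weights)[0]
    forecast.append(state)
print(forecast)
```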
The Bot
We can apply this same methodology to generate some tweets! First we need some source material. I had scraped around 110,000 game reviews off Metacritic for another project I was working on.
First we need to split up our corpus into individual tokens. We could split it up at several granularities, like per-character, per-word, or even per-sentence. As with most things, there's a trade-off at both ends. Tokenize by character, and you'll get nonsensical gibberish. Split by sentence, and you're essentially just chaining together pieces from your source material. Let's try the word level for now:
"""
Using NLTK for convenience. You can use split() ¯\_(ツ)_/¯
"""
import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
tokenized_content = tokenizer.tokenize(review_file.read())
Now that we've tokenized our corpus, we need to make a decision on how we want to represent each state. We can use a single word/token as the state:
to, be, or, not, to, be, …
We can make it a bit more complex and do each pair of words:
to be, be or, or not, not to, to be, …
And pushing it even further, triples:
to be or, be or not, or not to, not to be, …
The technical term for this grouping of tokens is n-grams. Picking the correct n will impact the performance of your model. Pick too large an n, and your model will be very biased towards certain sequences. Pick too small an n, and your model will spew out crap, since there's so much variability. Trigrams are usually a good bet for decent-sized corpora, but bigrams perform better for smaller ones.
Now that we've decided on what the states will look like in our Markov Chain, how do we go about representing it in a data structure? One simple way is to use a Hash Table/Dictionary where the keys are states in the Markov Chain, and each key's value is the list of possible transitions, represented by an array of next tokens (which assumes that each transition has a uniform probability).
def train(self):
    for w1, w2, w3 in self.triplets(tokenized_content):
        key = (w1, w2)
        if key in self.model:
            self.model[key].append(w3)
        else:
            self.model[key] = [w3]
Now that we have our Markov Chain ready to go, we'll just need to start off at some random state, and traverse the chain to generate some tweets!
def generate_tweet(self):
    w1, w2 = random.choice(list(self.model.keys()))
    gen_words = []
    tweet_length = 0
    while tweet_length <= 100:
        gen_words.append(w1)
        tweet_length += len(w1) + 1
        w1, w2 = w2, random.choice(self.model[(w1, w2)])
    gen_words.append('#GameReview')
    return reduce(self.join_tokens, gen_words)
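Putting the pieces together, here's a condensed, self-contained version of the same idea on a toy corpus (the corpus and walk length are made up for illustration):

```python
import random

# build a trigram model: (w1, w2) -> list of possible next words
corpus = "to be or not to be that is the question".split()
model = {}
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    model.setdefault((w1, w2), []).append(w3)

# walk the chain from a starting bigram, sampling transitions uniformly
rng = random.Random(0)
w1, w2 = ("to", "be")
words = [w1, w2]
for _ in range(6):
    key = (w1, w2)
    if key not in model:
        break  # dead end: no observed continuation
    w1, w2 = w2, rng.choice(model[key])
    words.append(w2)
print(" ".join(words))
```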
The resulting tweets:
Source can be found here: https://gist.github.com/ben-yu/919e843ac4df8d0fccee
Here's something mind-blowing:
If you take the sum of the first couple of natural numbers you get the triangle numbers:
$$T_{1} = 1$$
$$T_{2} = 1 + 2$$
$$T_{3} = 1 + 2 + 3$$
$$T_{n} = \frac{n(n+1)}{2}$$
So essentially the series 1 + 2 + 3 + 4 + ... is the area of an $n \times n$ triangle.
So we let the series be:
$$ c = 1 + 2 + 3 + 4 + ... $$
By multiplication:
$$ 4c = 4 + 8 + 12 + 16 + ... $$
If we then insert some zeros:
$$ 4c = 0 + 4 + 0 + 8 + 0 + ... $$
Then take the difference between the two we get:
$$ - 3c = 1 - 2 + 3 - 4 + 5 - 6 + 7 ... $$
So what does this new series sum up to? We know the geometric series:
$$ \sum\limits_{n=0}^\infty r^n = 1 + r + r^2 + r^3 + ... = 1/(1-r) $$
Taking the derivative with Quotient Rule on the RHS and power rule on the left:
$$ \sum\limits_{n=0}^\infty nr^{n-1} = 1/(1-r)^2 $$
With r=-1:
$$ 1/(1+1)^2 = \sum\limits_{n=0}^\infty n(-1)^{n-1}$$
$$ 1/4 = 1 - 2 + 3 - 4 + 5 - 6 + 7 ...$$
So from earlier we get:
$$ -3c = 1/4 $$
$$ c = -1/12 $$
So
$$ 1 + 2 + 3 + 4 + 5 + ... = -1/12 $$
Of course, there was something wonky with the derivation we just did. Randomly inserting zeros into a divergent series to line up the subtraction will give inconsistent results if you place them in different locations. There are ways to handle that, by constraining where the zeros get placed using another function.
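One of those more careful approaches is Abel summation: attach a factor $x^{n-1}$ to each term, sum the now-convergent series, then let $x \to 1$. A numerical sketch of the $\frac{1}{4}$ value used above:

```python
# sum n * (-x)**(n-1) converges to 1/(1+x)**2 for |x| < 1;
# as x -> 1 it approaches 1/4, the value assigned to 1 - 2 + 3 - 4 + ...
def abel_sum(x, terms=100000):
    return sum(n * (-x) ** (n - 1) for n in range(1, terms))

print(round(abel_sum(0.999), 3))  # 0.25
```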
Math is weird.
What better place to get sushi in Tokyo, than by going to an actual fish market? By the time we arrived at Tsukiji Fish Market (around 6:00 AM), the lines were already pretty long. Daiwa Sushi (大和) had a significantly shorter wait. We probably only waited for around 20 minutes before we were seated.
We decided on the omakase, which was seven pieces of nigiri and six pieces of maki for ¥3500.
My favourite piece was either the Anago or the Shrimp Head. The quality of the fish was amazing. The Ikura was surprisingly good, since it didn't have the fishy taste you get from so much of the roe imported here in Canada.
Probably not the best sushi you could find in Tokyo, but it was pretty damn close.
]]>An age-old puzzle that's challenged kids and pancake chefs for ages:
The goal of the puzzle is to move an entire stack of disks from one rod/plate to another with the following restrictions:
So how can we go about solving this programmatically? The key insight is that this problem has optimal substructure. This means that the solution for a larger problem, say 4 disks, is composed of the solutions to smaller subproblems: 1, 2, and 3 disks.
def moveTower(height, fromPole, toPole, withPole):
    if height >= 1:
        moveTower(height - 1, fromPole, withPole, toPole)
        moveDisk(fromPole, toPole)
        moveTower(height - 1, withPole, toPole, fromPole)

def moveDisk(fp, tp):
    print("moving disk from", fp, "to", tp)

moveTower(3, "A", "B", "C")
We first move the top n-1 disks from pole A to B by calling moveTower recursively on the stack of n-1 disks. Then we can move disk n from pole A to C. Finally, we make the same recursive call to move the n-1 disks from B to C.
The recursive solution seems a bit expensive, with the two recursive calls leading to a $O(2^n)$ runtime. Can we do a bit better?
Our recursive search right now is solving the same subproblem over and over again. If we saved or memoized the result, we could possibly trade off some of our runtime by using more memory and essentially caching the solutions to each subproblem.
memo = []
lookup = ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"]

def topDown(height, fromPole, toPole, withPole):
    global memo
    memo = [[None for x in range(6)] for y in range(height + 1)]
    for x in range(6):
        memo[0][x] = ""  # base case: a stack of zero disks needs no moves
    return topDownHelper(height, fromPole, toPole, withPole)

def topDownHelper(height, fromPole, toPole, withPole):
    index = lookup.index(fromPole + toPole + withPole)
    if memo[height][index] is None:  # only solve each subproblem once
        memo[height][index] = topDownHelper(height - 1, fromPole, withPole, toPole) \
            + fromPole + '->' + toPole + ', ' \
            + topDownHelper(height - 1, withPole, toPole, fromPole)
    return memo[height][index]

print(topDown(3, "A", "B", "C"))
Another approach would be to intelligently build up our solution, solving for 1 disk, then 2 disks, etc... By going in the opposite direction of our recursive solution, we save time by not solving the same subproblems. This is a technique dubbed dynamic programming. Since we are building up our solution successively, we only need to remember the solutions that are needed immediately. This saves us some valuable space.
moves = ["A->B", "A->C", "B->A", "B->C", "C->A", "C->B"]
prevLookup = [1, 0, 3, 2, 5, 4]
prev2Lookup = [5, 3, 4, 1, 2, 0]

def bottomUp(height, fromPole, toPole, withPole):
    prev = ["" for x in range(6)]
    curr = ["" for x in range(6)]
    for i in range(height):
        for j in range(len(moves)):
            prevStr = prev[prevLookup[j]] + ', ' if prev[prevLookup[j]] != "" else ""
            nextStr = ', ' + prev[prev2Lookup[j]] if prev[prev2Lookup[j]] != "" else ""
            curr[j] = prevStr + moves[j] + nextStr
        curr, prev = prev, curr
    return prev

print(bottomUp(4, "A", "B", "C")[0])
Doing a quick check on each implementation's running time, we see that the recursive solution explodes exponentially, while our memoized/dynamic programming implementations grow linearly. Clearly dynamic programming is the way to go!
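To sanity-check the exponential claim, note that the minimum number of moves satisfies the recurrence $T(n) = 2T(n-1) + 1$, which works out to $2^n - 1$. A minimal sketch that counts moves instead of printing them:

```python
def count_moves(height):
    # Same shape as moveTower: two recursive calls plus one move of disk n
    if height == 0:
        return 0
    return 2 * count_moves(height - 1) + 1

for n in [3, 10, 20]:
    print(n, count_moves(n))  # 7, 1023, 1048575 -- i.e. 2**n - 1
```

Doubling the move count with every extra disk is exactly the $O(2^n)$ blow-up described above.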
]]>The dragon curve or Heighway Dragon is a beautiful self-similar fractal first investigated by NASA physicists John Heighway, Bruce Banks, and William Harter.
Jack [Heighway] came into my office (actually cubicle) and said that if you folded a one dollar bill repeatedly he thought it would make a random walk or something like that. (We’d been arguing about something in Feller’s book on return chances.) I was dubious but said “Let’s check it out with a big piece of paper.” (Those were the days when NASA could easily afford more than one dollar’s worth of paper.) Well, it made a funny pattern alright but we couldn’t really see it too clearly. So one of us thought to use tracing paper and “unfold” it indefinitely so we could record (tediously) as big a pattern as we wanted. But each time we made the next order, it just begged us to make one more!
As they folded and unfolded, a pattern emerged:
When you make a single fold, you get a single right turn:
Right
After a second fold, you would get:
Right Right Left
And the third:
Right Right Left Right Right Left Left
Fourth:
Right Right Left Right Right Left Left Right Right Right Left Left Right Left Left
The sequence of turns actually follows a specific pattern:
$$S_{n+1} = S_n R \bar{S_n}$$
where $\bar{S}$ is the same sequence reversed in direction, with L's replaced by R's and vice-versa. The $(n+1)$th fold appends the $R$, and the subsequent folds repeat the same sequence, reversed in order and direction.
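A quick way to convince yourself of the recurrence is to generate the folds in code and compare against the sequences listed above (a minimal sketch using L/R strings, rather than the animation's 0/1 encoding):

```python
def next_fold(seq):
    # S_{n+1} = S_n, then R, then S_n reversed with L/R swapped
    flipped = ["L" if turn == "R" else "R" for turn in reversed(seq)]
    return seq + ["R"] + flipped

s = ["R"]                 # first fold: a single right turn
for _ in range(2):
    s = next_fold(s)
print(" ".join(s))        # R R L R R L L -- matches the third fold above
```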
The animation on the sidebar uses this sequence to iteratively trace out the curve:
function dragonCurveIter(seq, iterations) {
  // 0 - L, 1 - R
  for (var i = 0; i < iterations; i++) {
    // reverse() mutates in place: build the reversed, flipped copy first
    var a = seq.reverse().map(function(x) { return Number(!x); });
    seq.reverse().push(1); // undo the reversal, then append the new R
    seq = seq.concat(a);
  }
  return seq;
}
Interestingly enough, this recursive definition can also be interpreted geometrically. Appending $R\bar{S_n}$ is equivalent to having the original curve rotated by 90 degrees about the endpoint of the curve.
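That geometric reading is easy to sketch with complex numbers, where multiplying by `1j` rotates a point 90 degrees about the origin (a hypothetical sketch; rotating the opposite way just produces the mirror-image dragon):

```python
def iterate(points):
    # Append a copy of the curve rotated 90 degrees about its endpoint,
    # traversed in reverse so the path stays continuous.
    end = points[-1]
    return points + [end + (p - end) * 1j for p in reversed(points[:-1])]

curve = [0 + 0j, 1 + 0j]   # start from a single unit segment
for _ in range(3):
    curve = iterate(curve)
print(len(curve))          # 9 points: each iteration doubles the segment count
```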
]]>MTA provides a public dataset of turnstile data per station since May 2010. We can grab weather data from Weather Underground, who kindly provides historical weather data in CSV format. You can also grab the same data through their API.
The data we collected so far isn't usable yet, so we'll have to do a bit of data munging.
The MTA data has multiple entries per row, so we should first unwind the data so we get an entry per row:
import csv
import os

with open("masterfile.csv", 'wb') as outfile:
    outWriter = csv.writer(outfile)
    outfile.write('C/A,UNIT,SCP,DATEn,TIMEn,\
DESCn,ENTRIESn,EXITSn\n')  # column names
    for name in os.listdir('./turnstile'):
        with open('./turnstile/' + name, 'rb') as infile:
            inReader = csv.reader(infile)
            for row in inReader:
                # each input row holds several 5-field readings after 3 key columns
                for i in xrange(3, len(row), 5):
                    outWriter.writerow(row[0:3] + row[i:i + 5])
For simple manipulations, the csv
module will usually suffice. However, more complicated operations, such as calculating aggregates or dealing with datatypes like DateTimes, can be difficult and tedious. This is where pandas
saves the day!
import pandas
import datetime

def reformat_weather_dates(date):
    return datetime.datetime.strptime(date, '%Y-%m-%d').strftime('%Y-%m-%d')

def reformat_subway_dates(date):
    return datetime.datetime.strptime(date, '%m-%d-%y').strftime('%Y-%m-%d')

df = pandas.read_csv('masterfile.csv')
dfweather = pandas.read_csv('data/nyc052013.csv')

df = df[df['DESCn'] == 'REGULAR']  # filter by REGULAR
df['ENTRIESn_hourly'] = (df['ENTRIESn'] - df['ENTRIESn'].shift(1)).fillna(1)  # calculate per-interval entries

dfweather.rename(columns={'EDT': 'DATEn'}, inplace=True)  # rename columns so the frames can merge
df['DATEn'] = df['DATEn'].map(reformat_subway_dates)      # reformat so dates match
dfweather['DATEn'] = dfweather['DATEn'].map(reformat_weather_dates)

final = pandas.merge(df, dfweather, on='DATEn')
final.to_csv('final_master.csv')
We can easily filter by specific categories. For example, we'll only want turnstile data from the category of 'Regular':
df = df[df['DESCn'] == 'REGULAR'] # Filter by REGULAR
This grabs all the indices of rows that match 'REGULAR', then regrabs the rows from the original frame.
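The same boolean-mask pattern on a toy frame (a hypothetical mini-dataset, just to show the mechanics):

```python
import pandas as pd

# A tiny stand-in for the turnstile frame
df = pd.DataFrame({'DESCn': ['REGULAR', 'RECOVR AUD', 'REGULAR'],
                   'ENTRIESn': [100, 105, 120]})
mask = df['DESCn'] == 'REGULAR'        # boolean Series, one True/False per row
print(df[mask]['ENTRIESn'].tolist())   # [100, 120]
```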
# Calculate daily entries
df['ENTRIESn_hourly'] = (df['ENTRIESn'] - df['ENTRIESn'].shift(1)).fillna(1)
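ENTRIESn is a cumulative counter, so the shift-and-subtract above turns it into per-interval counts. On a toy series (hypothetical values):

```python
import pandas as pd

s = pd.Series([100, 110, 125, 160])  # cumulative turnstile counter
hourly = (s - s.shift(1)).fillna(1)  # difference with the previous row; first row has no predecessor
print(hourly.tolist())               # [1.0, 10.0, 15.0, 35.0]
```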
One way to get a sense of our data is to visualize it. Here, we'll try out ggplot, a port of the popular R graphing library.
Entries seem to peak during specific two-hour windows. This seems to correspond to peak operating times, like rush hour (8-9am and 4-5pm), lunch, and dinner.
We can also look at comparing ridership relative to weather, like how many people exit stations relative to the average dew point.
import pandas
from pandas import *
from ggplot import *

df = pandas.read_csv('./turnstile_data_master_with_weather.csv')
# convert mean dew point from Fahrenheit to Celsius
df['meandewpti'] = df['meandewpti'].map(lambda x: round((x - 32.0) * (5.0 / 9.0), 0))

daily = df.groupby(df.meandewpti).EXITSn_hourly.sum()
daily.index.name = 'day'
daily = daily.reset_index()

p = ggplot(daily, aes('day', weight='EXITSn_hourly', alpha=0.5)) + \
    geom_bar(fill="green") + \
    theme_xkcd() + \
    ggtitle("May 2011 - Turnstile Exits by Dew Point") + \
    xlab("Degrees Celsius") + \
    ylab("# of Exits")
print p
I don't think we can infer much from the plot, except that the most popular days in May had a dew point around 14-16°C.
]]>Inspired by this impressive piece of engineering, I tried to write my own quine in CoffeeScript.
The trick I used was to encode the entire file into decimal ASCII and store it in the data array. To rebuild the actual code, you convert it back to characters. This lets you store the code, and run it too!
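The char-code array isn't the only way to pull this off. A minimal sketch of the same self-replication idea in Python, using %r string formatting instead of an ASCII data array:

```python
# A two-line quine: the string holds a template of the program,
# and %r re-embeds the string (with quotes and escapes) into itself.
s = 's = %r\nprint(s %% s)'
print(s % s)  # prints the program's own two lines
```

The %r conversion plays the same role as decoding the ASCII array: it regenerates the data portion of the source verbatim, quotes and all.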